> Hi Anthony,
> Did you decide that it's not a feature to be implemented?
That isn't up to me.

> I'm asking about this so I can offer options here.
>
> I wouldn't be comfortable enabling "mon_allow_pool_size_one" on a specific
> pool.
>
> It would be better if this feature could write the replicas at a later time
> on the selected pool.
> Thanks.
> Rafael.
>
> From: "Anthony D'Atri" <anthony.da...@gmail.com>
> Sent: 2024/02/01 15:00:59
> To: quag...@bol.com.br
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: Performance improvement suggestion
>
> I'd totally defer to the RADOS folks.
>
> One issue might be adding a separate code path, which can have all sorts of
> problems.
>
> On Feb 1, 2024, at 12:53, quag...@bol.com.br wrote:
> >
> > Ok Anthony,
> >
> > I understood what you said. I also believe in all the professional history
> > and experience you have.
> >
> > Anyway, could there be a configuration flag to make this happen? Like the
> > ones that already exist, such as "--yes-i-really-mean-it"?
> >
> > This way, the default storage behavior would remain as it is, but it would
> > also allow situations like the one I mentioned to be possible.
> >
> > This would permit some rules to be relaxed (even if they are not OK by
> > default). Likewise, there are already mechanisms like lazyio that make
> > exceptions to the standard procedures.
> > To be clear: it's just a suggestion. If this type of functionality is not
> > interesting, that's ok.
> >
> > Rafael.
> >
> > From: "Anthony D'Atri" <anthony.da...@gmail.com>
> > Sent: 2024/02/01 12:10:30
> > To: quag...@bol.com.br
> > Cc: ceph-users@ceph.io
> > Subject: [ceph-users] Re: Performance improvement suggestion
> >
> > > I didn't say I would accept the risk of losing data.
> >
> > That's implicit in what you suggest, though.
> >
> > > I just said that it would be interesting if the objects were first
> > > recorded only in the primary OSD.
> >
> > What happens when that host / drive smokes before it can replicate? What
> > happens if a secondary OSD gets a read op before the primary updates it?
> > Swift object storage users have to code around this potential. It's a
> > non-starter for block storage.
> >
> > This is similar to why RoC HBAs (which are a badly outdated thing to begin
> > with) will only enter writeback mode if they have a BBU / supercap -- and
> > of course only if their firmware and hardware aren't pervasively buggy.
> > Guess how I know this?
> >
> > > This way it would greatly increase performance (both for IOPS and
> > > throughput).
> >
> > It might increase low-QD IOPS for a single client on slow media with
> > certain networking. Depending on media, it wouldn't increase throughput.
> >
> > Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x
> > the network resources between the client and the servers.
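(For illustration only: a rough conceptual sketch, in Python, of the two write
paths being debated here. This is not Ceph code and does not reflect the real
RADOS write path; every function and name in it is made up.)

    import concurrent.futures

    def write_to_osd(osd: str, data: bytes) -> None:
        """Stand-in for persisting one replica of an object on one OSD."""
        ...  # network round trip + media latency would happen here

    def write_sync(osds: list[str], data: bytes) -> None:
        """Current model: the client ack waits until every replica is durable."""
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [pool.submit(write_to_osd, osd, data) for osd in osds]
            for f in futures:
                f.result()  # block until ALL replicas report the write complete
        # Only now is the client acknowledged: data is durable on size=N OSDs.

    def write_primary_then_background(osds: list[str], data: bytes) -> None:
        """Proposed model: ack after the primary only, replicate in the background."""
        primary, *secondaries = osds
        write_to_osd(primary, data)
        # The client would be acknowledged here, while the object exists on ONE
        # device only. If that device dies, or a secondary serves a read, before
        # the background writes below finish, the objections raised above apply.
        pool = concurrent.futures.ThreadPoolExecutor()
        for osd in secondaries:
            pool.submit(write_to_osd, osd, data)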
> > > Later (in the background), record the replicas. This situation would
> > > avoid leaving users/software waiting for the write response from all
> > > replicas when the storage is overloaded.
> >
> > If one makes the mistake of using HDDs, they're going to be overloaded no
> > matter how one slices and dices the ops. Ya just canna squeeze IOPS from a
> > stone. Throughput is going to be limited by the SATA interface and seeking
> > no matter what.
> >
> > > Where I work, performance is very important and we don't have the money
> > > to build an entire cluster only with NVMe.
> >
> > If there isn't money, then it isn't very important. But as I've written
> > before, NVMe clusters *do not cost appreciably more than spinners* unless
> > your procurement processes are bad. In fact they can cost significantly
> > less. This is especially true with object storage and archival, where one
> > can leverage QLC.
> >
> > * Buy generic drives from a VAR, not channel drives through a chassis
> > brand. Far less markup, and moreover you get the full 5-year warranty, not
> > just 3 years. And you can painlessly RMA drives yourself - you don't have
> > to spend hours going back and forth with $chassisvendor's TAC arguing about
> > every single RMA. I've found that this is so bad that it is more economical
> > to just throw away a failed component worth < USD 500 than to RMA it. Do
> > you pay for extended warranty / support? That's expensive too.
> >
> > * Certain chassis brands who shall remain nameless push RoC HBAs hard with
> > extreme markups. List prices run as high as USD 2000. Per server, eschewing
> > those abominations makes up for a lot of the drive-only unit economics.
> >
> > * But this is the part that lots of people don't get: you don't just stack
> > up the drives on a desk and use them. They go into *servers* that cost
> > money and *racks* that cost money. They take *power* that costs money.
> >
> > * $ / IOPS are FAR better for ANY SSD than for HDDs.
> >
> > * RUs cost money, and so do chassis and switches.
> >
> > * Drive failures cost money.
> >
> > * So does having your people and applications twiddle their thumbs waiting
> > for stuff to happen. I worked for a supercomputer company that put
> > low-memory, low-end diskless workstations on engineers' desks. They spent
> > lots of time doing nothing, waiting for their applications to respond. This
> > company no longer exists.
> >
> > * So does the risk of taking *weeks* to heal from a drive failure.
> >
> > Punch honest numbers into https://www.snia.org/forums/cmsi/programs/TCOcalc
> >
> > I walked through this with a certain global company. QLC SSDs were
> > demonstrated to have something like 30% lower TCO than spinners. Part of
> > the equation is that they were accustomed to limiting HDD size to 8 TB
> > because of the bottlenecks, and thus requiring more servers, more switch
> > ports, more DC racks, more rack/stack time, more administrative overhead.
> > You can fit 1.9 PB of raw SSD capacity in a 1U server. That same RU will
> > hold at most 88 TB of the largest spinners you can get today. 22 TIMES the
> > density. And since many applications can barely tolerate the spinner
> > bottlenecks, capping spinner size at even 10 TB makes that more like 40
> > TIMES better density with SSDs.
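(The density arithmetic behind those 22x and 40x figures, using the numbers
quoted above; the per-RU drive count is my own assumption, for illustration
only.)

    # Numbers taken from the message above; 4 LFF bays per 1U is an assumption.
    ssd_tb_per_ru = 1900            # "1.9 PB of raw SSD capacity in a 1U server"
    hdd_tb_per_ru = 88              # "at most 88 TB" per RU, e.g. 4 x 22 TB spinners
    hdd_tb_per_ru_capped = 4 * 10   # same bays with spinner size capped at 10 TB

    print(ssd_tb_per_ru / hdd_tb_per_ru)         # ~21.6x -> "22 TIMES the density"
    print(ssd_tb_per_ru / hdd_tb_per_ru_capped)  # ~47.5x -> same ballpark as "40 TIMES"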
> > > However, I don't think it's interesting to lose the functionality of the
> > > replicas. I'm just suggesting another way to increase performance without
> > > losing that functionality.
> > >
> > > Rafael.
> > >
> > > From: "Anthony D'Atri" <anthony.da...@gmail.com>
> > > Sent: 2024/01/31 17:04:08
> > > To: quag...@bol.com.br
> > > Cc: ceph-users@ceph.io
> > > Subject: Re: [ceph-users] Performance improvement suggestion
> > >
> > > Would you be willing to accept the risk of data loss?
> > >
> > >> On Jan 31, 2024, at 2:48 PM, quag...@bol.com.br wrote:
> > >>
> > >> Hello everybody,
> > >> I would like to make a suggestion for improving performance in the Ceph
> > >> architecture. I don't know if this group is the best place for it, or
> > >> whether my proposal is correct.
> > >>
> > >> My suggestion concerns the page
> > >> https://docs.ceph.com/en/latest/architecture/, at the end of the topic
> > >> "Smart Daemons Enable Hyperscale".
> > >>
> > >> The client needs to "wait" for the configured number of replicas to be
> > >> written (so that the client receives an ok and continues). This way, if
> > >> there is slowness on any of the disks on which the PG will be updated,
> > >> the client is left waiting.
> > >>
> > >> It would be possible to:
> > >>
> > >> 1-) Only record on the primary OSD.
> > >> 2-) Write the other replicas in the background (in the same way as when
> > >> an OSD fails and the PG is "degraded").
> > >>
> > >> This way, the client gets a faster response when writing to storage,
> > >> improving latency and performance (throughput and IOPS).
> > >>
> > >> I would find it plausible to accept a period of time (seconds) until all
> > >> replicas are ok (written asynchronously) at the expense of improving
> > >> performance.
> > >>
> > >> Could you evaluate this scenario?
> > >>
> > >> Rafael.
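(A purely hypothetical illustration of the motivation in the proposal above;
the latency numbers are invented. With synchronous replication the client-visible
write latency is governed by the slowest copy, whereas acknowledging after the
primary alone would hide a slow replica from the client, at the durability cost
discussed earlier in the thread.)

    # Invented latencies, for illustration only.
    primary_ms = 5.0
    replica_ms = [6.0, 120.0]   # one replica sits on a momentarily slow disk

    sync_ack_ms = max(primary_ms, *replica_ms)   # wait for all copies -> 120.0 ms
    primary_only_ack_ms = primary_ms             # ack after primary   ->   5.0 ms
    print(sync_ack_ms, primary_only_ack_ms)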