[ceph-users] Re: Performance improvement suggestion
On 3/4/24 08:40, Maged Mokhtar wrote: On 04/03/2024 15:37, Frank Schilder wrote: Fast write enabled would mean that the primary OSD sends #size copies to the entire active set (including itself) in parallel and sends an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests started piling up, which will show as a slow requests warning). What happens if an error occurs on the slowest OSD after the min_size ACK has already been sent to the client? This should not be different from what exists today... unless, of course, the error happens on the local/primary OSD. Can this be addressed with reasonable effort? I don't expect this to be a quick fix and it should be tested. However, beating the tail-latency statistics with the extra redundancy should be worth it. I observe fluctuations of latencies; OSDs become randomly slow for whatever reason for short time intervals and then return to normal. A reason for this could be DB compaction. I think latency tends to spike during compaction. A fast-write option would effectively remove the impact of this. Best regards and thanks for considering this! I think this is something the RADOS devs need to weigh in on. It does sound worth investigating. It is not just for cases with DB compaction but, more importantly, for the normal (happy) IO path, where it will have the most impact. Typically an L0->L1 compaction will have two primary effects: 1) It will cause heavy IO read/write traffic to the disk, potentially impacting other IO taking place if the disk is already saturated. 2) It will block memtable flushes until the compaction finishes. This means that more and more data will accumulate in the memtables/WAL, which can trigger throttling and eventually stalls if you run out of buffer space.
By default, we allow up to 1GB of writes to WAL/memtables before writes are fully stalled, but RocksDB will typically throttle writes before you get to that point. It's possible a larger buffer may allow you to absorb traffic spikes for longer at the expense of more disk and memory usage. Ultimately though, if you are hitting throttling, it means that the DB can't keep up with the WAL ingestion rate. Mark ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io -- Best Regards, Mark Nelson Head of Research and Development Clyso GmbH p: +49 89 21552391 12 | a: Minnesota, USA w: https://clyso.com | e: mark.nel...@clyso.com We are hiring: https://www.clyso.com/jobs/
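[Editorial note for readers following along: Mark's 1GB figure lines up with BlueStore's default RocksDB write-buffer settings, four 256MiB memtables. A hedged sketch of how those buffers could be enlarged follows; the option names are real BlueStore/RocksDB settings, but the values are illustrative only, and `bluestore_rocksdb_options` replaces the entire default option string, so in practice start from your cluster's current value (`ceph config get osd bluestore_rocksdb_options`).]

```ini
[osd]
# Illustrative only: 8 x 256MiB memtables = 2GiB of buffering before writes
# stall, up from the ~1GiB default (4 x 256MiB). A larger buffer absorbs
# longer compaction spikes but costs memory, and the eventual flushes get
# bigger; it does not help if sustained ingest exceeds what the DB can flush.
bluestore_rocksdb_options = write_buffer_size=268435456,max_write_buffer_number=8,min_write_buffer_number_to_merge=2
```

On recent releases, `bluestore_rocksdb_options_annex` can append options to the defaults instead of replacing the whole string.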
[ceph-users] Re: Performance improvement suggestion
On 04/03/2024 15:37, Frank Schilder wrote: Fast write enabled would mean that the primary OSD sends #size copies to the entire active set (including itself) in parallel and sends an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests started piling up, which will show as a slow requests warning). What happens if an error occurs on the slowest OSD after the min_size ACK has already been sent to the client? This should not be different from what exists today... unless, of course, the error happens on the local/primary OSD. Can this be addressed with reasonable effort? I don't expect this to be a quick fix and it should be tested. However, beating the tail-latency statistics with the extra redundancy should be worth it. I observe fluctuations of latencies; OSDs become randomly slow for whatever reason for short time intervals and then return to normal. A reason for this could be DB compaction. I think latency tends to spike during compaction. A fast-write option would effectively remove the impact of this. Best regards and thanks for considering this! I think this is something the RADOS devs need to weigh in on. It does sound worth investigating. It is not just for cases with DB compaction but, more importantly, for the normal (happy) IO path, where it will have the most impact.
[ceph-users] Re: Performance improvement suggestion
>>> Fast write enabled would mean that the primary OSD sends #size copies to the entire active set (including itself) in parallel and sends an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests started piling up, which will show as a slow requests warning).
>>
>> What happens if an error occurs on the slowest OSD after the min_size ACK has already been sent to the client?
>>
> This should not be different from what exists today... unless, of course, the error happens on the local/primary OSD

Can this be addressed with reasonable effort? I don't expect this to be a quick fix and it should be tested. However, beating the tail-latency statistics with the extra redundancy should be worth it. I observe fluctuations of latencies; OSDs become randomly slow for whatever reason for short time intervals and then return to normal. A reason for this could be DB compaction. I think latency tends to spike during compaction. A fast-write option would effectively remove the impact of this. Best regards and thanks for considering this!
[ceph-users] Re: Performance improvement suggestion
On 04/03/2024 13:35, Marc wrote: Fast write enabled would mean that the primary OSD sends #size copies to the entire active set (including itself) in parallel and sends an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests started piling up, which will show as a slow requests warning). What happens if an error occurs on the slowest OSD after the min_size ACK has already been sent to the client? This should not be different from what exists today... unless, of course, the error happens on the local/primary OSD.
[ceph-users] Re: Performance improvement suggestion
> Fast write enabled would mean that the primary OSD sends #size copies to the entire active set (including itself) in parallel and sends an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests started piling up, which will show as a slow requests warning).

What happens if an error occurs on the slowest OSD after the min_size ACK has already been sent to the client?
[ceph-users] Re: Performance improvement suggestion
Hi all, coming late to the party, but I want to chip in as well with some experience. The problem of tail latencies of individual OSDs is a real pain for any redundant storage system. However, there is a way to deal with this in an elegant way when using large replication factors. The idea is to use the counterpart of the "fast read" option that exists for EC pools and: 1) make this option available to replicated pools as well (it is on the road map as far as I know), but also 2) implement an option "fast write" for all pool types. Fast write enabled would mean that the primary OSD sends #size copies to the entire active set (including itself) in parallel and sends an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests started piling up, which will show as a slow requests warning). I have fast read enabled on all EC pools. This does increase the cluster-internal network traffic, which is nowadays absolutely no problem (in the good old 1G times it potentially would have been). In return, the read latencies on the client side are lower and much more predictable. In effect, the user experience improved dramatically. I would really wish that such an option gets added, as we use wide replication profiles (rep-(4,2) and EC(8+3), each with 2 "spare" OSDs), and exploiting large replication factors (more precisely, large (size-min_size)) to mitigate the impact of slow OSDs would be awesome. It would also add some incentive to stop the ridiculous size=2 min_size=1 habit, because one gets an extra gain from replication on top of redundancy.
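[Editorial note for readers following along: the fast-write semantics described above can be sketched as a quorum-ACK pattern. The following toy simulation uses plain asyncio and is not Ceph code; the OSD latencies, function names, and sizes are made up for illustration. It shows how acknowledging the client after min_size of the size parallel replica writes hides one slow OSD.]

```python
import asyncio

# Toy simulation of the proposed "fast write": the primary dispatches the
# write to all `size` replicas in parallel and ACKs the client once
# `min_size` commits have arrived; stragglers finish in the background.

async def replica_write(osd_id: int, latency: float) -> int:
    await asyncio.sleep(latency)  # stand-in for this OSD's commit latency
    return osd_id

async def fast_write(latencies: list[float], min_size: int) -> list[int]:
    tasks = [asyncio.create_task(replica_write(i, lat))
             for i, lat in enumerate(latencies)]
    acked: list[int] = []
    for fut in asyncio.as_completed(tasks):
        acked.append(await fut)
        if len(acked) >= min_size:
            break  # client ACK happens here, before the slowest OSD commits
    await asyncio.gather(*tasks)  # background completion, not client-visible
    return acked

async def main() -> None:
    # size=4 replicas; OSD 3 is pathologically slow (e.g. mid-compaction)
    acked = await fast_write([0.001, 0.002, 0.0015, 0.2], min_size=2)
    print(len(acked))  # prints 2: the client waited for min_size commits only

asyncio.run(main())
```

With size=4 and min_size=2, client-visible latency is governed by the second-fastest OSD rather than the slowest, which is exactly the tail-latency win described above; the trade-off is that the client's ACK no longer implies full durability on all replicas.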
In the long run, the Ceph write path should try to deal with a-priori known different-latency connections (fast local ACK with async remote completion, which was asked for a couple of times), for example, for stretched clusters where one has an internal connection for the local part and external connections for the remote parts. It would be great to have similar ways of mitigating some penalties of the slow write paths to remote sites. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Peter Grandi Sent: Wednesday, February 21, 2024 1:10 PM To: list Linux fs Ceph Subject: [ceph-users] Re: Performance improvement suggestion > 1. Write object A from client. > 2. Fsync to primary device completes. > 3. Ack to client. > 4. Writes sent to replicas. [...] As mentioned in the discussion, this proposal is the opposite of the current policy, which is to wait for all replicas to be written before writes are acknowledged to the client: https://github.com/ceph/ceph/blob/main/doc/architecture.rst "After identifying the target placement group, the client writes the object to the identified placement group's primary OSD. The primary OSD then [...] confirms that the object was stored successfully in the secondary and tertiary OSDs, and reports to the client that the object was stored successfully." A more revolutionary option would be for 'librados' to write in parallel to all the "active set" OSDs and report this to the primary, but that would greatly increase client-Ceph traffic, while the current logic increases traffic only among OSDs. > So I think that to maintain any semblance of reliability, > you'd need to at least wait for a commit ack from the first > replica (i.e. min_size=2). Perhaps it could be similar to 'k'+'m' for EC, that is 'k' synchronous (the write completes to the client only when at least 'k' replicas, including the primary, have been committed) and 'm' asynchronous, instead of 'k' being just 1 or 2.
[ceph-users] Re: Performance improvement suggestion
> 1. Write object A from client. > 2. Fsync to primary device completes. > 3. Ack to client. > 4. Writes sent to replicas. [...] As mentioned in the discussion, this proposal is the opposite of the current policy, which is to wait for all replicas to be written before writes are acknowledged to the client: https://github.com/ceph/ceph/blob/main/doc/architecture.rst "After identifying the target placement group, the client writes the object to the identified placement group's primary OSD. The primary OSD then [...] confirms that the object was stored successfully in the secondary and tertiary OSDs, and reports to the client that the object was stored successfully." A more revolutionary option would be for 'librados' to write in parallel to all the "active set" OSDs and report this to the primary, but that would greatly increase client-Ceph traffic, while the current logic increases traffic only among OSDs. > So I think that to maintain any semblance of reliability, > you'd need to at least wait for a commit ack from the first > replica (i.e. min_size=2). Perhaps it could be similar to 'k'+'m' for EC, that is 'k' synchronous (the write completes to the client only when at least 'k' replicas, including the primary, have been committed) and 'm' asynchronous, instead of 'k' being just 1 or 2.
[ceph-users] Re: Performance improvement suggestion
Hi, I just want to echo what the others are saying. Keep in mind that RADOS needs to guarantee read-after-write consistency for the higher-level apps to work (RBD, RGW, CephFS). If you corrupt VM block devices, S3 objects or bucket metadata/indexes, or CephFS metadata, you're going to suffer some long days and nights recovering. Anyway, I think that what you proposed has at best a similar reliability to min_size=1. And note that min_size=1 is strongly discouraged because of the very high likelihood that a device/network/power failure turns into a visible outage. In short: your idea would turn every OSD into a SPoF. How would you handle this very common scenario: a power outage followed by at least one device failing to start afterwards?

1. Write object A from client.
2. Fsync to primary device completes.
3. Ack to client.
4. Writes sent to replicas.
5. Cluster-wide power outage (before replicas committed).
6. Power restored, but the primary OSD does not start (e.g. permanent HDD failure).
7. Client tries to read object A.

Today, with min_size=1 such a scenario manifests as data loss: you get either a down PG (with many, many objects offline and IO blocked until you manually decide which data-loss mode to accept) or unfound objects (with IO blocked until you accept data loss). With min_size=2 the likelihood of data loss is dramatically reduced. Another thing about that power-loss scenario is that all dirty PGs would need to be recovered when the cluster reboots. You'd lose all the writes in transit and have to replay them from the primary's pg_log, or backfill if the pg_log was too short. Again, any failure during that recovery would lead to data loss. So I think that to maintain any semblance of reliability, you'd need to at least wait for a commit ack from the first replica (i.e. min_size=2). But since the replica writes are dispatched in parallel, your speedup would evaporate.
Another thing: I suspect this idea would result in many inconsistencies from transient issues. You'd need to ramp up the number of parallel deep-scrubs to look for those inconsistencies quickly, which would also work against any potential speedup. Cheers, Dan -- Dan van der Ster CTO Clyso GmbH w: https://clyso.com | e: dan.vanders...@clyso.com Try our Ceph Analyzer!: https://analyzer.clyso.com/ We are hiring: https://www.clyso.com/jobs/ On Wed, Jan 31, 2024, 11:49 quag...@bol.com.br wrote: > Hello everybody, > I would like to make a suggestion for improving performance in the Ceph > architecture. > I don't know if this group is the best place or if my proposal > is correct. > > My suggestion concerns > https://docs.ceph.com/en/latest/architecture/, at the end of the topic > "Smart Daemons Enable Hyperscale". > > The client needs to "wait" for the configured number of replicas to > be written (so that the client receives an ok and continues). This way, if > there is slowness on any of the disks on which the PG will be updated, the > client is left waiting. > > It would be possible to: > > 1-) Only record on the primary OSD. > 2-) Write the other replicas in the background (the same way as when an > OSD fails: "degraded"). > > This way, the client gets a faster response when writing to storage, > improving latency and performance (throughput and IOPS). > > I would find it plausible to accept a period of time (seconds) until > all replicas are ok (written asynchronously) at the expense of improving > performance. > > Could you evaluate this scenario? > > > Rafael.
[ceph-users] Re: Performance improvement suggestion
I would be against such an option, because it introduces a significant risk of data loss. Ceph has made a name for itself as a very reliable system, where almost no one lost data, no matter how bad a decision they made with architecture and design. This is what you pay for in commercial systems, to "not be allowed a bad choice", and this is what everyone gets with Ceph for free (if they so choose). Allowing a change like this would likely be the beginning of the end of Ceph. It is a bad idea in the extreme. Ceph reliability should never be compromised. There are other options for storage that are robust and do not require as much investment. Use ZFS, with NFS if needed. Use bcache/flashcache, or something similar on the client side. Use proper RAM caching in databases and applications. -- Alex Gorbachev Intelligent Systems Services Inc. STORCIUM On Tue, Feb 20, 2024 at 3:04 PM Anthony D'Atri wrote: > > > > Hi Anthony, > > Did you decide that it's not a feature to be implemented? > > That isn't up to me. > > > I'm asking about this so I can offer options here. > > > > I'd not be comfortable enabling "mon_allow_pool_size_one" on a > specific pool. > > > > It would be better if this feature could make a replica at a later time > on a selected pool. > > Thanks. > > Rafael. > > > > > > > > From: "Anthony D'Atri" > > Sent: 2024/02/01 15:00:59 > > To: quag...@bol.com.br > > Cc: ceph-users@ceph.io > > Subject: [ceph-users] Re: Performance improvement suggestion > > > > I'd totally defer to the RADOS folks. > > > > One issue might be adding a separate code path, which can have all sorts > of problems. > > > > > On Feb 1, 2024, at 12:53, quag...@bol.com.br wrote: > > > > > > > > > > > > Ok Anthony, > > > > > > I understood what you said. I also believe in all the professional > history and experience you have. > > > > > > Anyway, could there be a configuration flag to make this happen? > > > > > > As well as those that already exist: "--yes-i-really-mean-it".
> > > > > > This way, the storage pattern would remain as it is. However, it would > allow situations like the one I mentioned to be possible. > > > > > > This situation will permit some rules to be relaxed (even if they are > not ok at first). > > > Likewise, there are already situations like lazyio that make some > exceptions to standard procedures. > > > Remembering: it's just a suggestion. > > > If this type of functionality is not interesting, it is ok. > > > > > > > > > > > > Rafael. > > > > > > > > > From: "Anthony D'Atri" > > > Sent: 2024/02/01 12:10:30 > > > To: quag...@bol.com.br > > > Cc: ceph-users@ceph.io > > > Subject: [ceph-users] Re: Performance improvement suggestion > > > > > > > > > > > > > I didn't say I would accept the risk of losing data. > > > > > > That's implicit in what you suggest, though. > > > > > > > I just said that it would be interesting if the objects were first > recorded only in the primary OSD. > > > > > > What happens when that host / drive smokes before it can replicate? > What happens if a secondary OSD gets a read op before the primary updates > it? Swift object storage users have to code around this potential. It's a > non-starter for block storage. > > > > > > This is similar to why RoC HBAs (which are a badly outdated thing to > begin with) will only enter writeback mode if they have a BBU / supercap -- > and of course if their firmware and hardware isn't pervasively buggy. Guess > how I know this? > > > > > > > This way it would greatly increase performance (both for iops and > throughput). > > > > > > It might increase low-QD IOPS for a single client on slow media with > certain networking. Depending on media, it wouldn't increase throughput. > > > > > > Consider QEMU drive-mirror. If you're doing RF=3 replication, you use > 3x the network resources between the client and the servers. > > > > > > > Later (in the background), record the replicas. This situation would > avoid leaving users/software waiting for the recording response from all > replicas when the storage is overloaded. > > > > > > If one makes the mistake of using HDDs, they're going to be overloaded > no matter how one slices and dices the ops. Ya just canna squeeze IOPS from > a stone. Throughput is going to be limited by the SATA interface and seeking no matter what.
[ceph-users] Re: Performance improvement suggestion
Cache tiering is deprecated. > On Feb 20, 2024, at 17:03, Özkan Göksu wrote: > > Hello. > > I didn't test it personally, but what about a rep 1 write cache pool with nvme > backed by another rep 2 pool? > > In theory, it has exactly the potential you are looking for. > > > On Thu, Feb 1, 2024 at 20:54, quag...@bol.com.br wrote: > >> >> >> Ok Anthony, >> >> I understood what you said. I also believe in all the professional history >> and experience you have. >> >> Anyway, could there be a configuration flag to make this happen? >> >> As well as those that already exist: "--yes-i-really-mean-it". >> >> This way, the storage pattern would remain as it is. However, it would >> allow situations like the one I mentioned to be possible. >> >> This situation will permit some rules to be relaxed (even if they are not >> ok at first). >> Likewise, there are already situations like lazyio that make some >> exceptions to standard procedures. >> >> >> Remembering: it's just a suggestion. >> If this type of functionality is not interesting, it is ok. >> >> >> Rafael. >> >> -- >> >> *From: *"Anthony D'Atri" >> *Sent: *2024/02/01 12:10:30 >> *To: *quag...@bol.com.br >> *Cc: * ceph-users@ceph.io >> *Subject: * [ceph-users] Re: Performance improvement suggestion >> >> >> >>> I didn't say I would accept the risk of losing data. >> >> That's implicit in what you suggest, though. >> >>> I just said that it would be interesting if the objects were first >> recorded only in the primary OSD. >> >> What happens when that host / drive smokes before it can replicate? What >> happens if a secondary OSD gets a read op before the primary updates it? >> Swift object storage users have to code around this potential. It's a >> non-starter for block storage. >> >> This is similar to why RoC HBAs (which are a badly outdated thing to begin >> with) will only enter writeback mode if they have a BBU / supercap -- and >> of course if their firmware and hardware isn't pervasively buggy.
Guess how >> I know this? >> >>> This way it would greatly increase performance (both for iops and >> throughput). >> >> It might increase low-QD IOPS for a single client on slow media with >> certain networking. Depending on media, it wouldn't increase throughput. >> >> Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x >> the network resources between the client and the servers. >> >>> Later (in the background), record the replicas. This situation would >> avoid leaving users/software waiting for the recording response from all >> replicas when the storage is overloaded. >> >> If one makes the mistake of using HDDs, they're going to be overloaded no >> matter how one slices and dices the ops. Ya just canna squeeze IOPS from a >> stone. Throughput is going to be limited by the SATA interface and seeking >> no matter what. >> >>> Where I work, performance is very important and we don't have money to >> make an entire cluster only with NVMe. >> >> If there isn't money, then it isn't very important. But as I've written >> before, NVMe clusters *do not cost appreciably more than spinners* unless >> your procurement processes are bad. In fact they can cost significantly >> less. This is especially true with object storage and archival where one >> can leverage QLC. >> >> * Buy generic drives from a VAR, not channel drives through a chassis >> brand. Far less markup, and moreover you get the full 5 year warranty, not >> just 3 years. And you can painlessly RMA drives yourself - you don't have >> to spend hours going back and forth with $chassisvendor's TAC arguing about >> every single RMA. I've found that this is so bad that it is more economical >> to just throw away a failed component worth < USD 500 than to RMA it. Do >> you pay for extended warranty / support? That's expensive too. >> >> * Certain chassis brands who shall remain nameless push RoC HBAs hard with >> extreme markups. List prices as high as USD2000.
Per server, eschewing >> those abominations makes up for a lot of the drive-only unit economics >> >> * But this is the part that lots of people don't get: You don't just stack >> up the drives on a desk and use them. They go into *servers* that cost >> money and *racks* that cost money. They take *power* that costs money.
[ceph-users] Re: Performance improvement suggestion
Hello. I didn't test it personally, but what about a rep 1 write cache pool with nvme backed by another rep 2 pool? In theory, it has exactly the potential you are looking for. On Thu, Feb 1, 2024 at 20:54, quag...@bol.com.br wrote: > > > Ok Anthony, > > I understood what you said. I also believe in all the professional history > and experience you have. > > Anyway, could there be a configuration flag to make this happen? > > As well as those that already exist: "--yes-i-really-mean-it". > > This way, the storage pattern would remain as it is. However, it would > allow situations like the one I mentioned to be possible. > > This situation will permit some rules to be relaxed (even if they are not > ok at first). > Likewise, there are already situations like lazyio that make some > exceptions to standard procedures. > > > Remembering: it's just a suggestion. > If this type of functionality is not interesting, it is ok. > > > Rafael. > > -- > > *From: *"Anthony D'Atri" > *Sent: *2024/02/01 12:10:30 > *To: *quag...@bol.com.br > *Cc: * ceph-users@ceph.io > *Subject: * [ceph-users] Re: Performance improvement suggestion > > > > > I didn't say I would accept the risk of losing data. > > That's implicit in what you suggest, though. > > > I just said that it would be interesting if the objects were first > recorded only in the primary OSD. > > What happens when that host / drive smokes before it can replicate? What > happens if a secondary OSD gets a read op before the primary updates it? > Swift object storage users have to code around this potential. It's a > non-starter for block storage. > > This is similar to why RoC HBAs (which are a badly outdated thing to begin > with) will only enter writeback mode if they have a BBU / supercap -- and > of course if their firmware and hardware isn't pervasively buggy. Guess how > I know this? > > > This way it would greatly increase performance (both for iops and > throughput).
> > It might increase low-QD IOPS for a single client on slow media with > certain networking. Depending on media, it wouldn't increase throughput. > > Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x > the network resources between the client and the servers. > > > Later (in the background), record the replicas. This situation would > avoid leaving users/software waiting for the recording response from all > replicas when the storage is overloaded. > > If one makes the mistake of using HDDs, they're going to be overloaded no > matter how one slices and dices the ops. Ya just canna squeeze IOPS from a > stone. Throughput is going to be limited by the SATA interface and seeking > no matter what. > > > Where I work, performance is very important and we don't have money to > make an entire cluster only with NVMe. > > If there isn't money, then it isn't very important. But as I've written > before, NVMe clusters *do not cost appreciably more than spinners* unless > your procurement processes are bad. In fact they can cost significantly > less. This is especially true with object storage and archival where one > can leverage QLC. > > * Buy generic drives from a VAR, not channel drives through a chassis > brand. Far less markup, and moreover you get the full 5 year warranty, not > just 3 years. And you can painlessly RMA drives yourself - you don't have > to spend hours going back and forth with $chassisvendor's TAC arguing about > every single RMA. I've found that this is so bad that it is more economical > to just throw away a failed component worth < USD 500 than to RMA it. Do > you pay for extended warranty / support? That's expensive too. > > * Certain chassis brands who shall remain nameless push RoC HBAs hard with > extreme markups. List prices as high as USD2000.
Per server, eschewing > those abominations makes up for a lot of the drive-only unit economics > > * But this is the part that lots of people don't get: You don't just stack > up the drives on a desk and use them. They go into *servers* that cost > money and *racks* that cost money. They take *power* that costs money. > > * $ / IOPS are FAR better for ANY SSD than for HDDs > > * RUs cost money, so do chassis and switches > > * Drive failures cost money > > * So does having your people and applications twiddle their thumbs waiting > for stuff to happen. I worked for a supercomputer company who put > low-memory low-end diskless workstations on engineer's desks. They spent > lots of time doing nothing waiting for their applications to respond. This > company no longer exists. > > * So does the risk of taking *
[ceph-users] Re: Performance improvement suggestion
> Hi Anthony, > Did you decide that it's not a feature to be implemented? That isn't up to me. > I'm asking about this so I can offer options here. > > I'd not be comfortable enabling "mon_allow_pool_size_one" on a specific > pool. > > It would be better if this feature could make a replica at a later time on > a selected pool. > Thanks. > Rafael. > > > > From: "Anthony D'Atri" > Sent: 2024/02/01 15:00:59 > To: quag...@bol.com.br > Cc: ceph-users@ceph.io > Subject: [ceph-users] Re: Performance improvement suggestion > > I'd totally defer to the RADOS folks. > > One issue might be adding a separate code path, which can have all sorts of > problems. > > > On Feb 1, 2024, at 12:53, quag...@bol.com.br wrote: > > > > > > > > Ok Anthony, > > > > I understood what you said. I also believe in all the professional history > > and experience you have. > > > > Anyway, could there be a configuration flag to make this happen? > > > > As well as those that already exist: "--yes-i-really-mean-it". > > > > This way, the storage pattern would remain as it is. However, it would > > allow situations like the one I mentioned to be possible. > > > > This situation will permit some rules to be relaxed (even if they are not > > ok at first). > > Likewise, there are already situations like lazyio that make some > > exceptions to standard procedures. > > Remembering: it's just a suggestion. > > If this type of functionality is not interesting, it is ok. > > > > > > > > Rafael. > > > > > > From: "Anthony D'Atri" > > Sent: 2024/02/01 12:10:30 > > To: quag...@bol.com.br > > Cc: ceph-users@ceph.io > > Subject: [ceph-users] Re: Performance improvement suggestion > > > > > > > > > I didn't say I would accept the risk of losing data. > > > > That's implicit in what you suggest, though. > > > > > I just said that it would be interesting if the objects were first > > > recorded only in the primary OSD. > > > > What happens when that host / drive smokes before it can replicate?
What > > happens if a secondary OSD gets a read op before the primary updates it? > > Swift object storage users have to code around this potential. It's a > > non-starter for block storage. > > > > This is similar to why RoC HBAs (which are a badly outdated thing to begin > > with) will only enter writeback mode if they have a BBU / supercap -- and > > of course if their firmware and hardware isn't pervasively buggy. Guess how > > I know this? > > > > > This way it would greatly increase performance (both for iops and > > > throuput). > > > > It might increase low-QD IOPS for a single client on slow media with > > certain networking. Depending on media, it wouldn't increase throughput. > > > > Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x > > the network resources between the client and the servers. > > > > > Later (in the background), record the replicas. This situation would > > > avoid leaving users/software waiting for the recording response from all > > > replicas when the storage is overloaded. > > > > If one makes the mistake of using HDDs, they're going to be overloaded no > > matter how one slices and dices the ops. Ya just canna squeeze IOPS from a > > stone. Throughput is going to be limited by the SATA interface and seeking > > no matter what. > > > > > Where I work, performance is very important and we don't have money to > > > make a entire cluster only with NVMe. > > > > If there isn't money, then it isn't very important. But as I've written > > before, NVMe clusters *do not cost appreciably more than spinners* unless > > your procurement processes are bad. In fact they can cost significantly > > less. This is especially true with object storage and archival where one > > can leverage QLC. > > > > * Buy generic drives from a VAR, not channel drives through a chassis > > brand. Far less markup, and moreover you get the full 5 year warranty, not > > just 3 years. 
And you can painlessly RMA drives yourself - you don't have > > to spend hours going back and forth with $chassisvendor's TAC arguing about > > every single RMA. I've found that this is so bad that it is more economical > > to just throw away a failed component worth < USD 500 than to RMA it. Do > > you pay for extended warranty / support?
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
I'd totally defer to the RADOS folks. One issue might be adding a separate code path, which can have all sorts of problems.
Ok Anthony,

I understood what you said, and I also believe in all the professional history and experience you have.

Anyway, could there be a configuration flag to make this happen? Like the ones that already exist: "--yes-i-really-mean-it". This way the default storage behavior would remain as it is, but it would allow situations like the one I mentioned to be possible. It would permit some rules to be relaxed (even if they are not OK at first). Likewise, there are already mechanisms like lazyio that make exceptions to standard procedures.

Remember: it's just a suggestion. If this type of functionality is not interesting, that is OK.

Rafael.
> I didn't say I would accept the risk of losing data.

That's implicit in what you suggest, though.

> I just said that it would be interesting if the objects were first
> recorded only in the primary OSD.

What happens when that host / drive smokes before it can replicate? What happens if a secondary OSD gets a read op before the primary updates it? Swift object storage users have to code around this potential. It's a non-starter for block storage.

This is similar to why RoC HBAs (which are a badly outdated thing to begin with) will only enter writeback mode if they have a BBU / supercap -- and of course only if their firmware and hardware aren't pervasively buggy. Guess how I know this?

> This way it would greatly increase performance (both for IOPS and
> throughput).

It might increase low-QD IOPS for a single client on slow media with certain networking. Depending on media, it wouldn't increase throughput. Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x the network resources between the client and the servers.

> Later (in the background), record the replicas. This situation would
> avoid leaving users/software waiting for the recording response from all
> replicas when the storage is overloaded.

If one makes the mistake of using HDDs, they're going to be overloaded no matter how one slices and dices the ops. Ya just canna squeeze IOPS from a stone. Throughput is going to be limited by the SATA interface and seeking no matter what.

> Where I work, performance is very important and we don't have money to
> make an entire cluster only with NVMe.

If there isn't money, then it isn't very important. But as I've written before, NVMe clusters *do not cost appreciably more than spinners* unless your procurement processes are bad. In fact they can cost significantly less. This is especially true with object storage and archival, where one can leverage QLC.

* Buy generic drives from a VAR, not channel drives through a chassis brand. Far less markup, and moreover you get the full 5-year warranty, not just 3 years. And you can painlessly RMA drives yourself -- you don't have to spend hours going back and forth with $chassisvendor's TAC arguing about every single RMA. I've found that this is so bad that it is more economical to just throw away a failed component worth < USD 500 than to RMA it. Do you pay for extended warranty / support? That's expensive too.

* Certain chassis brands who shall remain nameless push RoC HBAs hard with extreme markups -- list prices as high as USD 2000. Per server, eschewing those abominations makes up for a lot of the drive-only unit economics.

* But this is the part that lots of people don't get: you don't just stack up the drives on a desk and use them. They go into *servers* that cost money and *racks* that cost money. They take *power* that costs money.

* $ / IOPS are FAR better for ANY SSD than for HDDs.

* RUs cost money; so do chassis and switches.

* Drive failures cost money.

* So does having your people and applications twiddle their thumbs waiting for stuff to happen. I worked for a supercomputer company that put low-memory, low-end diskless workstations on engineers' desks. They spent lots of time doing nothing, waiting for their applications to respond. That company no longer exists.

* So does the risk of taking *weeks* to heal from a drive failure.

Punch honest numbers into https://www.snia.org/forums/cmsi/programs/TCOcalc

I walked through this with a certain global company. QLC SSDs were demonstrated to have something like 30% lower TCO than spinners. Part of the equation is that they were accustomed to limiting HDD size to 8 TB because of the bottlenecks, and thus requiring more servers, more switch ports, more DC racks, more rack/stack time, more administrative overhead. You can fit 1.9 PB of raw SSD capacity in a 1U server. That same RU will hold at most 88 TB of the largest spinners you can get today -- 22 TIMES the density. And since many applications can barely tolerate the spinner bottlenecks, capping spinner size at even 10 TB makes that more like 40 TIMES better density with SSDs.
Hi Janne, thanks for your reply.

I think it would be good to maintain the number of configured replicas; I don't think it's interesting to decrease to size=1. However, I think it is not necessary to write to all disks before releasing the client's request. The replicas could be recorded immediately afterwards, in a second step.

Nowadays, more and more software implements parallel writes through specific libraries (MPI-IO, HDF5, pnetCDF, etc.), so even when the cluster has multiple disks, the objects are written in PARALLEL. The greater the number of processes writing at the same time, the greater the storage load, regardless of the type of disk used (HDD, SSD or NVMe).

That is why I suggest that the initial write be done on only one disk and the replicas be made after the client is released (asynchronously).

Rafael.
Hi Anthony,

Thanks for your reply. I didn't say I would accept the risk of losing data. I just said that it would be interesting if the objects were first recorded only on the primary OSD; this would greatly increase performance (both IOPS and throughput). Later, in the background, the replicas would be recorded. This would avoid leaving users/software waiting for the write acknowledgement from all replicas when the storage is overloaded.

Where I work, performance is very important and we don't have the money to build an entire cluster with only NVMe. However, I don't think it's interesting to lose the functionality of the replicas. I'm just suggesting another way to increase performance without losing that functionality.

Rafael.
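The two acknowledgement models debated in this thread can be sketched with plain Python threads. This is a toy simulation of the semantics, not Ceph code, and the latency numbers are made up:

```python
import threading
import time

def write_to_osd(osd_id, delay, done):
    """Simulate one OSD persisting an object; delay models device latency."""
    time.sleep(delay)
    done[osd_id] = True

def write_sync(latencies):
    """Current model: client is acked only after ALL replicas have written."""
    done = {}
    threads = [threading.Thread(target=write_to_osd, args=(i, d, done))
               for i, d in enumerate(latencies)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()                      # wait for every replica, slowest included
    return time.monotonic() - start, done

def write_async(latencies):
    """Proposed model: ack after the primary (index 0) alone; replicas land later."""
    done = {}
    threads = [threading.Thread(target=write_to_osd, args=(i, d, done))
               for i, d in enumerate(latencies)]
    start = time.monotonic()
    for t in threads:
        t.start()
    threads[0].join()                 # client would be released here
    ack_time = time.monotonic() - start
    for t in threads[1:]:
        t.join()                      # full durability is reached only now
    return ack_time, done

lat = [0.01, 0.01, 0.05]              # made-up latencies; one replica is 5x slower
t_sync, _ = write_sync(lat)
t_ack, _ = write_async(lat)
print(f"sync model acks after  {t_sync:.3f}s (bounded by the slowest replica)")
print(f"async model acks after {t_ack:.3f}s (bounded by the primary only)")
```

The window between the two acks is exactly Anthony's objection: until the background joins complete, the object exists on the primary alone, and losing that drive in the window loses the write.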
> I've heard conflicting asserts on whether the write returns when min_size
> shards have been persisted, or all of them.

I think it waits until all replicas have written the data, but from simplistic tests with a fast network and slow drives, the extra time taken to write many copies is not linear in what it takes to write the first. So unless you go min_size=1 (not recommended at all), the extra copies do not slow you down as much as you'd expect, at least not if the other drives are not 100% busy.

I get that this thread started with one bad drive, and that is another scenario of course, but having repl=2 or repl=3 does not mean writes take 100%-200% more time than the single write; it is less.

--
May the most significant bit of your life be positive.
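Janne's point, that parallel replica writes cost the slowest copy rather than the sum of all copies, is easy to demonstrate numerically. The latencies below are hypothetical uniform noise around 5 ms, not measured Ceph figures:

```python
import random

# Hypothetical per-write device latencies in milliseconds.
random.seed(7)
samples = [[random.uniform(4.0, 6.0) for _ in range(3)] for _ in range(10_000)]

# Replicas are written in parallel, so a replicated op completes when the
# slowest copy lands: its cost is max(latencies), not sum(latencies).
one_copy = sum(s[0] for s in samples) / len(samples)
three_copies = sum(max(s) for s in samples) / len(samples)

print(f"avg single-copy write : {one_copy:.2f} ms")
print(f"avg 3-copy write      : {three_copies:.2f} ms")
print(f"replication overhead  : {100 * (three_copies / one_copy - 1):.0f}%")
```

With three same-class, non-saturated devices, the overhead is the gap between one draw and the maximum of three draws, typically tens of percent, far from the 200% a serial model would predict; the tail grows only when one device is persistently slow.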
I've heard conflicting asserts on whether the write returns when min_size shards have been persisted, or all of them.

> On Jan 31, 2024, at 2:58 PM, Can Özyurt wrote:
>
> I never tried this myself but "min_size = 1" should do what you want to achieve.
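The "ack once min_size of the size copies have persisted" behavior discussed here can be sketched with concurrent.futures. Again a toy model of the semantics only, not RADOS internals, with invented latencies:

```python
import concurrent.futures
import time

def persist(osd_id, delay):
    """Simulate one OSD persisting its copy; delay models device latency."""
    time.sleep(delay)
    return osd_id

def fast_write(latencies, min_size):
    """Send the write to all `size` OSDs in parallel and ack the client as
    soon as `min_size` of them have persisted; stragglers finish on their own."""
    ex = concurrent.futures.ThreadPoolExecutor(max_workers=len(latencies))
    futures = [ex.submit(persist, i, d) for i, d in enumerate(latencies)]
    acked_by = []
    for fut in concurrent.futures.as_completed(futures):
        acked_by.append(fut.result())
        if len(acked_by) >= min_size:
            break                     # client would be unblocked here
    ex.shutdown(wait=False)           # remaining copies keep writing in background
    return acked_by

start = time.monotonic()
acked = fast_write([0.01, 0.02, 0.20], min_size=2)   # one straggler OSD
print(f"acked by OSDs {sorted(acked)} after {time.monotonic() - start:.2f}s")
```

The 0.20 s straggler never sits on the ack path, which is the tail-latency win a fast-write option would buy; what happens if that straggler then fails outright is the open question raised earlier in the thread.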
Would you be willing to accept the risk of data loss?

> On Jan 31, 2024, at 2:48 PM, quag...@bol.com.br wrote:
I never tried this myself but "min_size = 1" should do what you want to achieve.

On Wed, 31 Jan 2024 at 22:48, quag...@bol.com.br wrote:
Hello everybody,

I would like to make a suggestion for improving performance in the Ceph architecture. I don't know if this group is the best place or if my proposal is correct.

My suggestion concerns https://docs.ceph.com/en/latest/architecture/, at the end of the topic "Smart Daemons Enable Hyperscale".

The client needs to wait for the configured number of replicas to be written before it receives an OK and continues. This way, if any of the disks holding the PG is slow, the client is left waiting. It would be possible to:

1) Write only to the primary OSD.
2) Write the other replicas in the background (the same way as when an OSD fails: "degraded").

This way the client gets a faster response when writing to storage, improving latency and performance (throughput and IOPS). I would find it plausible to accept a period of time (seconds) until all replicas are OK (written asynchronously) in exchange for improved performance.

Could you evaluate this scenario?

Rafael.