[ceph-users] Re: Performance improvement suggestion

2024-03-04 Thread Mark Nelson


On 3/4/24 08:40, Maged Mokhtar wrote:


On 04/03/2024 15:37, Frank Schilder wrote:
Fast write enabled would mean that the primary OSD sends #size copies to the
entire active set (including itself) in parallel and sends an ACK to the
client as soon as min_size ACKs have been received from the peers (including
itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow
for whatever reason) without suffering performance penalties immediately
(only after too many requests started piling up, which will show as a slow
requests warning).

What happens if there occurs an error on the slowest OSD after the
min_size ACK has already been sent to the client?


This should not be different from what exists today, unless of course
the error happens on the local/primary OSD.
Can this be addressed with reasonable effort? I don't expect this to
be a quick fix and it should be tested. However, beating the
tail-latency statistics with the extra redundancy should be worth it.
I observe fluctuations of latencies; OSDs become randomly slow for
whatever reason for short time intervals and then return to normal.


A reason for this could be DB compaction. I think during compaction 
latency tends to spike.


A fast-write option would effectively remove the impact of this.

Best regards and thanks for considering this!


I think this is something the RADOS devs need to weigh in on. It does
sound worth investigating, not just for cases with DB compaction but,
more importantly, for the normal (happy) IO path, where it will have
the most impact.



Typically an L0->L1 compaction will have two primary effects:


1) It will cause large IO read/write traffic to the disk, potentially
impacting other IO taking place if the disk is already saturated.


2) It will block memtable flushes until the compaction finishes. This 
means that more and more data will accumulate in the memtables/WAL which 
can trigger throttling and eventually stalls if you run out of buffer 
space.  By default, we allow up to 1GB of writes to WAL/memtables before 
writes are fully stalled, but RocksDB will typically throttle writes 
before you get to that point.  It's possible a larger buffer may allow 
you to absorb traffic spikes for longer at the expense of more disk and 
memory usage.  Ultimately though, if you are hitting throttling, it 
means that the DB can't keep up with the WAL ingestion rate.
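
As a back-of-the-envelope check on that 1GB figure, here is a small sketch in
Python. It assumes the long-standing defaults baked into Ceph's
bluestore_rocksdb_options (write_buffer_size=268435456, i.e. 256 MiB per
memtable, and max_write_buffer_number=4); verify those against your release,
and keep in mind RocksDB also throttles on L0 file counts and pending
compaction bytes, so this is only an upper bound on buffered data.

# Rough estimate of the RocksDB write-stall threshold on an OSD.
# Assumed defaults from bluestore_rocksdb_options (check your release):
#   write_buffer_size=268435456      (256 MiB per memtable)
#   max_write_buffer_number=4
write_buffer_size = 268435456
max_write_buffer_number = 4

stall_threshold = write_buffer_size * max_write_buffer_number
print(f"approx. WAL/memtable data buffered before a full stall: "
      f"{stall_threshold / 2**30:.1f} GiB")

# A larger buffer buys headroom for spikes at the cost of memory and bigger
# flushes; it does not help if the DB persistently can't keep up.
for n in (4, 6, 8):
    print(f"max_write_buffer_number={n}: {write_buffer_size * n / 2**30:.1f} GiB")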



Mark







--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-03-04 Thread Maged Mokhtar



On 04/03/2024 15:37, Frank Schilder wrote:

Fast write enabled would mean that the primary OSD sends #size copies to the
entire active set (including itself) in parallel and sends an ACK to the
client as soon as min_size ACKs have been received from the peers (including
itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow
for whatever reason) without suffering performance penalties immediately
(only after too many requests started piling up, which will show as a slow
requests warning).


What happens if there occurs an error on the slowest OSD after the min_size ACK 
has already been sent to the client?


This should not be different from what exists today, unless of course
the error happens on the local/primary OSD.

Can this be addressed with reasonable effort? I don't expect this to be a 
quick fix and it should be tested. However, beating the tail-latency statistics 
with the extra redundancy should be worth it. I observe fluctuations of 
latencies; OSDs become randomly slow for whatever reason for short time 
intervals and then return to normal.

A reason for this could be DB compaction. I think during compaction latency 
tends to spike.

A fast-write option would effectively remove the impact of this.

Best regards and thanks for considering this!


I think this is something the RADOS devs need to weigh in on. It does sound 
worth investigating, not just for cases with DB compaction but, more 
importantly, for the normal (happy) IO path, where it will have the most impact.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-03-04 Thread Frank Schilder
>>> Fast write enabled would mean that the primary OSD sends #size copies to the
>>> entire active set (including itself) in parallel and sends an ACK to the
>>> client as soon as min_size ACKs have been received from the peers (including
>>> itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow
>>> for whatever reason) without suffering performance penalties immediately
>>> (only after too many requests started piling up, which will show as a slow
>>> requests warning).
>>>
>> What happens if there occurs an error on the slowest osd after the min_size 
>> ACK has already been send to the client?
>>
>This should not be different than what exists today..unless of-course if
>the error happens on the local/primary osd

Can this be addressed with reasonable effort? I don't expect this to be a 
quick fix and it should be tested. However, beating the tail-latency statistics 
with the extra redundancy should be worth it. I observe fluctuations of 
latencies; OSDs become randomly slow for whatever reason for short time 
intervals and then return to normal.

A reason for this could be DB compaction. I think during compaction latency 
tends to spike.

A fast-write option would effectively remove the impact of this.

Best regards and thanks for considering this!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-03-04 Thread Maged Mokhtar



On 04/03/2024 13:35, Marc wrote:

Fast write enabled would mean that the primary OSD sends #size copies to the
entire active set (including itself) in parallel and sends an ACK to the
client as soon as min_size ACKs have been received from the peers (including
itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow
for whatever reason) without suffering performance penalties immediately
(only after too many requests started piling up, which will show as a slow
requests warning).


What happens if there occurs an error on the slowest OSD after the min_size ACK 
has already been sent to the client?

This should not be different from what exists today, unless of course 
the error happens on the local/primary OSD.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-03-04 Thread Marc
> 
> Fast write enabled would mean that the primary OSD sends #size copies to the
> entire active set (including itself) in parallel and sends an ACK to the
> client as soon as min_size ACKs have been received from the peers (including
> itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow
> for whatever reason) without suffering performance penalties immediately
> (only after too many requests started piling up, which will show as a slow
> requests warning).
> 

What happens if there occurs an error on the slowest OSD after the min_size ACK 
has already been sent to the client? 


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-03-04 Thread Frank Schilder
Hi all, coming late to the party but I want to chip in as well with some 
experience.

The problem of tail latencies of individual OSDs is a real pain for any 
redundant storage system. However, there is a way to deal with this in an 
elegant way when using large replication factors. The idea is to use the 
counterpart of the "fast read" option that exists for EC pools and:

1) make this option available to replicated pools as well (is on the road map 
as far as I know), but also
2) implement an option "fast write" for all pool types.

Fast write enabled would mean that the primary OSD sends #size copies to the 
entire active set (including itself) in parallel and sends an ACK to the client 
as soon as min_size ACKs have been received from the peers (including itself). 
In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever 
reason) without suffering performance penalties immediately (only after too 
many requests started piling up, which will show as a slow requests warning).
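
To make the tail-latency argument concrete, here is a small self-contained
simulation (plain Python, not Ceph code; the latency numbers are invented,
only the shape of the effect matters). It compares acking after all #size
commits versus acking after min_size commits for a 3-replica write where each
OSD occasionally has a slow moment:

# Toy model: ack-after-all vs ack-after-min_size for a size=3 pool.
import random

random.seed(42)
SIZE, MIN_SIZE = 3, 2

def osd_latency_ms():
    """One OSD's commit latency: ~2 ms normally, much slower 2% of the time
    (standing in for a compaction or some other transient hiccup)."""
    lat = max(random.gauss(2.0, 0.3), 0.1)
    if random.random() < 0.02:
        lat += random.uniform(20, 100)
    return lat

ack_all, ack_quorum = [], []
for _ in range(100_000):
    lats = sorted(osd_latency_ms() for _ in range(SIZE))
    ack_all.append(lats[-1])               # wait for every replica
    ack_quorum.append(lats[MIN_SIZE - 1])  # wait for min_size of them

def p99(xs):
    return sorted(xs)[int(len(xs) * 0.99)]

print(f"ack after all {SIZE} replicas: p99 = {p99(ack_all):6.1f} ms")
print(f"ack after min_size={MIN_SIZE}:      p99 = {p99(ack_quorum):6.1f} ms")

With a 2% chance of any single OSD being slow, the chance that at least one
of three is slow is about 6%, while the chance that two of three are slow at
the same time is about 0.1%; that difference is exactly the trade between
redundancy and tail latency described above.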

I have fast read enabled on all EC pools. This does increase the 
cluster-internal network traffic, which is nowadays absolutely no problem (in 
the good old 1G times it potentially would be). In return, the read latencies 
on the client side are lower and much more predictable. In effect, the user 
experience improved dramatically.
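
(For anyone who wants to try the read-side counterpart: fast_read is an
existing per-pool flag for EC pools. A minimal sketch, just shelling out to
the standard ceph CLI from Python; the pool name is a placeholder.)

# Enable the existing fast_read flag on an EC pool (placeholder pool name).
import subprocess

pool = "my_ec_pool"
subprocess.run(["ceph", "osd", "pool", "set", pool, "fast_read", "1"], check=True)
subprocess.run(["ceph", "osd", "pool", "get", pool, "fast_read"], check=True)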

I would really wish that such an option gets added as we use wide replication 
profiles (rep-(4,2) and EC(8+3), each with 2 "spare" OSDs) and exploiting large 
replication factors (more precisely, large (size-min_size)) to mitigate the 
impact of slow OSDs would be awesome. It would also add some incentive to stop 
the ridiculous size=2 min_size=1 habit, because one gets an extra gain from 
replication on top of redundancy.

In the long run, the ceph write path should try to deal with a-priori known 
different-latency connections (fast local ACK with async remote completion, was 
asked for a couple of times), for example, for stretched clusters where one has 
an internal connection for the local part and external connections for the 
remote parts. It would be great to have similar ways of mitigating some 
penalties of the slow write paths to remote sites.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Peter Grandi 
Sent: Wednesday, February 21, 2024 1:10 PM
To: list Linux fs Ceph
Subject: [ceph-users] Re: Performance improvement suggestion

> 1. Write object A from client.
> 2. Fsync to primary device completes.
> 3. Ack to client.
> 4. Writes sent to replicas.
[...]

As mentioned in the discussion, this proposal is the opposite of the
current policy, which is to wait for all replicas to be written before
writes are acknowledged to the client:

https://github.com/ceph/ceph/blob/main/doc/architecture.rst

   "After identifying the target placement group, the client
   writes the object to the identified placement group's primary
   OSD. The primary OSD then [...] confirms that the object was
   stored successfully in the secondary and tertiary OSDs, and
   reports to the client that the object was stored
   successfully."

A more revolutionary option would be for 'librados' to write in
parallel to all the "active set" OSDs and report this to the
primary, but that would greatly increase client-Ceph traffic,
while the current logic increases traffic only among OSDs.

> So I think that to maintain any semblance of reliability,
> you'd need to at least wait for a commit ack from the first
> replica (i.e. min_size=2).

Perhaps it could be similar to 'k'+'m' for EC, that is 'k'
synchronous (write completes to the client only when all at
least 'k' replicas, including primary, have been committed) and
'm' asynchronous, instead of 'k' being just 1 or 2.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-02-21 Thread Peter Grandi
> 1. Write object A from client.
> 2. Fsync to primary device completes.
> 3. Ack to client.
> 4. Writes sent to replicas.
[...]

As mentioned in the discussion, this proposal is the opposite of the
current policy, which is to wait for all replicas to be written before
writes are acknowledged to the client:

https://github.com/ceph/ceph/blob/main/doc/architecture.rst

   "After identifying the target placement group, the client
   writes the object to the identified placement group's primary
   OSD. The primary OSD then [...] confirms that the object was
   stored successfully in the secondary and tertiary OSDs, and
   reports to the client that the object was stored
   successfully."

A more revolutionary option would be for 'librados' to write in
parallel to all the "active set" OSDs and report this to the
primary, but that would greatly increase client-Ceph traffic,
while the current logic increases traffic only among OSDs.

> So I think that to maintain any semblance of reliability,
> you'd need to at least wait for a commit ack from the first
> replica (i.e. min_size=2).

Perhaps it could be similar to 'k'+'m' for EC, that is 'k'
synchronous (write completes to the client only when all at
least 'k' replicas, including primary, have been committed) and
'm' asynchronous, instead of 'k' being just 1 or 2.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Dan van der Ster
Hi,

I just want to echo what the others are saying.

Keep in mind that RADOS needs to guarantee read-after-write consistency for
the higher level apps to work (RBD, RGW, CephFS). If you corrupt VM block
devices, S3 objects or bucket metadata/indexes, or CephFS metadata, you're
going to suffer some long days and nights recovering.

Anyway, I think that what you proposed has at best a similar reliability to
min_size=1. And note that min_size=1 is strongly discouraged because of the
very high likelihood that a device/network/power failure turns into a
visible outage. In short: your idea would turn every OSD into a SPoF.

How would you handle this very common scenario: a power outage followed by
at least one device failing to start afterwards?

1. Write object A from client.
2. Fsync to primary device completes.
3. Ack to client.
4. Writes sent to replicas.
5. Cluster wide power outage (before replicas committed).
6. Power restored, but the primary osd does not start (e.g. permanent hdd
failure).
7. Client tries to read object A.

Today, with min_size=1 such a scenario manifests as data loss: you get
either a down PG (with many many objects offline/IO blocked until you
manually decide which data loss mode to accept) or unfound objects (with
IO blocked until you accept data loss). With min_size=2 the likelihood of
data loss is dramatically reduced.

Another thing about that power loss scenario is that all dirty PGs would
need to be recovered when the cluster reboots. You'd lose all the writes in
transit and have to replay them from the primary's pg_log, or backfill if
the pg_log was too short. Again, any failure during that recovery would
lead to data loss.

So I think that to maintain any semblance of reliability, you'd need to at
least wait for a commit ack from the first replica (i.e. min_size=2). But
since the replica writes are dispatched in parallel, your speedup would
evaporate.

Another thing: I suspect this idea would result in many inconsistencies
from transient issues. You'd need to ramp up the number of parallel
deep-scrubs to look for those inconsistencies quickly, which would also
work against any potential speedup.

Cheers, Dan

--
Dan van der Ster
CTO

Clyso GmbH
w: https://clyso.com | e: dan.vanders...@clyso.com

Try our Ceph Analyzer!: https://analyzer.clyso.com/
We are hiring: https://www.clyso.com/jobs/


On Wed, Jan 31, 2024, 11:49 quag...@bol.com.br  wrote:

> Hello everybody,
>  I would like to make a suggestion for improving performance in Ceph
> architecture.
>  I don't know if this group would be the best place or if my proposal
> is correct.
>
>  My suggestion would be in the item
> https://docs.ceph.com/en/latest/architecture/, at the end of the topic
> "Smart Daemons Enable Hyperscale".
>
>  The Client needs to "wait" for the configured amount of replicas to
> be written (so that the client receives an ok and continues). This way, if
> there is slowness on any of the disks on which the PG will be updated, the
> client is left waiting.
>
>  It would be possible:
>
>  1-) Only record on the primary OSD
>  2-) Write other replicas in background (like the same way as when an
> OSD fails: "degraded" ).
>
>  This way, client has a faster response when writing to storage:
> improving latency and performance (throughput and IOPS).
>
>  I would find it plausible to accept a period of time (seconds) until
> all replicas are ok (written asynchronously) at the expense of improving
> performance.
>
>  Could you evaluate this scenario?
>
>
> Rafael.
>
>  ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Alex Gorbachev
I would be against such an option, because it introduces a significant risk
of data loss.  Ceph has made a name for itself as a very reliable system,
where almost no one lost data, no matter how bad of a decision they made
with architecture and design.  This is what you pay for in commercial
systems, to "not be allowed a bad choice", and this is what everyone gets
with Ceph for free (if they so choose).

Allowing a change like this would likely be the beginning of the end of
Ceph.  It is a bad idea in the extreme.  Ceph reliability should never be
compromised.

There are other options for storage that are robust and do not require as
much investment.  Use ZFS, with NFS if needed.  Use bcache/flashcache, or
something similar on the client side.  Use proper RAM caching in databases
and applications.
--
Alex Gorbachev
Intelligent Systems Services Inc.
STORCIUM



On Tue, Feb 20, 2024 at 3:04 PM Anthony D'Atri 
wrote:

>
>
> > Hi Anthony,
> >  Did you decide that it's not a feature to be implemented?
>
> That isn't up to me.
>
> >  I'm asking about this so I can offer options here.
> >
> >  I'd not be confortable to enable "mon_allow_pool_size_one" at a
> specific pool.
> >
> > It would be better if this feature could make a replica at a second time
> on selected pool.
> > Thanks.
> > Rafael.
> >
> >
> >
> > De: "Anthony D'Atri" 
> > Enviada: 2024/02/01 15:00:59
> > Para: quag...@bol.com.br
> > Cc: ceph-users@ceph.io
> > Assunto: [ceph-users] Re: Performance improvement suggestion
> >
> > I'd totally defer to the RADOS folks.
> >
> > One issue might be adding a separate code path, which can have all sorts
> of problems.
> >
> > > On Feb 1, 2024, at 12:53, quag...@bol.com.br wrote:
> > >
> > >
> > >
> > > Ok Anthony,
> > >
> > > I understood what you said. I also believe in all the professional
> history and experience you have.
> > >
> > > Anyway, could there be a configuration flag to make this happen?
> > >
> > > As well as those that already exist: "--yes-i-really-mean-it".
> > >
> > > This way, the storage pattern would remain as it is. However, it would
> allow situations like the one I mentioned to be possible.
> > >
> > > This situation will permit some rules to be relaxed (even if they are
> not ok at first).
> > > Likewise, there are already situations like lazyio that make some
> exceptions to standard procedures.
> > > Remembering: it's just a suggestion.
> > > If this type of functionality is not interesting, it is ok.
> > >
> > >
> > >
> > > Rafael.
> > >
> > >
> > > De: "Anthony D'Atri" 
> > > Enviada: 2024/02/01 12:10:30
> > > Para: quag...@bol.com.br
> > > Cc: ceph-users@ceph.io
> > > Assunto: [ceph-users] Re: Performance improvement suggestion
> > >
> > >
> > >
> > > > I didn't say I would accept the risk of losing data.
> > >
> > > That's implicit in what you suggest, though.
> > >
> > > > I just said that it would be interesting if the objects were first
> recorded only in the primary OSD.
> > >
> > > What happens when that host / drive smokes before it can replicate?
> What happens if a secondary OSD gets a read op before the primary updates
> it? Swift object storage users have to code around this potential. It's a
> non-starter for block storage.
> > >
> > > This is similar to why RoC HBAs (which are a badly outdated thing to
> begin with) will only enter writeback mode if they have a BBU / supercap --
> and of course if their firmware and hardware isn't pervasively buggy. Guess
> how I know this?
> > >
> > > > This way it would greatly increase performance (both for iops and
> throuput).
> > >
> > > It might increase low-QD IOPS for a single client on slow media with
> certain networking. Depending on media, it wouldn't increase throughput.
> > >
> > > Consider QEMU drive-mirror. If you're doing RF=3 replication, you use
> 3x the network resources between the client and the servers.
> > >
> > > > Later (in the background), record the replicas. This situation would
> avoid leaving users/software waiting for the recording response from all
> replicas when the storage is overloaded.
> > >
> > > If one makes the mistake of using HDDs, they're going to be overloaded
> no matter how one slices and dices the ops. Ya just canna squeeze IOPS from
> a stone. Throug

[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Anthony D'Atri
Cache tiering is deprecated.

> On Feb 20, 2024, at 17:03, Özkan Göksu  wrote:
> 
> Hello.
> 
> I didn't test it personally but what about rep 1 write cache pool with nvme
> backed by another rep 2 pool?
> 
> It has the potential exactly what you are looking for in theory.
> 
> 
> 1 Şub 2024 Per 20:54 tarihinde quag...@bol.com.br  şunu
> yazdı:
> 
>> 
>> 
>> Ok Anthony,
>> 
>> I understood what you said. I also believe in all the professional history
>> and experience you have.
>> 
>> Anyway, could there be a configuration flag to make this happen?
>> 
>> As well as those that already exist: "--yes-i-really-mean-it".
>> 
>> This way, the storage pattern would remain as it is. However, it would
>> allow situations like the one I mentioned to be possible.
>> 
>> This situation will permit some rules to be relaxed (even if they are not
>> ok at first).
>> Likewise, there are already situations like lazyio that make some
>> exceptions to standard procedures.
>> 
>> 
>> Remembering: it's just a suggestion.
>> If this type of functionality is not interesting, it is ok.
>> 
>> 
>> Rafael.
>> 
>> --
>> 
>> *De: *"Anthony D'Atri" 
>> *Enviada: *2024/02/01 12:10:30
>> *Para: *quag...@bol.com.br
>> *Cc: * ceph-users@ceph.io
>> *Assunto: * [ceph-users] Re: Performance improvement suggestion
>> 
>> 
>> 
>>> I didn't say I would accept the risk of losing data.
>> 
>> That's implicit in what you suggest, though.
>> 
>>> I just said that it would be interesting if the objects were first
>> recorded only in the primary OSD.
>> 
>> What happens when that host / drive smokes before it can replicate? What
>> happens if a secondary OSD gets a read op before the primary updates it?
>> Swift object storage users have to code around this potential. It's a
>> non-starter for block storage.
>> 
>> This is similar to why RoC HBAs (which are a badly outdated thing to begin
>> with) will only enter writeback mode if they have a BBU / supercap -- and
>> of course if their firmware and hardware isn't pervasively buggy. Guess how
>> I know this?
>> 
>>> This way it would greatly increase performance (both for iops and
>> throuput).
>> 
>> It might increase low-QD IOPS for a single client on slow media with
>> certain networking. Depending on media, it wouldn't increase throughput.
>> 
>> Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x
>> the network resources between the client and the servers.
>> 
>>> Later (in the background), record the replicas. This situation would
>> avoid leaving users/software waiting for the recording response from all
>> replicas when the storage is overloaded.
>> 
>> If one makes the mistake of using HDDs, they're going to be overloaded no
>> matter how one slices and dices the ops. Ya just canna squeeze IOPS from a
>> stone. Throughput is going to be limited by the SATA interface and seeking
>> no matter what.
>> 
>>> Where I work, performance is very important and we don't have money to
>> make a entire cluster only with NVMe.
>> 
>> If there isn't money, then it isn't very important. But as I've written
>> before, NVMe clusters *do not cost appreciably more than spinners* unless
>> your procurement processes are bad. In fact they can cost significantly
>> less. This is especially true with object storage and archival where one
>> can leverage QLC.
>> 
>> * Buy generic drives from a VAR, not channel drives through a chassis
>> brand. Far less markup, and moreover you get the full 5 year warranty, not
>> just 3 years. And you can painlessly RMA drives yourself - you don't have
>> to spend hours going back and forth with $chassisvendor's TAC arguing about
>> every single RMA. I've found that this is so bad that it is more economical
>> to just throw away a failed component worth < USD 500 than to RMA it. Do
>> you pay for extended warranty / support? That's expensive too.
>> 
>> * Certain chassis brands who shall remain nameless push RoC HBAs hard with
>> extreme markups. List prices as high as USD2000. Per server, eschewing
>> those abominations makes up for a lot of the drive-only unit economics
>> 
>> * But this is the part that lots of people don't get: You don't just stack
>> up the drives on a desk and use them. They go into *servers* that cost
>> money and *racks* that cost money. They take *power* that cost

[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Özkan Göksu
Hello.

I didn't test it personally, but what about a rep 1 write cache pool on NVMe
backed by another rep 2 pool?

In theory, it has the potential to do exactly what you are looking for.


On Thu, 1 Feb 2024 at 20:54, quag...@bol.com.br wrote:

>
>
> Ok Anthony,
>
> I understood what you said. I also believe in all the professional history
> and experience you have.
>
> Anyway, could there be a configuration flag to make this happen?
>
> As well as those that already exist: "--yes-i-really-mean-it".
>
> This way, the storage pattern would remain as it is. However, it would
> allow situations like the one I mentioned to be possible.
>
> This situation will permit some rules to be relaxed (even if they are not
> ok at first).
> Likewise, there are already situations like lazyio that make some
> exceptions to standard procedures.
>
>
> Remembering: it's just a suggestion.
> If this type of functionality is not interesting, it is ok.
>
>
> Rafael.
>
> --
>
> *De: *"Anthony D'Atri" 
> *Enviada: *2024/02/01 12:10:30
> *Para: *quag...@bol.com.br
> *Cc: * ceph-users@ceph.io
> *Assunto: * [ceph-users] Re: Performance improvement suggestion
>
>
>
> > I didn't say I would accept the risk of losing data.
>
> That's implicit in what you suggest, though.
>
> > I just said that it would be interesting if the objects were first
> recorded only in the primary OSD.
>
> What happens when that host / drive smokes before it can replicate? What
> happens if a secondary OSD gets a read op before the primary updates it?
> Swift object storage users have to code around this potential. It's a
> non-starter for block storage.
>
> This is similar to why RoC HBAs (which are a badly outdated thing to begin
> with) will only enter writeback mode if they have a BBU / supercap -- and
> of course if their firmware and hardware isn't pervasively buggy. Guess how
> I know this?
>
> > This way it would greatly increase performance (both for iops and
> throuput).
>
> It might increase low-QD IOPS for a single client on slow media with
> certain networking. Depending on media, it wouldn't increase throughput.
>
> Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x
> the network resources between the client and the servers.
>
> > Later (in the background), record the replicas. This situation would
> avoid leaving users/software waiting for the recording response from all
> replicas when the storage is overloaded.
>
> If one makes the mistake of using HDDs, they're going to be overloaded no
> matter how one slices and dices the ops. Ya just canna squeeze IOPS from a
> stone. Throughput is going to be limited by the SATA interface and seeking
> no matter what.
>
> > Where I work, performance is very important and we don't have money to
> make a entire cluster only with NVMe.
>
> If there isn't money, then it isn't very important. But as I've written
> before, NVMe clusters *do not cost appreciably more than spinners* unless
> your procurement processes are bad. In fact they can cost significantly
> less. This is especially true with object storage and archival where one
> can leverage QLC.
>
> * Buy generic drives from a VAR, not channel drives through a chassis
> brand. Far less markup, and moreover you get the full 5 year warranty, not
> just 3 years. And you can painlessly RMA drives yourself - you don't have
> to spend hours going back and forth with $chassisvendor's TAC arguing about
> every single RMA. I've found that this is so bad that it is more economical
> to just throw away a failed component worth < USD 500 than to RMA it. Do
> you pay for extended warranty / support? That's expensive too.
>
> * Certain chassis brands who shall remain nameless push RoC HBAs hard with
> extreme markups. List prices as high as USD2000. Per server, eschewing
> those abominations makes up for a lot of the drive-only unit economics
>
> * But this is the part that lots of people don't get: You don't just stack
> up the drives on a desk and use them. They go into *servers* that cost
> money and *racks* that cost money. They take *power* that costs money.
>
> * $ / IOPS are FAR better for ANY SSD than for HDDs
>
> * RUs cost money, so do chassis and switches
>
> * Drive failures cost money
>
> * So does having your people and applications twiddle their thumbs waiting
> for stuff to happen. I worked for a supercomputer company who put
> low-memory low-end diskless workstations on engineer's desks. They spent
> lots of time doing nothing waiting for their applications to respond. This
> company no longer exists.
>
> * So does the risk of taking *

[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Anthony D'Atri


> Hi Anthony,
>  Did you decide that it's not a feature to be implemented?

That isn't up to me.

>  I'm asking about this so I can offer options here.
> 
>  I'd not be confortable to enable "mon_allow_pool_size_one" at a specific 
> pool.
> 
> It would be better if this feature could make a replica at a second time on 
> selected pool.
> Thanks.
> Rafael.
> 
>  
> 
> De: "Anthony D'Atri" 
> Enviada: 2024/02/01 15:00:59
> Para: quag...@bol.com.br
> Cc: ceph-users@ceph.io
> Assunto: [ceph-users] Re: Performance improvement suggestion
>  
> I'd totally defer to the RADOS folks.
> 
> One issue might be adding a separate code path, which can have all sorts of 
> problems.
> 
> > On Feb 1, 2024, at 12:53, quag...@bol.com.br wrote:
> >
> >
> >
> > Ok Anthony,
> >
> > I understood what you said. I also believe in all the professional history 
> > and experience you have.
> >
> > Anyway, could there be a configuration flag to make this happen?
> >
> > As well as those that already exist: "--yes-i-really-mean-it".
> >
> > This way, the storage pattern would remain as it is. However, it would 
> > allow situations like the one I mentioned to be possible.
> >
> > This situation will permit some rules to be relaxed (even if they are not 
> > ok at first).
> > Likewise, there are already situations like lazyio that make some 
> > exceptions to standard procedures.
> > Remembering: it's just a suggestion.
> > If this type of functionality is not interesting, it is ok.
> >
> >
> >
> > Rafael.
> >
> >
> > De: "Anthony D'Atri" 
> > Enviada: 2024/02/01 12:10:30
> > Para: quag...@bol.com.br
> > Cc: ceph-users@ceph.io
> > Assunto: [ceph-users] Re: Performance improvement suggestion
> >
> >
> >
> > > I didn't say I would accept the risk of losing data.
> >
> > That's implicit in what you suggest, though.
> >
> > > I just said that it would be interesting if the objects were first 
> > > recorded only in the primary OSD.
> >
> > What happens when that host / drive smokes before it can replicate? What 
> > happens if a secondary OSD gets a read op before the primary updates it? 
> > Swift object storage users have to code around this potential. It's a 
> > non-starter for block storage.
> >
> > This is similar to why RoC HBAs (which are a badly outdated thing to begin 
> > with) will only enter writeback mode if they have a BBU / supercap -- and 
> > of course if their firmware and hardware isn't pervasively buggy. Guess how 
> > I know this?
> >
> > > This way it would greatly increase performance (both for iops and 
> > > throuput).
> >
> > It might increase low-QD IOPS for a single client on slow media with 
> > certain networking. Depending on media, it wouldn't increase throughput.
> >
> > Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x 
> > the network resources between the client and the servers.
> >
> > > Later (in the background), record the replicas. This situation would 
> > > avoid leaving users/software waiting for the recording response from all 
> > > replicas when the storage is overloaded.
> >
> > If one makes the mistake of using HDDs, they're going to be overloaded no 
> > matter how one slices and dices the ops. Ya just canna squeeze IOPS from a 
> > stone. Throughput is going to be limited by the SATA interface and seeking 
> > no matter what.
> >
> > > Where I work, performance is very important and we don't have money to 
> > > make a entire cluster only with NVMe.
> >
> > If there isn't money, then it isn't very important. But as I've written 
> > before, NVMe clusters *do not cost appreciably more than spinners* unless 
> > your procurement processes are bad. In fact they can cost significantly 
> > less. This is especially true with object storage and archival where one 
> > can leverage QLC.
> >
> > * Buy generic drives from a VAR, not channel drives through a chassis 
> > brand. Far less markup, and moreover you get the full 5 year warranty, not 
> > just 3 years. And you can painlessly RMA drives yourself - you don't have 
> > to spend hours going back and forth with $chassisvendor's TAC arguing about 
> > every single RMA. I've found that this is so bad that it is more economical 
> > to just throw away a failed component worth < USD 500 than to RMA it. Do 
> > you pay for extended warranty / support? 



[ceph-users] Re: Performance improvement suggestion

2024-02-01 Thread Anthony D'Atri
I'd totally defer to the RADOS folks.

One issue might be adding a separate code path, which can have all sorts of 
problems.

> On Feb 1, 2024, at 12:53, quag...@bol.com.br wrote:
> 
>  
>  
> Ok Anthony,
> 
> I understood what you said. I also believe in all the professional history 
> and experience you have.
> 
> Anyway, could there be a configuration flag to make this happen?
> 
> As well as those that already exist: "--yes-i-really-mean-it".
> 
> This way, the storage pattern would remain as it is. However, it would allow 
> situations like the one I mentioned to be possible.
> 
> This situation will permit some rules to be relaxed (even if they are not ok 
> at first).
> Likewise, there are already situations like lazyio that make some exceptions 
> to standard procedures.
> Remembering: it's just a suggestion.
> If this type of functionality is not interesting, it is ok.
> 
> 
> 
> Rafael.
>  
> 
> De: "Anthony D'Atri" 
> Enviada: 2024/02/01 12:10:30
> Para: quag...@bol.com.br
> Cc: ceph-users@ceph.io
> Assunto: [ceph-users] Re: Performance improvement suggestion
>  
> 
> 
> > I didn't say I would accept the risk of losing data.
> 
> That's implicit in what you suggest, though.
> 
> > I just said that it would be interesting if the objects were first recorded 
> > only in the primary OSD.
> 
> What happens when that host / drive smokes before it can replicate? What 
> happens if a secondary OSD gets a read op before the primary updates it? 
> Swift object storage users have to code around this potential. It's a 
> non-starter for block storage.
> 
> This is similar to why RoC HBAs (which are a badly outdated thing to begin 
> with) will only enter writeback mode if they have a BBU / supercap -- and of 
> course if their firmware and hardware isn't pervasively buggy. Guess how I 
> know this?
> 
> > This way it would greatly increase performance (both for iops and throuput).
> 
> It might increase low-QD IOPS for a single client on slow media with certain 
> networking. Depending on media, it wouldn't increase throughput.
> 
> Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x the 
> network resources between the client and the servers.
> 
> > Later (in the background), record the replicas. This situation would avoid 
> > leaving users/software waiting for the recording response from all replicas 
> > when the storage is overloaded.
> 
> If one makes the mistake of using HDDs, they're going to be overloaded no 
> matter how one slices and dices the ops. Ya just canna squeeze IOPS from a 
> stone. Throughput is going to be limited by the SATA interface and seeking no 
> matter what.
> 
> > Where I work, performance is very important and we don't have money to make 
> > a entire cluster only with NVMe.
> 
> If there isn't money, then it isn't very important. But as I've written 
> before, NVMe clusters *do not cost appreciably more than spinners* unless 
> your procurement processes are bad. In fact they can cost significantly less. 
> This is especially true with object storage and archival where one can 
> leverage QLC.
> 
> * Buy generic drives from a VAR, not channel drives through a chassis brand. 
> Far less markup, and moreover you get the full 5 year warranty, not just 3 
> years. And you can painlessly RMA drives yourself - you don't have to spend 
> hours going back and forth with $chassisvendor's TAC arguing about every 
> single RMA. I've found that this is so bad that it is more economical to just 
> throw away a failed component worth < USD 500 than to RMA it. Do you pay for 
> extended warranty / support? That's expensive too.
> 
> * Certain chassis brands who shall remain nameless push RoC HBAs hard with 
> extreme markups. List prices as high as USD2000. Per server, eschewing those 
> abominations makes up for a lot of the drive-only unit economics
> 
> * But this is the part that lots of people don't get: You don't just stack up 
> the drives on a desk and use them. They go into *servers* that cost money and 
> *racks* that cost money. They take *power* that costs money.
> 
> * $ / IOPS are FAR better for ANY SSD than for HDDs
> 
> * RUs cost money, so do chassis and switches
> 
> * Drive failures cost money
> 
> * So does having your people and applications twiddle their thumbs waiting 
> for stuff to happen. I worked for a supercomputer company who put low-memory 
> low-end diskless workstations on engineer's desks. They spent lots of time 
> doing nothing waiting for their applications to respond. This company no 
> longer exists.
> 
> * So does the risk of taking *weeks* to heal from a d

[ceph-users] Re: Performance improvement suggestion

2024-02-01 Thread quag...@bol.com.br
 
 
Ok Anthony,

I understood what you said. I also believe in all the professional history and experience you have.

Anyway, could there be a configuration flag to make this happen?

As well as those that already exist: "--yes-i-really-mean-it".

This way, the storage pattern would remain as it is. However, it would allow situations like the one I mentioned to be possible.

This situation will permit some rules to be relaxed (even if they are not ok at first).
Likewise, there are already situations like lazyio that make some exceptions to standard procedures.
 
Remembering: it's just a suggestion.
If this type of functionality is not interesting, it is ok.


Rafael.
 


De: "Anthony D'Atri" 
Enviada: 2024/02/01 12:10:30
Para: quag...@bol.com.br
Cc:  ceph-users@ceph.io
Assunto:  [ceph-users] Re: Performance improvement suggestion
 


> I didn't say I would accept the risk of losing data.

That's implicit in what you suggest, though.

> I just said that it would be interesting if the objects were first recorded only in the primary OSD.

What happens when that host / drive smokes before it can replicate? What happens if a secondary OSD gets a read op before the primary updates it? Swift object storage users have to code around this potential. It's a non-starter for block storage.

This is similar to why RoC HBAs (which are a badly outdated thing to begin with) will only enter writeback mode if they have a BBU / supercap -- and of course if their firmware and hardware isn't pervasively buggy. Guess how I know this?

> This way it would greatly increase performance (both for iops and throuput).

It might increase low-QD IOPS for a single client on slow media with certain networking. Depending on media, it wouldn't increase throughput.

Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x the network resources between the client and the servers.

> Later (in the background), record the replicas. This situation would avoid leaving users/software waiting for the recording response from all replicas when the storage is overloaded.

If one makes the mistake of using HDDs, they're going to be overloaded no matter how one slices and dices the ops. Ya just canna squeeze IOPS from a stone. Throughput is going to be limited by the SATA interface and seeking no matter what.

> Where I work, performance is very important and we don't have money to make a entire cluster only with NVMe.

If there isn't money, then it isn't very important. But as I've written before, NVMe clusters *do not cost appreciably more than spinners* unless your procurement processes are bad. In fact they can cost significantly less. This is especially true with object storage and archival where one can leverage QLC.

* Buy generic drives from a VAR, not channel drives through a chassis brand. Far less markup, and moreover you get the full 5 year warranty, not just 3 years. And you can painlessly RMA drives yourself - you don't have to spend hours going back and forth with $chassisvendor's TAC arguing about every single RMA. I've found that this is so bad that it is more economical to just throw away a failed component worth < USD 500 than to RMA it. Do you pay for extended warranty / support? That's expensive too.

* Certain chassis brands who shall remain nameless push RoC HBAs hard with extreme markups. List prices as high as USD2000. Per server, eschewing those abominations makes up for a lot of the drive-only unit economics

* But this is the part that lots of people don't get: You don't just stack up the drives on a desk and use them. They go into *servers* that cost money and *racks* that cost money. They take *power* that costs money.

* $ / IOPS are FAR better for ANY SSD than for HDDs

* RUs cost money, so do chassis and switches

* Drive failures cost money

* So does having your people and applications twiddle their thumbs waiting for stuff to happen. I worked for a supercomputer company who put low-memory low-end diskless workstations on engineer's desks. They spent lots of time doing nothing waiting for their applications to respond. This company no longer exists.

* So does the risk of taking *weeks* to heal from a drive failure

Punch honest numbers into https://www.snia.org/forums/cmsi/programs/TCOcalc

I walked through this with a certain global company. QLC SSDs were demonstrated to have like 30% lower TCO than spinners. Part of the equation is that they were accustomed to limiting HDD size to 8 TB because of the bottlenecks, and thus requiring more servers, more switch ports, more DC racks, more rack/stack time, more administrative overhead. You can fit 1.9 PB of raw SSD capacity in a 1U server. That same RU will hold at most 88 TB of the largest spinners you can get today. 22 TIMES the density. And since many applications can't even barely tolerate the spinner bottlenecks, capping spinner size at even 10T makes that like 40 TIMES better density with SSDs.


> Howeve



[ceph-users] Re: Performance improvement suggestion

2024-02-01 Thread Anthony D'Atri


>  I didn't say I would accept the risk of losing data.

That's implicit in what you suggest, though.

>  I just said that it would be interesting if the objects were first 
> recorded only in the primary OSD.

What happens when that host / drive smokes before it can replicate?  What 
happens if a secondary OSD gets a read op before the primary updates it?  Swift 
object storage users have to code around this potential.  It's a non-starter 
for block storage.

This is similar to why RoC HBAs (which are a badly outdated thing to begin 
with) will only enter writeback mode if they have a BBU / supercap -- and of 
course if their firmware and hardware isn't pervasively buggy.  Guess how I 
know this?

>  This way it would greatly increase performance (both for iops and 
> throuput).

It might increase low-QD IOPS for a single client on slow media with certain 
networking.  Depending on media, it wouldn't increase throughput.

Consider QEMU drive-mirror.  If you're doing RF=3 replication, you use 3x the 
network resources between the client and the servers.

>  Later (in the background), record the replicas. This situation would 
> avoid leaving users/software waiting for the recording response from all 
> replicas when the storage is overloaded.

If one makes the mistake of using HDDs, they're going to be overloaded no 
matter how one slices and dices the ops.  Ya just canna squeeze IOPS from a 
stone.  Throughput is going to be limited by the SATA interface and seeking no 
matter what.

>  Where I work, performance is very important and we don't have money to 
> make a entire cluster only with NVMe.

If there isn't money, then it isn't very important.  But as I've written 
before, NVMe clusters *do not cost appreciably more than spinners* unless your 
procurement processes are bad.  In fact they can cost significantly less.  This 
is especially true with object storage and archival where one can leverage QLC. 

* Buy generic drives from a VAR, not channel drives through a chassis brand.  
Far less markup, and moreover you get the full 5 year warranty, not just 3 
years.  And you can painlessly RMA drives yourself - you don't have to spend 
hours going back and forth with $chassisvendor's TAC arguing about every single 
RMA.  I've found that this is so bad that it is more economical to just throw 
away a failed component worth < USD 500 than to RMA it.  Do you pay for 
extended warranty / support?  That's expensive too.

* Certain chassis brands who shall remain nameless push RoC HBAs hard with 
extreme markups.  List prices as high as USD2000.  Per server, eschewing those 
abominations makes up for a lot of the drive-only unit economics

* But this is the part that lots of people don't get:  You don't just stack up 
the drives on a desk and use them.  They go into *servers* that cost money and 
*racks* that cost money.  They take *power* that costs money.

* $ / IOPS are FAR better for ANY SSD than for HDDs

* RUs cost money, so do chassis and switches

* Drive failures cost money

* So does having your people and applications twiddle their thumbs waiting for 
stuff to happen.  I worked for a supercomputer company who put low-memory 
low-end diskless workstations on engineer's desks.  They spent lots of time 
doing nothing waiting for their applications to respond.  This company no 
longer exists.

* So does the risk of taking *weeks* to heal from a drive failure

Punch honest numbers into https://www.snia.org/forums/cmsi/programs/TCOcalc

 I walked through this with a certain global company.  QLC SSDs were 
demonstrated to have like 30% lower TCO than spinners.  Part of the equation is 
that they were accustomed to limiting HDD size to 8 TB because of the 
bottlenecks, and thus requiring more servers, more switch ports, more DC racks, 
more rack/stack time, more administrative overhead.  You can fit 1.9 PB of raw 
SSD capacity in a 1U server.  That same RU will hold at most 88 TB of the 
largest spinners you can get today.  22 TIMES the density.  And since many 
applications can't even barely tolerate the spinner bottlenecks, capping 
spinner size at even 10T makes that like 40 TIMES better density with SSDs.
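
(The density figures above are easy to sanity-check; the short sketch below
assumes a typical 4-bay LFF 1U chassis on the spinner side and takes the
1.9 PB flash figure as quoted.)

# Rack-density sanity check for the numbers quoted above.
ssd_tb_per_1u = 1900        # ~1.9 PB of raw flash in a 1U server, as quoted
hdd_tb_per_1u = 4 * 22      # assumed 4 LFF bays x 22 TB, largest spinner cited
hdd_tb_capped = 4 * 10      # same chassis with spinners capped at 10 TB

print(ssd_tb_per_1u / hdd_tb_per_1u)   # ~21.6x -> the "22 TIMES" figure
print(ssd_tb_per_1u / hdd_tb_capped)   # ~47x with spinners capped at 10 TB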


> However, I don't think it's interesting to lose the functionality of the 
> replicas.
>  I'm just suggesting another way to increase performance without losing 
> the functionality of replicas.
> 
> 
> Rafael.
>  
> 
> De: "Anthony D'Atri" 
> Enviada: 2024/01/31 17:04:08
> Para: quag...@bol.com.br
> Cc: ceph-users@ceph.io
> Assunto: Re: [ceph-users] Performance improvement suggestion
>  
> Would you be willing to accept the risk of data loss?
>  
>> 
>> On Jan 31, 2024, at 2:48 PM, quag...@bol.com.br wrote:
>>  
>> Hello everybody,
>>  I would like to make a suggestion for improving performance in Ceph 
>> architecture.
>>  I don't know if this group would be the best place or if my proposal is 
>> correct.
>> 
>>  My suggestion would be in the item 
>> 

[ceph-users] Re: Performance improvement suggestion

2024-02-01 Thread quag...@bol.com.br
 
Hi Janne, thanks for your reply.

I think that it would be good to maintain the number of configured replicas. I don't think it's interesting to decrease to size=1.

However, I think it is not necessary to write to all disks to release the client's request. Replicas could be recorded immediately in a second step.

Nowadays, more and more software implements parallelism for writing through specific libraries. Examples: MPI-IO, HDF5, pnetCDF, etc...

This way, even if the cluster has multiple disks, the objects will be written in PARALLEL. The greater the number of processes recording at the same time, the greater the storage load, regardless of the type of disk used (HDD, SSD or NVMe).

I think it would be very useful to have the initial write done on only one disk and the replicas written after the client is released (asynchronously).

Rafael.
 



De: "Janne Johansson" 
Enviada: 2024/02/01 04:08:05
Para: anthony.da...@gmail.com
Cc:  acozy...@gmail.com, quag...@bol.com.br, ceph-users@ceph.io
Assunto:  Re: [ceph-users] Re: Performance improvement suggestion
 
> I’ve heard conflicting asserts on whether the write returns with min_size shards have been persisted, or all of them.

I think it waits until all replicas have written the data, but from
simplistic tests with fast network and slow drives, the extra time
taken to write many copies is not linear to what it takes to write the
first, so unless you do go min_size=1 (not recommended at all), the
extra copies are not slowing you down as much as you'd expect. At
least not if the other drives are not 100% busy.

I get that this thread started on having one bad drive, and that is
another scenario of course, but having repl=2 or repl=3 is not about
writes taking 100% - 200% more time than the single write, it is less.

--
May the most significant bit of your life be positive.
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-02-01 Thread quag...@bol.com.br
 
 
Hi Anthony,
 Thanks for your reply.

 I didn't say I would accept the risk of losing data.

 I just said that it would be interesting if the objects were first recorded only in the primary OSD.
 This way it would greatly increase performance (both for IOPS and throughput).
 Later (in the background), record the replicas. This situation would avoid leaving users/software waiting for the recording response from all replicas when the storage is overloaded.

 Where I work, performance is very important and we don't have money to make an entire cluster only with NVMe. However, I don't think it's interesting to lose the functionality of the replicas.
 I'm just suggesting another way to increase performance without losing the functionality of replicas.


Rafael.
 


De: "Anthony D'Atri" 
Enviada: 2024/01/31 17:04:08
Para: quag...@bol.com.br
Cc:  ceph-users@ceph.io
Assunto:  Re: [ceph-users] Performance improvement suggestion
 
Would you be willing to accept the risk of data loss?

 

On Jan 31, 2024, at 2:48 PM, quag...@bol.com.br wrote:
 

Hello everybody,
 I would like to make a suggestion for improving performance in Ceph architecture.
 I don't know if this group would be the best place or if my proposal is correct.

 My suggestion would be in the item https://docs.ceph.com/en/latest/architecture/, at the end of the topic "Smart Daemons Enable Hyperscale".

 The Client needs to "wait" for the configured amount of replicas to be written (so that the client receives an ok and continues). This way, if there is slowness on any of the disks on which the PG will be updated, the client is left waiting.
     
 It would be possible:
     
 1-) Only record on the primary OSD
 2-) Write other replicas in background (like the same way as when an OSD fails: "degraded" ).

 This way, client has a faster response when writing to storage: improving latency and performance (throughput and IOPS).
     
 I would find it plausible to accept a period of time (seconds) until all replicas are ok (written asynchronously) at the expense of improving performance.
     
 Could you evaluate this scenario?


Rafael.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




[ceph-users] Re: Performance improvement suggestion

2024-01-31 Thread Janne Johansson
> I’ve heard conflicting asserts on whether the write returns with min_size 
> shards have been persisted, or all of them.

I think it waits until all replicas have written the data, but from
simplistic tests with fast network and slow drives, the extra time
taken to write many copies is not linear to what it takes to write the
first, so unless you do go min_size=1 (not recommended at all), the
extra copies are not slowing you down as much as you'd expect. At
least not if the other drives are not 100% busy.

I get that this thread started on having one bad drive, and that is
another scenario of course, but having repl=2 or repl=3 is not about
writes taking 100% - 200% more time than the single write; it is less.
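
A quick illustration of that point: because the replica writes are dispatched
in parallel, the client-visible latency is roughly the maximum of the
per-replica commit times, not their sum. Toy numbers only, not Ceph code:

# Parallel replication: latency ~ max of replica commits, not the sum.
import random

random.seed(7)

def commit_ms():
    return max(random.gauss(5.0, 1.5), 0.5)   # one drive's commit latency

single, parallel3, serial3 = [], [], []
for _ in range(100_000):
    lats = [commit_ms() for _ in range(3)]
    single.append(lats[0])
    parallel3.append(max(lats))   # how replicated writes actually behave
    serial3.append(sum(lats))     # what "3x the work" would naively suggest

def avg(xs):
    return sum(xs) / len(xs)

print(f"single write:           {avg(single):5.2f} ms")
print(f"3 replicas, parallel:   {avg(parallel3):5.2f} ms")
print(f"3 replicas, serialized: {avg(serial3):5.2f} ms")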

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-01-31 Thread Anthony D'Atri
I’ve heard conflicting asserts on whether the write returns once min_size 
shards have been persisted, or only after all of them have.


> On Jan 31, 2024, at 2:58 PM, Can Özyurt  wrote:
> 
> I never tried this myself but "min_size = 1" should do what you want to 
> achieve.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-01-31 Thread Anthony D'Atri
Would you be willing to accept the risk of data loss?

> On Jan 31, 2024, at 2:48 PM, quag...@bol.com.br wrote:
> 
> Hello everybody,
>  I would like to make a suggestion for improving performance in Ceph 
> architecture.
>  I don't know if this group would be the best place or if my proposal is 
> correct.
> 
>  My suggestion would be in the item 
> https://docs.ceph.com/en/latest/architecture/, at the end of the topic "Smart 
> Daemons Enable Hyperscale".
> 
>  The Client needs to "wait" for the configured amount of replicas to be 
> written (so that the client receives an ok and continues). This way, if there 
> is slowness on any of the disks on which the PG will be updated, the client 
> is left waiting.
>  
>  It would be possible:
>  
>  1-) Only record on the primary OSD
>  2-) Write other replicas in background (like the same way as when an OSD 
> fails: "degraded" ).
> 
>  This way, client has a faster response when writing to storage: 
> improving latency and performance (throughput and IOPS).
>  
>  I would find it plausible to accept a period of time (seconds) until all 
> replicas are ok (written asynchronously) at the expense of improving 
> performance.
>  
>  Could you evaluate this scenario?
> 
> 
> Rafael.
> 
>  ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-01-31 Thread Can Özyurt
I never tried this myself but "min_size = 1" should do what you want to achieve.

On Wed, 31 Jan 2024 at 22:48, quag...@bol.com.br  wrote:
>
> Hello everybody,
>  I would like to make a suggestion for improving performance in Ceph 
> architecture.
>  I don't know if this group would be the best place or if my proposal is 
> correct.
>
>  My suggestion would be in the item 
> https://docs.ceph.com/en/latest/architecture/, at the end of the topic "Smart 
> Daemons Enable Hyperscale".
>
>  The Client needs to "wait" for the configured amount of replicas to be 
> written (so that the client receives an ok and continues). This way, if there 
> is slowness on any of the disks on which the PG will be updated, the client 
> is left waiting.
>
>  It would be possible:
>
>  1-) Only record on the primary OSD
>  2-) Write other replicas in background (like the same way as when an OSD 
> fails: "degraded" ).
>
>  This way, client has a faster response when writing to storage: 
> improving latency and performance (throughput and IOPS).
>
>  I would find it plausible to accept a period of time (seconds) until all 
> replicas are ok (written asynchronously) at the expense of improving 
> performance.
>
>  Could you evaluate this scenario?
>
>
> Rafael.
>
>  ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-01-31 Thread quag...@bol.com.br
Hello everybody,
 I would like to make a suggestion for improving performance in Ceph architecture.
 I don't know if this group would be the best place or if my proposal is correct.

 My suggestion would be in the item https://docs.ceph.com/en/latest/architecture/, at the end of the topic "Smart Daemons Enable Hyperscale".

 The Client needs to "wait" for the configured amount of replicas to be written (so that the client receives an ok and continues). This way, if there is slowness on any of the disks on which the PG will be updated, the client is left waiting.
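
For reference, this is where that wait sits from a client's point of view today; a minimal sketch with the Python rados binding (assumes a reachable cluster, a readable /etc/ceph/ceph.conf and an existing pool named "testpool"; binding details vary slightly between releases). The completion only fires once the write is durable on the whole acting set, which is the wait being discussed here.

# Minimal librados ("rados" Python binding) sketch of today's write ack.
# Assumes /etc/ceph/ceph.conf is readable and a pool named "testpool" exists.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("testpool")

def on_complete(completion):
    # With BlueStore, "complete" means the write is durable on the whole
    # acting set; this callback is the ack the suggestion wants to move earlier.
    print("write acked by the acting set")

comp = ioctx.aio_write_full("object-a", b"payload", oncomplete=on_complete)
comp.wait_for_complete_and_cb()   # block here; async callers could keep working

ioctx.close()
cluster.shutdown()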
     
 It would be possible:
     
 1-) Only record on the primary OSD
 2-) Write other replicas in background (like the same way as when an OSD fails: "degraded" ).

 This way, client has a faster response when writing to storage: improving latency and performance (throughput and IOPS).
     
 I would find it plausible to accept a period of time (seconds) until all replicas are ok (written asynchronously) at the expense of improving performance.
     
 Could you evaluate this scenario?


Rafael.

 ___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io