Dear Alex,

I don't really have a reference for this setup. The ceph documentation 
describes this as the simplest possible setup, and back then it was basically 
dictated by budget. Everything else was several months of experimentation and 
benchmarking. I had scripts running for several weeks just doing rbd bench over 
all sorts of parameter combinations. I was testing for aggregated large 
sequential writes (aggregated bandwidth) and aggregated small random writes 
(aggregated IOP/s).
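
In case it helps, the sweep scripts were essentially variations of the sketch 
below (pool and image names are placeholders, and you would adjust --io-total 
so each run is long enough to reach steady state):

  #!/bin/bash
  # Sweep rbd bench over write sizes and thread counts against a test image,
  # e.g. created beforehand with: rbd create --size 100G rbd-test/bench-img
  for bs in 4K 64K 1M 4M; do
    for threads in 1 4 16 64; do
      echo "io-size=$bs io-threads=$threads"
      rbd bench --io-type write --io-pattern rand \
                --io-size "$bs" --io-threads "$threads" \
                --io-total 10G rbd-test/bench-img
    done
  done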

In my experience, write performance with bluestore OSDs and everything 
collocated is quite good. One of the design goals of bluestore was to provide a 
more constant and predictable throughput and, as far as I can tell, this works 
quite well. We have 150 spindles and I get a sustained aggregated sequential 
write performance of 6GB/s. This is quite good for our purposes, and only very 
few users have managed to saturate this bandwidth.

To be on the safe side, I promise no more than 30MB/s per disk. This is a 
pretty good lower bound and if you put enough disks together, it will add up.
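
If you want to reproduce such aggregated numbers, rados bench run from several 
clients in parallel is the simplest tool I know of; a minimal sketch (pool name 
is a placeholder, default object size is 4MB):

  # Back-of-envelope check: 150 disks x 30 MB/s = 4.5 GB/s guaranteed floor,
  # consistent with the ~6 GB/s we actually measure.
  # Run from several clients at once and add up the reported bandwidth.
  rados bench -p bench-test 120 write -t 16 --no-cleanup
  rados -p bench-test cleanup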

Latency is a different story. There is an extreme skew between write and read. 
When I do a tar-untar test with an archive containing something like 100,000 
small files, the tar is up to 20 times slower than the untar (on ceph fs, clean 
client, MDS and client caches flushed). I didn't test RBD read latency with rbd 
bench.
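
For reference, the tar-untar test is essentially the following (paths are 
placeholders; the MDS cache drop command is only available on newer releases 
and its exact form may differ on yours):

  # flush the client page cache (and the MDS cache, if supported) per run
  sync; echo 3 > /proc/sys/vm/drop_caches
  ceph tell mds.0 cache drop
  cd /mnt/cephfs/testdir
  time tar -cf /tmp/smallfiles.tar ./many-small-files    # read-heavy, slow
  sync; echo 3 > /proc/sys/vm/drop_caches
  time tar -xf /tmp/smallfiles.tar -C ./extract-here     # write-heavy, fast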

I don't have SSDs for WAL/DB, so some statements below this line are a bit 
speculative and based on snippets I picked up in other conversations.

The most noticeable improvement from using SSD for WAL/DB is probably a 
reduction in read latency due to faster DB lookups. The WAL actually has only 
limited influence. If I understand correctly, it only speeds things up as long 
as it is not running full. As soon as the write load is high enough, the 
backing disk becomes the effective bottleneck.
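
For completeness, putting WAL/DB on flash is just extra arguments to 
ceph-volume; a minimal sketch (device names are placeholders, and in practice 
you would carve the flash device into one LV or partition per OSD):

  # data on the HDD, DB on flash; the WAL follows the DB device by default
  ceph-volume lvm create --bluestore \
      --data /dev/sdb \
      --block.db /dev/nvme0n1p1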

I have some anecdotal wisdom that WAL+DB on SSD improves performance by a 
factor of 2. "Performance" here is undefined; it's probably something like 
"general user experience". A factor of 2 is not very tempting given the 
architectural complication. I would rather double the size of my cluster.

For latency- and IOP/s-sensitive applications (KVMs on RBD) we actually went 
for all-flash running on cheap MICRON PRO SSDs, which are QLC and can drop in 
bandwidth down to 80MB/s. However, I have never seen more than 5MB/s per disk 
with this workload. KVMs on RBD are really IOP/s-intensive and bandwidth is 
secondary. The MICRON PRO disks already provide very good IOP/s per TB with a 
single OSD per disk. My benchmarks show that running 2 OSDs per disk doubles 
that and running 4 OSDs per disk saturates the spec performance of the disk. 
For the number of VMs we run per SSD, these disks are completely sufficient, so 
I can live with the single-OSD-per-disk deployment. This all-flash setup is 
simpler and probably also cheaper than a hybrid OSD setup. I think Kingston 
SSDs are a bit more expensive but equally suitable. Make sure you disable the 
volatile write cache.
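
To make the last two points concrete, a rough sketch (device names are 
placeholders; whether hdparm or sdparm applies depends on the drive, and the 
setting may not survive a power cycle, so put it in a udev rule or boot script):

  # disable the volatile write cache on SATA drives
  hdparm -W 0 /dev/sdc
  # ... or on SAS drives
  sdparm --clear WCE --save /dev/sdc

  # more than one OSD per flash device is a single ceph-volume flag
  ceph-volume lvm batch --osds-per-device 2 /dev/sdc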

In my experience, ceph fs with the meta data pool on SSD and the data pool on 
HDD only is fine. For RBD backing VMs, all-flash with high single-thread IOP/s 
SSDs works really well for us. You need to test this though; many SSDs have 
surprisingly poor single-thread performance compared with their specs.
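
A quick way to check single-thread behaviour is a sync write test with fio 
against the raw device; a minimal sketch (device name is a placeholder, and 
this destroys data on the target, so use a scratch disk):

  fio --name=single-thread-sync-write \
      --filename=/dev/sdc --direct=1 --sync=1 \
      --rw=write --bs=4k --iodepth=1 --numjobs=1 \
      --runtime=60 --time_based
  # Drives that keep their IOP/s up at queue depth 1 are the ones that work
  # well for this workload; many consumer models do not.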

For the future, I am considering LVM with dm-cache on SSD. This sounds a bit 
more flexible than the WAL+DB approach and should also reduce read latency.
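
Untested on my side, but such a setup would look roughly like the sketch below 
(volume group, LV and device names are placeholders, and the exact lvconvert 
syntax depends on your LVM version):

  # add the SSD to the volume group holding the OSD data LV
  pvcreate /dev/nvme0n1
  vgextend vg_osd /dev/nvme0n1
  # create a cache volume on the SSD and attach it to the data LV
  lvcreate -n osd_cache -L 200G vg_osd /dev/nvme0n1
  lvconvert --type cache --cachevol osd_cache \
            --cachemode writethrough vg_osd/osd_data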

Finally, yes, on our ceph fs we will accumulate a lot of cold data. It's going 
to be the dump yard. This means we will eventually get really good performance 
for the small amount of warm/hot data once the cluster grows enough.

Hope that answered your questions.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Alex Gorbachev <a...@iss-integration.com>
Sent: 04 May 2020 04:21
To: Frank Schilder
Cc: David; ceph-users
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

Hi Frank,

Reviving this old thread to ask whether the performance on these raw NL-SAS 
drives is adequate. I was wondering whether this is a deep archive with almost 
no retrieval, and how many drives are used? In my experience with large 
parallel writes, WAL/DB on SSD with bluestore, or journal drives on SSD with 
filestore, have always been needed to sustain a reasonably consistent transfer 
rate. I would very much appreciate any reference info on your design.

Best regards,
Alex

On Mon, Jul 8, 2019 at 4:30 AM Frank Schilder 
<fr...@dtu.dk> wrote:
Hi David,

I'm running a cluster with bluestore on raw devices (no lvm) and all journals 
collocated on the same disk as the data. Disks are spinning NL-SAS. Our goal 
was to build storage at the lowest cost, therefore all data is on HDD only. I 
have a few SSDs that I'm using for FS and RBD meta data. All large pools are EC 
on spinning disk.

I spent at least one month running detailed benchmarks (rbd bench), varying EC 
profile, object size, write size, etc. Results varied a lot. My advice would be 
to run benchmarks on your own hardware. If there were a single perfect choice, 
there wouldn't be so many options. For example, my tests will not be valid when 
using separate fast disks for WAL and DB.

There are some results though that might be valid in general:

1) EC pools have high throughput but low IOP/s compared with replicated pools

I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which is 
probably the network limit and not the disk limit. IOP/s get better with more 
disks, but are way lower than what replicated pools can provide. On a cephfs 
with an EC data pool, small-file IO will be comparatively slow and eat a lot of 
resources.

2) I observe massive network traffic amplification for small IO sizes, which is 
due to the way EC overwrites are handled. This is one bottleneck for IOP/s. We 
have 10G infrastructure and use a 2x10G client and a 4x10G OSD network. The OSD 
network should have at least 2x the client network bandwidth, better 4x or more.

3) k should only have small prime factors, a power of 2 if possible

I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All other 
choices were poor. The value of m seems not to be relevant for performance. 
Larger k will require more failure domains (more hardware). Example commands 
for this and the next two observations follow after the list.

4) object size matters

The best throughput (with 1M write size) I see with object sizes of 4MB or 8MB, 
with IOP/s getting somewhat better with smaller object sizes but throughput 
dropping fast. I use the default of 4MB in production. Works well for us.

5) jerasure is quite good and seems most flexible

jerasure is quite CPU-efficient and can handle smaller chunk sizes than other 
plugins, which is preferable for IOP/s. However, CPU usage can become a 
problem, and a plugin optimized for specific values of k and m might help here. 
Under usual circumstances I see very low load on all OSD hosts, even under 
rebalancing. However, I remember that once I needed to rebuild something on all 
OSDs (I don't remember what it was, sorry). In this situation, CPU load went up 
to 30-50% (meaning up to half the cores were at 100%), which is really high 
considering that each server has only 16 disks at the moment and is sized to 
handle up to 100. CPU power could become a bottleneck for us in the future.
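
To make observations 3) to 5) concrete, creating such a profile and a test 
pool looks roughly like this (pool names and PG counts are placeholders):

  # EC profile: jerasure, k=8, m=2, failure domain host
  ceph osd erasure-code-profile set ec82 \
      plugin=jerasure k=8 m=2 crush-failure-domain=host
  ceph osd pool create ec-test 256 256 erasure ec82
  # required for RBD images or cephfs data on EC pools
  ceph osd pool set ec-test allow_ec_overwrites true
  # the object size is a property of the RBD image (default 4M); the image
  # header lives in a replicated pool, the data goes to the EC pool
  rbd create --size 100G --data-pool ec-test \
      --object-size 4M rbd-replicated/test-img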

These are some general observations and do not replace benchmarks for specific 
use cases. I was hunting for a specific performance pattern, which might not be 
what you want to optimize for. I would recommend running extensive benchmarks 
if you have to live with a configuration for a long time - EC profiles cannot 
be changed after pool creation.

We settled on 8+2 and 6+2 pools with jerasure and an object size of 4M. We also 
use bluestore compression. All meta data pools are on SSD; only very little SSD 
space is required. This choice works well for the majority of our use cases. We 
can still build small, expensive pools to accommodate special performance 
requests.
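
For completeness, enabling compression and pinning the meta data pools to SSD 
is only a few commands; a rough sketch (rule and pool names are placeholders):

  # bluestore compression on the EC data pool
  ceph osd pool set ec-data compression_mode aggressive
  ceph osd pool set ec-data compression_algorithm snappy
  # CRUSH rule that keeps replicated meta data pools on SSD OSDs
  ceph osd crush rule create-replicated ssd-rule default host ssd
  ceph osd pool set cephfs-metadata crush_rule ssd-rule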

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: ceph-users <ceph-users-boun...@lists.ceph.com> 
on behalf of David <xiaomajia...@gmail.com>
Sent: 07 July 2019 20:01:18
To: ceph-us...@lists.ceph.com
Subject: [ceph-users]  What's the best practice for Erasure Coding

Hi Ceph-Users,

I'm working with a Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).
Recently, I have been trying to use an Erasure Code pool.
My question is: what's the best practice for using EC pools?
More specifically, which plugin (jerasure, isa, lrc, shec or clay) should I 
adopt, and how should I choose the combination of (k,m) (e.g. (k=3,m=2), 
(k=6,m=3))?

Could anyone share some experience?

Thanks for any help.

Regards,
David

