I have no idea how you get 66K write IOPS with one OSD )

I've just repeated the test by creating a test pool on one NVMe OSD with 8 PGs 
(all pinned to the same OSD with pg-upmap). Then I ran 4x fio randwrite with 
queue depth 128 over 4 RBD images. I got 17K IOPS.
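
Roughly, that setup as commands (pool/image names, the pool id in the PG ids, 
and the OSD numbers are illustrative, not exact):

  # test pool with 8 PGs, all pinned to osd.0 via pg-upmap
  ceph osd pool create testpool 8 8 replicated
  rbd pool init testpool
  ceph osd pool set testpool size 1   # recent releases also want --yes-i-really-mean-it
  ceph osd pg-upmap-items 2.0 3 0     # repeat for each PG, remapping its OSD to osd.0

  # four images, one fio process per image, run in parallel
  rbd create testpool/img1 --size 100G
  fio --name=rbd-rw --ioengine=rbd --clientname=admin --pool=testpool \
      --rbdname=img1 --rw=randwrite --bs=4k --iodepth=128 --direct=1 \
      --time_based --runtime=300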

OK, in fact that's not the worst result for Ceph, but the problem is that I only 
get 30K write IOPS when benchmarking 4 RBD images spread over all OSDs 
_in_the_same_cluster_. And there are 14 OSDs in it.

>> I've just finished doing our own benchmarking, and I can say that what
>> you want to do is very unbalanced and CPU-bound.
>> 
>> 1. Ceph consumes a LOT of CPU. My peak value was around 500% CPU per
>> ceph-osd at top performance (see the recent thread on 'ceph on brd'),
>> with more realistic numbers around 300-400% CPU per device.
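
(For anyone who wants to reproduce that measurement: per-daemon CPU can be
sampled with something along these lines, where 500% means five cores' worth.)

  # per-process CPU utilization of all ceph-osd daemons, sampled every second
  pidstat -u -C ceph-osd 1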
>>> In fact in isolation on the test setup that Intel donated for
>>> community ceph R&D we've pushed a single OSD to consume around 1400%
>>> CPU at 80K write IOPS! :)  I agree though, we typically see a peak of
>>> about 500-600% CPU per OSD on multi-node clusters with a
>>> correspondingly lower write throughput.  I do believe that in some
>>> cases the mix of IO we are doing is causing us to at least be
>>> partially bound by disk write latency with the single writer thread
>>> in the rocksdb WAL though.
>> 
>> I'd really like to see how they did this without offloading (i.e. their
>> configuration).
> 
> I went back and looked over some of the old results. I didn't find the
> really high test scores (and now that I'm thinking about it they may
> have been from when I was ripping out pglog OMAP updates!), but here's
> one example I did find from earlier testing last winter that at least
> got into roughly the right ballpark with stock master from last December
> (~66K IOPS):
> 
> Avg 4K FIO randwrite IOPS: 65841.7
> 
> - 1 p4510 NVMe backed OSD
> 
> - 8GB osd memory target
> 
> - 4K min alloc size
> 
> - 4 clients, 1 128GB RBD volume per client, io_depth=128, time=300s
> 
> - 128 PGs (fixed)
> 
> - latency-network tuned profile
> 
> - bluestore_rocksdb_options = "compression=kNoCompression,max_total_wal_size=1073741824,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=4,write_buffer_size=67108864,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,compaction_style=kCompactionStyleUniversal"
> 
> - bluestore_default_buffered_write = true
> 
> - bluestore_default_buffered_read = true
> 
> - rbd cache = false
> 
> Beyond that, general stuff like background scrubbing and PG autoscaling
> was disabled.  I should note that these results use universal
> compaction in rocksdb, which you probably don't want to do in
> production because it can require 2x the total DB space to perform a
> compaction.  It might actually be feasible now that we are doing
> column family sharding thanks to Adam's PR, because you will only need
> 2x the space of any individual column family for compaction rather
> than the whole DB, but it's still unsupported for now.
> 
> Mark
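
(For convenience, here are those non-default settings transcribed as a
ceph.conf fragment. This is my reading of Mark's list, not his actual file;
in particular, the min_alloc_size option name assumes an SSD-class device.)

  [osd]
  osd_memory_target = 8589934592    # 8 GB
  bluestore_min_alloc_size_ssd = 4096
  bluestore_default_buffered_write = true
  bluestore_default_buffered_read = true
  bluestore_rocksdb_options = compression=kNoCompression,max_total_wal_size=1073741824,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=4,write_buffer_size=67108864,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,compaction_style=kCompactionStyleUniversal

  [client]
  rbd_cache = false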
> 
>>> 
>> 
>> 2. Ceph is unable to deliver more than 12K IOPS per ceph-osd (maybe
>> a little more with a top-tier low-core-count high-frequency CPU, but
>> not much), so a super-duper NVMe won't make a difference. (BTW, I have
>> a stupid idea to try running two ceph-osd daemons on LVs from the same
>> VG with a single PV underneath, but I haven't tested it.)
>>> I'm curious if you've tried Octopus+ yet?  We refactored bluestore's
>>> caches, which internally has proven to help quite a bit with
>>> latency-bound workloads, as it reduces lock contention in the onode
>>> cache shards and the impact of cache trimming (no more single trim
>>> thread constantly grabbing the lock for long periods of time!).  In
>>> a 64 NVMe drive setup (P4510s), we were able to do a little north of
>>> 400K write IOPS with 3x replication, so about 19K IOPS per OSD once
>>> you factor rep in.  Also, in Nautilus you can see real benefits with
>>> running multiple OSDs on a single device, but with Octopus and master
>>> we've pretty much closed the gap on our test setup:
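
(For reference, multiple OSDs per device are usually deployed with
ceph-volume's batch mode; the device path is illustrative.)

  # two OSDs carved out of one NVMe device via LVM
  ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1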
>> 
>> It's Octopus. I was doing a single-OSD benchmark, removing all moving
>> parts (brd instead of NVMe, no network, size=1, etc.). Moreover, I
>> focused on rados bench, as RBD performance is just a derivative of
>> RADOS performance.
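
(A rados-level run of that kind would look something like this; pool name
and numbers are illustrative.)

  # 4 KiB writes, 128 in flight, 60 seconds, against a test pool
  rados bench -p testpool 60 write -b 4096 -t 128 --no-cleanup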
>> 
>> Anyway, a big thank you for the input.
>> 
>>> https://docs.google.com/spreadsheets/d/1e5eTeHdZnSizoY6AUjH0knb4jTCW7KMU4RoryLX9EHQ/edit?usp=sharing
>>> 
>>> Generally speaking, using the latency-performance or latency-network
>>> tuned profiles helps (mostly by avoiding C-state CPU transitions), as
>>> do higher clock speeds.  Not using replication helps too, but that's
>>> obviously not a realistic solution for most people. :)
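
(Switching profiles is a one-liner with tuned, e.g.:)

  # apply the low-latency profile on each OSD node
  tuned-adm profile latency-performance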
>> 
>> I used size=1 and 'no SSD, no network' as an upper bound. It allows
>> finding the limits of ceph-osd performance. Any real-life factors
>> (replication, network, real block devices) will make things worse, not
>> better. Knowing the upper performance bound is really helpful when you
>> start choosing a server configuration.
>> 
>>> 
>> 
>> 3. You will find that the performance of any given client is heavily
>> limited by the sum of all RTTs in the network plus Ceph's own
>> latencies, so very fast NVMe gives diminishing returns.
>> 4. A CPU-bound ceph-osd completely wipes out any differences between
>> underlying devices (except for desktop-class crawlers).
>> 
>> You can run your own tests, even without fancy 48-NVMe boxes - just
>> run ceph-osd on brd (the block RAM disk driver); see the sketch below.
>> ceph-osd won't run any faster on anything else (a ramdisk is as fast
>> as it gets), so the numbers you get from brd are a supremum (upper
>> bound) on theoretical performance.
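
(A minimal sketch of such a run, assuming the brd module and a throwaway
OSD; the sizes are illustrative.)

  # 16 GiB RAM-backed block device (rd_size is in KiB)
  modprobe brd rd_nr=1 rd_size=16777216
  # build a throwaway OSD on it, then benchmark with rados bench / fio as usual
  ceph-volume lvm create --data /dev/ram0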
>> 
>> Given a max of 400-500% CPU per ceph-osd, I'd say you need to keep
>> the number of NVMe drives per server below 12, or maybe 15 (but then
>> you'll sometimes hit CPU saturation).
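
(Rough arithmetic behind that number, assuming a typical dual-socket node
with 64 hardware threads:)

  64 threads / ~5 cores per ceph-osd ≈ 12 OSDs before the box saturates,
  which leaves little headroom for the OS, network stack, or recovery.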
>> 
>> In my opinion, less fancy boxes with a smaller number of drives per
>> server (but a larger number of servers) would make your (or your
>> operations team's) life much less stressful.
>>> That's pretty much the advice I've been giving people since the
>>> Inktank days.  It costs more and is lower density, but the design is
>>> simpler, you are less likely to under-provision CPU, less likely to
>>> run into memory bandwidth bottlenecks, and you have less recovery to
>>> do when a node fails.  Especially now with how many NVMe drives you
>>> can fit in a single 1U server!
>> 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
