On 3/12/19 7:31 AM, vita...@yourcmc.ru wrote:
Decreasing the min_alloc size isn't always a win, but it can be in some
cases. Originally bluestore_min_alloc_size_ssd was set to 4096 but we
increased it to 16384 because at the time our metadata path was slow
and increasing it resulted in a pretty significant performance win
(along with increasing the WAL buffers in rocksdb to reduce write
amplification). Since then we've improved the metadata path to the
point where, at least on our test nodes, performance was pretty close
between min_alloc_size = 16k and min_alloc_size = 4k the last time I
looked. It might be a good idea to drop it down to 4k now, but I think
we need to be careful because there are tradeoffs.
I think it's all about your disks' latency. A deferred write is 1
IO+sync and a redirect-write is 2 IOs+syncs. So if your IO or sync is
slow (like it is on HDDs and bad SSDs) then the deferred write is
better in terms of latency. If your IO is fast then you're only
bottlenecked by the OSD code itself eating a lot of CPU, and then the
direct write may be better. By the way, I think the OSD itself is way
TOO slow currently (see below).
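A back-of-the-envelope sketch of that trade-off, with made-up latencies
(illustrative only; the real numbers depend on the device and on the
OSD's own CPU overhead):

    // Illustrative model only, made-up numbers.  On the commit path a
    // deferred write costs one IO + one sync (journal only); a direct
    // (redirect) write costs two IOs + two syncs.  The deferred write
    // still pays a second, background write later, which is mostly a
    // CPU/bandwidth cost rather than a commit-latency cost.
    #include <cstdio>

    int main() {
      double hdd_io = 8000, hdd_sync = 100;   // microseconds, hypothetical
      double ssd_io = 30,   ssd_sync = 20;    // microseconds, hypothetical
      std::printf("HDD commit: deferred ~%.0f us, direct ~%.0f us\n",
                  hdd_io + hdd_sync, 2 * (hdd_io + hdd_sync));
      std::printf("SSD commit: deferred ~%.0f us, direct ~%.0f us\n",
                  ssd_io + ssd_sync, 2 * (ssd_io + ssd_sync));
      // On the HDD the deferred path saves milliseconds per write; on the
      // fast SSD it saves only tens of microseconds, which is easily eaten
      // by the extra background write and the OSD's own CPU cost.
      return 0;
    }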
Don't disagree, bluestore's write path has gotten *really* complicated.
The idea I was talking about turned out to be only viable for HDD/slow
SSDs and only for low iodepths. But the gain is huge - somewhere
between +50% and +100% iops (2x lower latency). There is a stupid
problem in the current bluestore implementation which makes it do two
journal writes and fsyncs instead of one for every incoming
transaction. The details are here: https://tracker.ceph.com/issues/38559
The unnecessary commit is BlueFS's WAL. All it's doing is recording
the increased size of a RocksDB WAL file, which obviously shouldn't be
required with RocksDB because its default setting is
"kTolerateCorruptedTailRecords". However, the WAL is not actually
synced to the disk with every write, because by some clever logic
sync_file_range is called with only SYNC_FILE_RANGE_WRITE in the
corresponding piece of code. Thus the OSD's database gets corrupted
when you kill it with -9, thus it's impossible to set
`bluefs_preextend_wal_files` to true, and thus you get two writes and
two commits instead of one.
I don't know the exact idea behind doing only SYNC_FILE_RANGE_WRITE -
as far as I understand, there is currently no benefit in doing this.
It could be a benefit if RocksDB were writing the journal in small
parts and then doing a single sync - but it always flushes the newly
written part of the journal to disk as a whole.
The simplest way to fix it is just to add SYNC_FILE_RANGE_WAIT_BEFORE
and SYNC_FILE_RANGE_WAIT_AFTER to sync_file_range in KernelDevice.cc.
My pull request is here: https://github.com/ceph/ceph/pull/26909 -
I've tested this change with 13.2.4 Mimic and 14.1.0 Nautilus and yes,
it doubles single-thread iops on HDDs (!). After this change BlueStore
actually becomes better than FileStore, at least on HDDs.
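For reference, a minimal sketch of what the flag change amounts to
(illustrative only, not the actual KernelDevice.cc code;
flush_range_durable is a made-up helper name):

    // Minimal sketch, assuming Linux/glibc.  With only SYNC_FILE_RANGE_WRITE
    // the kernel just *initiates* writeback and returns; adding the two WAIT
    // flags makes the call wait until the dirty pages in the range have been
    // written out.  (It still doesn't flush file metadata or the disk cache -
    // that has to be handled elsewhere.)
    #include <fcntl.h>
    #include <cerrno>

    int flush_range_durable(int fd, off64_t off, off64_t len) {
      int r = ::sync_file_range(fd, off, len,
                                SYNC_FILE_RANGE_WAIT_BEFORE |
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER);
      return r < 0 ? -errno : 0;
    }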
Another way of fixing it would be to add an explicit bdev->flush at
the end of the kv_sync_thread, after db->submit_transaction_sync(),
and possibly remove the redundant sync_file_range altogether. But then
you must do the same in another place, in _txc_state_proc, because it
also sometimes calls submit_transaction_sync(). In the end I
personally think that adding the flags to sync_file_range is better,
because a function named "submit_transaction_sync" should in fact be
SYNC! It shouldn't require additional steps from the caller to make
the data durable.
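Schematically, that alternative would look something like this (a
rough sketch only; the structs are stand-ins for Ceph's KeyValueDB and
BlockDevice, not the real classes):

    // Rough sketch, not real BlueStore code.  Minimal stand-ins:
    struct KVDB {
      void submit_transaction_sync() { /* write + "sync" the RocksDB WAL */ }
    };
    struct BlockDev {
      void flush() { /* fdatasync-style flush of the underlying device */ }
    };

    // Alternative fix: the caller (kv_sync_thread, and also _txc_state_proc)
    // issues an explicit flush after the "sync" submit, instead of relying
    // on sync_file_range to have waited for the data.
    void kv_sync_commit(KVDB &db, BlockDev &bdev) {
      db.submit_transaction_sync();
      bdev.flush();
    }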
I'm glad you are peeking under the covers here. :) There's a lot going
on here, and it's not immediately obvious what the intent is and what
the failure conditions are. I suspect the intent here was to err on
the side of caution, but we really need to document this better. To be
fair it's not just us; there's confusion and terribleness all the way
up to the kernel and beyond.
Also I have a small funny test result to share.
I created one OSD on my laptop on a loop device in a tmpfs (i.e.
RAM), created one RBD image inside it and tested it with `fio
-ioengine=rbd -direct=1 -bs=4k -rw=randwrite`. Before running the test
I turned off CPU power saving with `cpupower idle-set -D 0`.
The results are:
- filestore: 2200 iops with -iodepth=1 (0.454ms average latency). 8500
iops with -iodepth=128.
- bluestore: 1800 iops with -iodepth=1 (0.555ms average latency). 9000
iops with -iodepth=128.
- memstore: 3000 iops with -iodepth=1 (0.333ms average latency). 11000
iops with -iodepth=128.
If we think of memstore as a "minimal possible /dev/null", then:
- OSD overhead is 1/3000 = 0.333ms (maybe slightly less, but that
doesn't matter).
- filestore overhead is 1/2200-1/3000 = 0.121ms
- bluestore overhead is 1/1800-1/3000 = 0.222ms
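Spelled out, that arithmetic is just per-op latency = 1/iops, minus
memstore's per-op latency (a tiny self-check, nothing more):

    // Restating the overhead arithmetic above: per-op latency is 1/iops,
    // and a store's own overhead is its latency minus memstore's.
    #include <cstdio>

    int main() {
      const double memstore  = 1000.0 / 3000;  // ms per op at 3000 iops
      const double filestore = 1000.0 / 2200;  // ms per op at 2200 iops
      const double bluestore = 1000.0 / 1800;  // ms per op at 1800 iops
      std::printf("filestore overhead: %.3f ms\n", filestore - memstore);  // ~0.121
      std::printf("bluestore overhead: %.3f ms\n", bluestore - memstore);  // ~0.222
      return 0;
    }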
The conclusion is that bluestore's own overhead is actually almost TWO
TIMES higher than filestore's in terms of pure latency, and the
throughput is only slightly better. How could that happen? How could a
newly written store become two times slower than the old one? :)
That's pretty annoying...
I bet you'd see better memstore results with my vector-based object
implementation instead of bufferlists. Nick Fisk noticed the same thing
you did. One interesting observation he made was that disabling CPU C/P
states helped bluestore immensely in the iodepth=1 case. I.e., bluestore
does so much in its write path that it's really sensitive to the latency
introduced by C-state transitions. Just more fodder showing that the
bluestore write path is really complicated. I think bluestore was still
the right way to go vs filestore (for a variety of reasons!) but I think
there would be significant benefit to auditing the write path.
Could it be because bluestore is doing a lot of threading? I mean,
could it be because each write operation goes through 5 threads during
its execution (tp_osd_tp -> aio -> kv_sync_thread ->
kv_finalize_thread -> finisher)? Maybe just remove the aio and kv
threads and process all operations directly in tp_osd_tp then?
One way or another we can only have a single thread sending writes to
rocksdb. A lot of the prior optimization work on the write side was to
get as much processing out of the kv_sync_thread as possible. That's
still a worthwhile goal, as it's typically what bottlenecks under high
concurrency. What I think would be very interesting, though, is if we
moved more toward a model where we had lots of shards (OSDs or
shards of an OSD) with independent rocksdb instances and less threading
overhead per shard. That's the way the seastar work is going, and also
sort of the model I've been thinking about for a very simple
single-threaded OSD.
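As a loose illustration of that direction (purely hypothetical
structure - not the seastar/crimson design, just the "one thread and
one KV instance per shard, objects partitioned by hash" shape):

    // Purely illustrative sketch of "many shards, one thread and one KV
    // instance per shard" - not the actual seastar/crimson design.
    #include <condition_variable>
    #include <functional>
    #include <map>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    struct Shard {
      std::map<std::string, std::string> kv;     // stand-in for a private rocksdb
      std::queue<std::function<void()>> work;
      std::mutex m;
      std::condition_variable cv;
      bool stop = false;
      std::thread thr;                           // declared last so it starts
                                                 // after the other members exist
      Shard() : thr([this] { run(); }) {}
      ~Shard() {
        { std::lock_guard<std::mutex> l(m); stop = true; }
        cv.notify_one();
        thr.join();
      }

      // All work for a shard runs on its single thread: no handoff across
      // tp_osd_tp / aio / kv_sync_thread / finisher style thread pools.
      void submit(std::function<void()> fn) {
        { std::lock_guard<std::mutex> l(m); work.push(std::move(fn)); }
        cv.notify_one();
      }

      void run() {
        for (;;) {
          std::function<void()> fn;
          {
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [this] { return stop || !work.empty(); });
            if (work.empty()) return;            // stop requested and queue drained
            fn = std::move(work.front());
            work.pop();
          }
          fn();
        }
      }
    };

    int main() {
      std::vector<Shard> shards(4);              // e.g. one shard per CPU core
      std::hash<std::string> h;
      std::string oid = "rbd_data.1234", val = "payload";
      Shard &s = shards[h(oid) % shards.size()]; // objects partitioned by hash
      s.submit([&s, oid, val] { s.kv[oid] = val; });
      return 0;
    }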
Mark
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com