Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
On Mon, Sep 19, 2016 at 12:38:05PM +0200, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
> > Having a 1:1 already seemed like the ideal solution since you can't
> > simultaneously utilize more than that from the host, so there are no
> > more h/w parallelisms we can exploit. On the controller side, fetching
> > commands is serialized memory reads, so I don't think spreading IO
> > among more h/w queues helps the target over posting more commands to
> > a single queue.
>
> I take note of the un-ordered command completion you described below.
> But I fail to realize why a CPU would not simultaneously utilize more
> than one queue by posting to multiple. Is it due to nvme specifics, or
> do you assume the host would not issue that many commands?

What I mean is that if you have N CPUs, you can't possibly write more
than N submission queue entries simultaneously. The benefit of a 1:1
queue <-> CPU mapping is that each CPU can post a command to its queue
without lock contention at the same time as another thread. Having more
queues to choose from doesn't let the host post commands any faster than
we can today.

When we're out of tags, the request currently just waits for one to
become available, increasing submission latency. You can fix that by
increasing the available tags with deeper or more h/w queues, but that
just increases completion latency since the device can't process them
any faster. It's six of one, half a dozen of the other.

The depth per queue defaults to 1k. If your process really is able to
use all those resources, the hardware is completely saturated and you're
not going to benefit from introducing more tags [1]. It could
conceivably be worse by reducing cache hits, or could hit inappropriate
timeout handling with the increased completion latency.

[1] http://lists.infradead.org/pipermail/linux-nvme/2014-July/001064.html
Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
On 09/19/16 03:38, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
>
> CC-ing linux-bl...@vger.kernel.org
>
>> I'm not sure I see how this helps. That probably means I'm not
>> considering the right scenario. Could you elaborate on when having
>> multiple hardware queues to choose from a given CPU will provide a
>> benefit?
>
> No, I do not have any particular scenario in mind besides common
> sense. Just an assumption that deeper queues are better (in this RFC,
> a virtual combined queue consisting of multiple h/w queues).
>
> Apparently, there could be positive effects only in systems where
> # of queues / # of CPUs > 1 or # of queues / # of cores > 1. But I do
> not happen to have such systems. If I had numbers, this would not be
> an RFC and I probably would not have posted it in the first place ;)
>
> Would it be possible to give it a try on your hardware?

Hello Alexander,

It is your task to measure the performance impact of these patches, not
Keith's.

BTW, I'm not convinced that multiple hardware queues per CPU will result
in a performance improvement. I have not yet seen any SSD for which a
queue depth above 512 results in better performance than a queue depth
of 512. Which applications do you think will generate and sustain a
queue depth above 512? Additionally, my experience from another
high-performance context (RDMA) is that reducing the number of queues
can result in higher IOPS due to fewer interrupts per I/O.

Bart.
Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:

CC-ing linux-bl...@vger.kernel.org

> I'm not sure I see how this helps. That probably means I'm not
> considering the right scenario. Could you elaborate on when having
> multiple hardware queues to choose from a given CPU will provide a
> benefit?

No, I do not have any particular scenario in mind besides common sense.
Just an assumption that deeper queues are better (in this RFC, a virtual
combined queue consisting of multiple h/w queues).

Apparently, there could be positive effects only in systems where
# of queues / # of CPUs > 1 or # of queues / # of cores > 1. But I do
not happen to have such systems. If I had numbers, this would not be an
RFC and I probably would not have posted it in the first place ;)

Would it be possible to give it a try on your hardware?

> If we're out of available h/w tags, having more queues shouldn't
> improve performance. The tag depth on each nvme hw context is already
> deep enough that it should mean even one full queue has saturated the
> device capabilities.

Am I getting you right: does a single full nvme hardware queue make the
other queues stall?

> Having a 1:1 already seemed like the ideal solution since you can't
> simultaneously utilize more than that from the host, so there are no
> more h/w parallelisms we can exploit. On the controller side, fetching
> commands is serialized memory reads, so I don't think spreading IO
> among more h/w queues helps the target over posting more commands to
> a single queue.

I take note of the un-ordered command completion you described below.
But I fail to realize why a CPU would not simultaneously utilize more
than one queue by posting to multiple. Is it due to nvme specifics, or
do you assume the host would not issue that many commands?

Besides, blk-mq-tag re-uses the most recently freed tag, so IO should
not actually get spread. Instead, only if the currently used hardware
queue is full is the next available queue chosen.

But this is speculation without real benchmarks, of course.

> If a CPU has more than one to choose from, a command sent to a less
> used queue would be serviced ahead of previously issued commands on a
> more heavily used one from the same CPU thread due to how NVMe command
> arbitration works, so it sounds like this would create odd latency
> outliers.

Yep, that sounds scary indeed. Still, any hints on benchmarking are
welcome.

Many thanks!

> Thanks,
> Keith
Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
On Fri, Sep 16, 2016 at 10:51:11AM +0200, Alexander Gordeev wrote:
> The Linux block device layer limits the number of hardware context
> queues to the number of CPUs in the system. That looks like suboptimal
> hardware utilization in systems where the number of CPUs is
> (significantly) less than the number of hardware queues.
>
> In addition, there is a need to deal with tag starvation (see commit
> 0d2602ca "blk-mq: improve support for shared tags maps"). While unused
> hardware queues stay idle, extra efforts are taken to maintain a
> notion of fairness between queue users. A deeper queue depth could
> probably mitigate the whole issue sometimes.
>
> That all brings a straightforward idea that hardware queues provided
> by a device should be utilized as much as possible.

Hi Alex,

I'm not sure I see how this helps. That probably means I'm not
considering the right scenario. Could you elaborate on when having
multiple hardware queues to choose from a given CPU will provide a
benefit?

If we're out of available h/w tags, having more queues shouldn't improve
performance. The tag depth on each nvme hw context is already deep
enough that it should mean even one full queue has saturated the device
capabilities.

Having a 1:1 already seemed like the ideal solution since you can't
simultaneously utilize more than that from the host, so there are no
more h/w parallelisms we can exploit. On the controller side, fetching
commands is serialized memory reads, so I don't think spreading IO among
more h/w queues helps the target over posting more commands to a single
queue.

If a CPU has more than one to choose from, a command sent to a less used
queue would be serviced ahead of previously issued commands on a more
heavily used one from the same CPU thread due to how NVMe command
arbitration works, so it sounds like this would create odd latency
outliers.

Thanks,
Keith
Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
On Fri, Sep 16, 2016 at 02:27:43AM -0700, Christoph Hellwig wrote:
> Hi Alex,
>
> this clashes badly with my queue mapping rework that went into
> Jens' tree recently.

Yeah, I'm fully aware the RFC-marked patches would clash with your work.
I will surely rework them if the proposal is considered worthwhile.

> But in the meantime: there seem to be lots of little bugfixes and
> cleanups in the series, any chance you could send them out as a first
> series while updating the rest?

[He-he :) I can even see you removed map_queue() as well]

I will rebase the cleanups on top of your tree.

> Also please Cc the linux-block list for block layer patches that don't
> even seem to touch the nvme driver.

Just wanted to let the NVMe people know, as this hardware is presumably
the main beneficiary.

Thanks!
Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
Hi Alex,

this clashes badly with my queue mapping rework that went into Jens'
tree recently.

But in the meantime: there seem to be lots of little bugfixes and
cleanups in the series, any chance you could send them out as a first
series while updating the rest?

Also please Cc the linux-block list for block layer patches that don't
even seem to touch the nvme driver.
[PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
The Linux block device layer limits the number of hardware context
queues to the number of CPUs in the system. That looks like suboptimal
hardware utilization in systems where the number of CPUs is
(significantly) less than the number of hardware queues.

In addition, there is a need to deal with tag starvation (see commit
0d2602ca "blk-mq: improve support for shared tags maps"). While unused
hardware queues stay idle, extra efforts are taken to maintain a notion
of fairness between queue users. A deeper queue depth could probably
mitigate the whole issue sometimes.

That all brings a straightforward idea that hardware queues provided by
a device should be utilized as much as possible.

This series is an attempt to introduce a 1:N mapping between CPUs and
hardware queues. The code is experimental, and hence some checks and
sysfs interfaces are withdrawn where they block the demo implementation.

The implementation evenly distributes hardware queues among CPUs, with
moderate changes to the existing codebase. Further developments of the
design are possible if needed, e.g. complete device utilization, CPU
and/or interrupt topology-driven queue distribution, or workload-driven
queue redistribution.

Comments and suggestions are very welcome!

The series is against the linux-block tree.

Thanks!
CC: Jens Axboe
CC: linux-n...@lists.infradead.org

Alexander Gordeev (21):
  blk-mq: Fix memory leaks on a queue cleanup
  blk-mq: Fix a potential NULL pointer assignment to hctx tags
  block: Get rid of unused request_queue::nr_queues member
  blk-mq: Do not limit number of queues to 'nr_cpu_ids' in allocations
  blk-mq: Update hardware queue map after q->nr_hw_queues is set
  block: Remove redundant blk_mq_ops::map_queue() interface
  blk-mq: Remove a redundant assignment
  blk-mq: Cleanup hardware context data node selection
  blk-mq: Cleanup a loop exit condition
  blk-mq: Get rid of unnecessary blk_mq_free_hw_queues()
  blk-mq: Move duplicating code to blk_mq_exit_hctx()
  blk-mq: Uninit hardware context in order reverse to init
  blk-mq: Move hardware context init code into blk_mq_init_hctx()
  blk-mq: Rework blk_mq_init_hctx() function
  blk-mq: Pair blk_mq_hctx_kobj_init() with blk_mq_hctx_kobj_put()
  blk-mq: Set flush_start_tag to BLK_MQ_MAX_DEPTH
  blk-mq: Introduce a 1:N hardware contexts
  blk-mq: Enable tag numbers exceed hardware queue depth
  blk-mq: Enable combined hardware queues
  blk-mq: Allow combined hardware queues
  null_blk: Do not limit # of hardware queues to # of CPUs

 block/blk-core.c                  |   5 +-
 block/blk-flush.c                 |   6 +-
 block/blk-mq-cpumap.c             |  49 +++--
 block/blk-mq-sysfs.c              |   5 +
 block/blk-mq-tag.c                |   9 +-
 block/blk-mq.c                    | 373 +++---
 block/blk-mq.h                    |   4 +-
 block/blk.h                       |   2 +-
 drivers/block/loop.c              |   3 +-
 drivers/block/mtip32xx/mtip32xx.c |   4 +-
 drivers/block/null_blk.c          |  16 +-
 drivers/block/rbd.c               |   3 +-
 drivers/block/virtio_blk.c        |   6 +-
 drivers/block/xen-blkfront.c      |   6 +-
 drivers/md/dm-rq.c                |   4 +-
 drivers/mtd/ubi/block.c           |   1 -
 drivers/nvme/host/pci.c           |  29 +--
 drivers/nvme/host/rdma.c          |   2 -
 drivers/nvme/target/loop.c        |   2 -
 drivers/scsi/scsi_lib.c           |   4 +-
 include/linux/blk-mq.h            |  51 --
 include/linux/blkdev.h            |   1 -
 22 files changed, 279 insertions(+), 306 deletions(-)

-- 
1.8.3.1