Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues

2016-09-20 Thread Keith Busch
On Mon, Sep 19, 2016 at 12:38:05PM +0200, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
> 
> > Having a 1:1 mapping already seemed like the ideal solution since you can't
> > simultaneously utilize more than that from the host, so there's no more
> > h/w parallelism we can exploit. On the controller side, fetching
> > commands is serialized memory reads, so I don't think spreading IO
> > among more h/w queues helps the target over posting more commands to a
> > single queue.
> 
> I take note of the un-ordered command completion you described below.
> But I fail to see why a CPU could not simultaneously utilize more
> than one queue by posting to multiple queues. Is it due to nvme
> specifics, or do you assume the host would not issue that many commands?

What I mean is that if you have N CPUs, you can't possibly simultaneously
write more than N submission queue entries. The benefit of having 1:1
for the queue <-> CPU mapping is that each CPU can post a command to
its queue without lock contention at the same time as another thread.
Having more to choose from doesn't let the host post commands any faster
than we can today.
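
As a rough userspace sketch of that point (toy code, not the actual
blk-mq or nvme implementation; all names below are made up for
illustration): with a 1:1 mapping each CPU is the only writer of its
own submission ring, whereas a shared queue needs a lock around the
tail update.

/* Toy model of why a 1:1 CPU <-> queue mapping avoids lock contention.
 * Illustrative names and types only, not the blk-mq ones. */
#include <pthread.h>
#include <stdint.h>

#define NR_CPUS	8
#define Q_DEPTH	1024

struct sq {
	pthread_mutex_t lock;		/* only needed when the queue is shared */
	uint32_t tail;
	uint64_t slot[Q_DEPTH];
};

static struct sq per_cpu_sq[NR_CPUS];	/* 1:1 - one queue per CPU */
static struct sq shared_sq = {		/* N:1 - all CPUs share one queue */
	.lock = PTHREAD_MUTEX_INITIALIZER,
};

/* 1:1 mapping: the submitting CPU is the only writer of its own queue,
 * so the tail advances without taking any lock. */
static void submit_1to1(unsigned int cpu, uint64_t cmd)
{
	struct sq *q = &per_cpu_sq[cpu];

	q->slot[q->tail % Q_DEPTH] = cmd;
	q->tail++;			/* real h/w: followed by a doorbell write */
}

/* Shared queue: every submitting CPU serializes on the same lock. */
static void submit_shared(uint64_t cmd)
{
	pthread_mutex_lock(&shared_sq.lock);
	shared_sq.slot[shared_sq.tail % Q_DEPTH] = cmd;
	shared_sq.tail++;
	pthread_mutex_unlock(&shared_sq.lock);
}

int main(void)
{
	submit_1to1(0, 0xabcd);		/* no cross-CPU contention possible */
	submit_shared(0xabcd);		/* contends with every other CPU */
	return 0;
}

With the 1:1 mapping the contention disappears entirely rather than
merely being spread out, which is why adding more queues per CPU does
not make submission any faster.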

When we're out of tags, the request currently just waits for one to
become available, increasing submission latency. You can fix that by
increasing the available tags with deeper or more h/w queues, but that
just increases completion latency since the device can't process them
any faster. It's six of one, half dozen of the other.
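
To put rough numbers on that tradeoff (the 750K IOPS ceiling below is
an assumed example figure, not a measurement): by Little's law the
average completion latency is simply the number of outstanding commands
divided by the device throughput, so once the device is saturated,
extra tags only stretch the latency.

/* Little's law back-of-the-envelope: latency = outstanding / throughput.
 * The 750K IOPS ceiling is an assumed example value. */
#include <stdio.h>

int main(void)
{
	const double iops = 750000.0;			/* assumed device ceiling */
	const unsigned int depths[] = { 1024, 2048, 4096 };

	for (unsigned int i = 0; i < 3; i++)
		printf("outstanding %4u -> avg completion latency ~%.2f ms\n",
		       depths[i], depths[i] / iops * 1000.0);
	return 0;
}

Doubling the outstanding commands doubles the average completion
latency while IOPS stay flat - hence six of one, half dozen of the
other.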

The depth per queue defaults to 1k. If your process really is able to use
all those resources, the hardware is completely saturated and you're not
going to benefit from introducing more tags [1]. It could conceivably
make things worse by reducing cache hits, or by hitting inappropriate
timeout handling due to the increased completion latency.

 [1] http://lists.infradead.org/pipermail/linux-nvme/2014-July/001064.html


Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues

2016-09-19 Thread Bart Van Assche
On 09/19/16 03:38, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
>
> CC-ing linux-bl...@vger.kernel.org
>
>> I'm not sure I see how this helps. That probably means I'm not considering
>> the right scenario. Could you elaborate on when having multiple hardware
>> queues to choose from for a given CPU will provide a benefit?
>
> No, I do not have any particular scenario in mind besides common
> sense. It is just an assumption that deeper queues are better (in this
> RFC a virtual combined queue consisting of multiple h/w queues).
>
> Apparently, there could be positive effects only in systems where
> # of queues / # of CPUs > 1 or # of queues / # of cores > 1, but I do
> not happen to have such systems. If I had numbers this would not be
> an RFC and I probably would not have posted it in the first place ;)
>
> Would it be possible to give it a try on your hardware?

Hello Alexander,

It is your task to measure the performance impact of these patches and 
not Keith's task. BTW, I'm not convinced that multiple hardware queues 
per CPU will result in a performance improvement. I have not yet seen 
any SSD for which a queue depth above 512 results in better performance 
than queue depth equal to 512. Which applications do you think will 
generate and sustain a queue depth above 512? Additionally, my 
experience from another high performance context (RDMA) is that reducing 
the number of queues can result in higher IOPS due to fewer interrupts 
per I/O.

Bart.


Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues

2016-09-19 Thread Alexander Gordeev
On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:

CC-ing linux-bl...@vger.kernel.org

> I'm not sure I see how this helps. That probably means I'm not considering
> the right scenario. Could you elaborate on when having multiple hardware
> queues to choose from for a given CPU will provide a benefit?

No, I do not have any particular scenario in mind besides common
sense. It is just an assumption that deeper queues are better (in this
RFC a virtual combined queue consisting of multiple h/w queues).

Apparently, there could be positive effects only in systems where
# of queues / # of CPUs > 1 or # of queues / # of cores > 1, but I do
not happen to have such systems. If I had numbers this would not be
an RFC and I probably would not have posted it in the first place ;)

Would it be possible to give it a try on your hardware?

> If we're out of available h/w tags, having more queues shouldn't
> improve performance. The tag depth on each nvme hw context is already
> deep enough that it should mean even one full queue has saturated the
> device capabilities.

Am I getting you right - does a single full nvme hardware queue stall
the other queues?

> Having a 1:1 mapping already seemed like the ideal solution since you can't
> simultaneously utilize more than that from the host, so there's no more
> h/w parallelism we can exploit. On the controller side, fetching
> commands is serialized memory reads, so I don't think spreading IO
> among more h/w queues helps the target over posting more commands to a
> single queue.

I take note of the un-ordered command completion you described below.
But I fail to see why a CPU could not simultaneously utilize more
than one queue by posting to multiple queues. Is it due to nvme
specifics, or do you assume the host would not issue that many commands?

Besides, blk-mq-tag re-uses the most recently freed tag, so IO should
not actually get spread. Instead, the next available queue would be
chosen only when the currently used hardware queue is full. But this
is speculation without real benchmarks, of course.
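
Roughly the kind of selection I have in mind is sketched below (a
standalone toy with invented names, not the code in this series):

/* Sketch of per-CPU queue selection for a 1:N CPU -> h/w queue mapping.
 * All names and types here are invented for illustration. */
#include <stdbool.h>
#include <stddef.h>

struct hw_queue {
	unsigned int depth;
	unsigned int in_flight;
};

struct cpu_queue_group {
	unsigned int nr_queues;		/* h/w queues assigned to this CPU */
	unsigned int cur;		/* queue we keep submitting to */
	struct hw_queue *queues[8];
};

static bool queue_has_free_tag(const struct hw_queue *q)
{
	return q->in_flight < q->depth;
}

/* Stick to the current queue while it has free tags, so IO is not
 * spread around; fall over to the next queue only when it is full. */
struct hw_queue *select_queue(struct cpu_queue_group *grp)
{
	for (unsigned int i = 0; i < grp->nr_queues; i++) {
		unsigned int idx = (grp->cur + i) % grp->nr_queues;
		struct hw_queue *q = grp->queues[idx];

		if (queue_has_free_tag(q)) {
			grp->cur = idx;
			return q;
		}
	}
	return NULL;	/* all assigned queues full: caller waits for a tag */
}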

> If a CPU has more than one to choose from, a command sent to a less
> used queue would be serviced ahead of previously issued commands on a
> more heavily used one from the same CPU thread due to how NVMe command
> arbitration works, so it sounds like this would create odd latency
> outliers.

Yep, that sounds scary indeed. Still, any hints on benchmarking
are welcome.

Many thanks!

> Thanks,
> Keith


Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues

2016-09-16 Thread Keith Busch
On Fri, Sep 16, 2016 at 10:51:11AM +0200, Alexander Gordeev wrote:
> The Linux block device layer limits the number of hardware context
> queues to the number of CPUs in the system. That looks like suboptimal
> hardware utilization in systems where the number of CPUs is
> (significantly) smaller than the number of hardware queues.
>
> In addition, there is a need to deal with tag starvation (see commit
> 0d2602ca "blk-mq: improve support for shared tags maps"). While unused
> hardware queues stay idle, extra efforts are taken to maintain a notion
> of fairness between queue users. A deeper queue depth could probably
> mitigate the whole issue in some cases.
>
> That all leads to the straightforward idea that the hardware queues
> provided by a device should be utilized as much as possible.

Hi Alex,

I'm not sure I see how this helps. That probably means I'm not considering
the right scenario. Could you elaborate on when having multiple hardware
queues to choose from for a given CPU will provide a benefit?

If we're out of available h/w tags, having more queues shouldn't
improve performance. The tag depth on each nvme hw context is already
deep enough that it should mean even one full queue has saturated the
device capabilities.

Having a 1:1 mapping already seemed like the ideal solution since you can't
simultaneously utilize more than that from the host, so there's no more
h/w parallelism we can exploit. On the controller side, fetching
commands is serialized memory reads, so I don't think spreading IO
among more h/w queues helps the target over posting more commands to a
single queue.

If a CPU has more than one to choose from, a command sent to a less
used queue would be serviced ahead of previously issued commands on a
more heavily used one from the same CPU thread due to how NVMe command
arbitration works, so it sounds like this would create odd latency
outliers.
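
A toy model of that effect, assuming the default round-robin arbitration
and a fetch burst of one command per non-empty queue per round (real
controllers may use larger bursts):

/* Toy round-robin arbitration model: one command fetched per non-empty
 * submission queue per round. Queue A already holds a backlog when a
 * new command lands on the empty queue B from the same CPU. */
#include <stdio.h>

int main(void)
{
	int backlog = 100;		/* commands already queued on queue A */
	int round = 0;
	int new_cmd_round = -1;		/* round the queue-B command is fetched */
	int last_backlog_round = -1;	/* round the last queue-A command is fetched */

	while (backlog > 0 || new_cmd_round < 0) {
		round++;
		if (backlog > 0 && --backlog == 0)
			last_backlog_round = round;
		if (new_cmd_round < 0)
			new_cmd_round = round;	/* queue B is served right away */
	}

	printf("command on the idle queue fetched in round %d\n", new_cmd_round);
	printf("last of the 100 older commands fetched in round %d\n",
	       last_backlog_round);
	return 0;
}

The command issued last overtakes 99 previously issued commands from
the same CPU thread, which is the kind of latency outlier meant above.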

Thanks,
Keith


Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues

2016-09-16 Thread Alexander Gordeev
On Fri, Sep 16, 2016 at 02:27:43AM -0700, Christoph Hellwig wrote:
> Hi Alex,
> 
> this clashes badly with my queue mapping rework that went into
> Jens' tree recently.

Yeah, I am fully aware the RFC-marked patches would clash with your
work. I will surely rework them if the proposal is considered worthwhile.

> But in the meantime: there seem to be lots of little bugfixes and
> cleanups in the series, any chance you could send them out as a first
> series while updating the rest?

[He-he :) I can even see you removed map_queue() as well]
I will rebase the cleanups on top of your tree.

> Also please Cc the linux-block list for block layer patches that don't
> even seem to touch the nvme driver.

Just wanted to let NVMe people know, as this h/w is presumably the main
beneficiary.

Thanks!



Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues

2016-09-16 Thread Christoph Hellwig
Hi Alex,

this clashes badly with my queue mapping rework that went into
Jens' tree recently.

But in the meantime: there seem to be lots of little bugfixes and
cleanups in the series, any chance you could send them out as a first
series while updating the rest?

Also please Cc the linux-block list for block layer patches that don't
even seem to touch the nvme driver.


[PATCH RFC 00/21] blk-mq: Introduce combined hardware queues

2016-09-16 Thread Alexander Gordeev
The Linux block device layer limits the number of hardware context
queues to the number of CPUs in the system. That looks like suboptimal
hardware utilization in systems where the number of CPUs is
(significantly) smaller than the number of hardware queues.

In addition, there is a need to deal with tag starvation (see commit
0d2602ca "blk-mq: improve support for shared tags maps"). While unused
hardware queues stay idle, extra efforts are taken to maintain a notion
of fairness between queue users. A deeper queue depth could probably
mitigate the whole issue in some cases.

That all leads to the straightforward idea that the hardware queues
provided by a device should be utilized as much as possible.

This series is an attempt to introduce a 1:N mapping between CPUs and
hardware queues. The code is experimental, and hence some checks and
sysfs interfaces are withdrawn because they block the demo implementation.

The implementation evenly distributes hardware queues among CPUs, with
moderate changes to the existing codebase. Further developments of the
design are possible if needed, e.g. complete device utilization, CPU
and/or interrupt topology-driven queue distribution, or workload-driven
queue redistribution.
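
For illustration, an even hand-out of queues to CPUs could look roughly
like the following (a simplified standalone sketch with example numbers,
not the exact code in the patches):

/* Evenly hand out nr_hw_queues among nr_cpus when there are more
 * hardware queues than CPUs. The numbers are arbitrary examples. */
#include <stdio.h>

int main(void)
{
	const unsigned int nr_cpus = 4, nr_hw_queues = 14;
	unsigned int base = nr_hw_queues / nr_cpus;
	unsigned int rest = nr_hw_queues % nr_cpus;
	unsigned int q = 0;

	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned int n = base + (cpu < rest ? 1 : 0);

		printf("cpu %u -> queues %u..%u\n", cpu, q, q + n - 1);
		q += n;
	}
	return 0;
}

Each CPU ends up with either 3 or 4 of the 14 queues in this example;
topology-aware distribution would be a refinement on top of this.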

Comments and suggestions are very welcome!

The series is against the linux-block tree.

Thanks!

CC: Jens Axboe 
CC: linux-n...@lists.infradead.org

Alexander Gordeev (21):
  blk-mq: Fix memory leaks on a queue cleanup
  blk-mq: Fix a potential NULL pointer assignment to hctx tags
  block: Get rid of unused request_queue::nr_queues member
  blk-mq: Do not limit number of queues to 'nr_cpu_ids' in allocations
  blk-mq: Update hardware queue map after q->nr_hw_queues is set
  block: Remove redundant blk_mq_ops::map_queue() interface
  blk-mq: Remove a redundant assignment
  blk-mq: Cleanup hardware context data node selection
  blk-mq: Cleanup a loop exit condition
  blk-mq: Get rid of unnecessary blk_mq_free_hw_queues()
  blk-mq: Move duplicating code to blk_mq_exit_hctx()
  blk-mq: Uninit hardware context in order reverse to init
  blk-mq: Move hardware context init code into blk_mq_init_hctx()
  blk-mq: Rework blk_mq_init_hctx() function
  blk-mq: Pair blk_mq_hctx_kobj_init() with blk_mq_hctx_kobj_put()
  blk-mq: Set flush_start_tag to BLK_MQ_MAX_DEPTH
  blk-mq: Introduce a 1:N hardware contexts
  blk-mq: Enable tag numbers exceed hardware queue depth
  blk-mq: Enable combined hardware queues
  blk-mq: Allow combined hardware queues
  null_blk: Do not limit # of hardware queues to # of CPUs

 block/blk-core.c                  |   5 +-
 block/blk-flush.c                 |   6 +-
 block/blk-mq-cpumap.c             |  49 +++--
 block/blk-mq-sysfs.c              |   5 +
 block/blk-mq-tag.c                |   9 +-
 block/blk-mq.c                    | 373 +++---
 block/blk-mq.h                    |   4 +-
 block/blk.h                       |   2 +-
 drivers/block/loop.c              |   3 +-
 drivers/block/mtip32xx/mtip32xx.c |   4 +-
 drivers/block/null_blk.c          |  16 +-
 drivers/block/rbd.c               |   3 +-
 drivers/block/virtio_blk.c        |   6 +-
 drivers/block/xen-blkfront.c      |   6 +-
 drivers/md/dm-rq.c                |   4 +-
 drivers/mtd/ubi/block.c           |   1 -
 drivers/nvme/host/pci.c           |  29 +--
 drivers/nvme/host/rdma.c          |   2 -
 drivers/nvme/target/loop.c        |   2 -
 drivers/scsi/scsi_lib.c           |   4 +-
 include/linux/blk-mq.h            |  51 --
 include/linux/blkdev.h            |   1 -
 22 files changed, 279 insertions(+), 306 deletions(-)

-- 
1.8.3.1


