Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Arun Easi
On Mon, 3 Apr 2017, 9:47am, Jens Axboe wrote:

> On 04/03/2017 10:41 AM, Arun Easi wrote:
> > On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:
> > 
> >> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
> >>> On 04/03/2017 08:37 AM, Arun Easi wrote:
> >>>> If the above is true, then for an LLD to get tag# within its max-tasks 
> >>>> range, it has to report max-tasks / number-of-hw-queues in can_queue, 
> >>>> and in the I/O path, use the tag and hwq# to arrive at an index# to 
> >>>> use. This, though, leads to a poor use of tag resources -- a queue 
> >>>> reaching its capacity while the LLD can still take more.
> >>>
> >>> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
> >>> HBAs. ATM the only 'real' solution to this problem is indeed to have a
> >>> static split of the entire tag space by the number of hardware queues.
> >>> With the mentioned tag-starvation problem.
> >>
> >> Hello Arun and Hannes,
> >>
> >> Apparently the current blk_mq_alloc_tag_set() implementation is well suited
> >> for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers.
> >> How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to
> >> allocate a single set of tags for all hardware queues, and also adding a
> >> flag to struct scsi_host_template such that SCSI LLDs can enable this
> >> behavior?
> >>
> > 
> > Hi Bart,
> > 
> > This would certainly be beneficial in my case. Moreover, it certainly 
> > makes sense to move the logic up to where multiple drivers can leverage it. 
> > 
> > Perhaps use percpu_ida* interfaces to do that, but I think I read 
> > somewhere that it is not efficient (enough?), which is the reason block 
> > tags went the current way.
> 
> You don't have to change the underlying tag generation to solve this
> problem; Bart already pretty much outlined a fix that would work.
> percpu_ida works fine if you never use more than roughly half the
> available space, but it's a poor fit for request tags, where we want to
> retain good behavior and scaling at or near tag exhaustion. That's why
> blk-mq ended up rolling its own, which is now generically available as
> lib/sbitmap.c.
> 

Sounds good. Thanks for the education, Jens.

Regards,
-Arun


Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Jens Axboe
On 04/03/2017 10:41 AM, Arun Easi wrote:
> On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:
> 
>> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
>>> On 04/03/2017 08:37 AM, Arun Easi wrote:
>>>> If the above is true, then for an LLD to get tag# within its max-tasks 
>>>> range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
>>>> in the I/O path, use the tag and hwq# to arrive at an index# to use. This, 
>>>> though, leads to a poor use of tag resources -- a queue reaching its 
>>>> capacity while the LLD can still take more.
>>>
>>> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
>>> HBAs. ATM the only 'real' solution to this problem is indeed to have a
>>> static split of the entire tag space by the number of hardware queues.
>>> With the mentioned tag-starvation problem.
>>
>> Hello Arun and Hannes,
>>
>> Apparently the current blk_mq_alloc_tag_set() implementation is well suited
>> for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers.
>> How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to
>> allocate a single set of tags for all hardware queues, and also adding a flag
>> to struct scsi_host_template such that SCSI LLDs can enable this behavior?
>>
> 
> Hi Bart,
> 
> This would certainly be beneficial in my case. Moreover, it certainly 
> makes sense to move the logic up to where multiple drivers can leverage it. 
> 
> Perhaps use percpu_ida* interfaces to do that, but I think I read 
> somewhere that it is not efficient (enough?), which is the reason block 
> tags went the current way.

You don't have to change the underlying tag generation to solve this
problem; Bart already pretty much outlined a fix that would work.
percpu_ida works fine if you never use more than roughly half the
available space, but it's a poor fit for request tags, where we want to
retain good behavior and scaling at or near tag exhaustion. That's why
blk-mq ended up rolling its own, which is now generically available as
lib/sbitmap.c.
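
For illustration, a minimal sketch of a driver-private ID pool built on the
lib/sbitmap.c interfaces mentioned above. sbitmap_queue_init_node(),
__sbitmap_queue_get() and sbitmap_queue_clear() are the existing API;
MAX_TASKS and the wrapper names are placeholders, not code from any driver.

#include <linux/sbitmap.h>
#include <linux/gfp.h>
#include <linux/smp.h>

#define MAX_TASKS	2048			/* placeholder depth */

static struct sbitmap_queue hw_id_pool;

static int hw_id_pool_init(int numa_node)
{
	/* depth = MAX_TASKS, default word shift, no round-robin allocation */
	return sbitmap_queue_init_node(&hw_id_pool, MAX_TASKS, -1, false,
				       GFP_KERNEL, numa_node);
}

static int hw_id_get(void)
{
	return __sbitmap_queue_get(&hw_id_pool);	/* -1 when exhausted */
}

static void hw_id_put(int id)
{
	sbitmap_queue_clear(&hw_id_pool, id, raw_smp_processor_id());
}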

-- 
Jens Axboe



Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Arun Easi
On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:

> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
> > On 04/03/2017 08:37 AM, Arun Easi wrote:
> > > If the above is true, then for an LLD to get tag# within its max-tasks 
> > > range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
> > > in the I/O path, use the tag and hwq# to arrive at an index# to use. This, 
> > > though, leads to a poor use of tag resources -- a queue reaching its 
> > > capacity while the LLD can still take more.
> >
> > Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
> > HBAs. ATM the only 'real' solution to this problem is indeed to have a
> > static split of the entire tag space by the number of hardware queues.
> > With the mentioned tag-starvation problem.
> 
> Hello Arun and Hannes,
> 
> Apparently the current blk_mq_alloc_tag_set() implementation is well suited
> for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers.
> How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to
> allocate a single set of tags for all hardware queues, and also adding a flag
> to struct scsi_host_template such that SCSI LLDs can enable this behavior?
> 

Hi Bart,

This would certainly be beneficial in my case. Moreover, it certainly 
makes sense to move the logic up to where multiple drivers can leverage it. 

Perhaps use percpu_ida* interfaces to do that, but I think I read 
somewhere that it is not efficient (enough?), which is the reason block 
tags went the current way.

Regards,
-Arun

Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Arun Easi
On Mon, 3 Apr 2017, 12:29am, Hannes Reinecke wrote:

> On 04/03/2017 08:37 AM, Arun Easi wrote:
> > Hi Folks,
> > 
> > I would like to seek your input on a few topics on SCSI / block 
> > multi-queue.
> > 
> > 1. Tag# generation.
> > 
> > The context is with SCSI MQ on. My question is, what should an LLD do to 
> > get request tag values in the range 0 through can_queue - 1 across *all* 
> > of the queues. In our QLogic 41XXX series of adapters, we have a per 
> > session submit queue, a shared task memory (shared across all queues) and 
> > N completion queues (separate MSI-X vectors). We report N as the 
> > nr_hw_queues. I would like to, if possible, use the block layer tags to 
> > index into the above shared task memory area.
> > 
> > From looking at the scsi/block source, it appears that when an LLD reports 
> > a value, say #C, in can_queue (via scsi_host_template), that value is used 
> > as the max depth when the corresponding block layer queues are created. So, 
> > while SCSI restricts the number of commands to the LLD at #C, the request tag 
> > generated across any of the queues can range from 0..#C-1. Please correct 
> > me if I got this wrong. 
> > 
> > If the above is true, then for an LLD to get tag# within its max-tasks 
> > range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
> > in the I/O path, use the tag and hwq# to arrive at an index# to use. This, 
> > though, leads to a poor use of tag resources -- a queue reaching its 
> > capacity while the LLD can still take more.
> > 
> Yep.
> 
> > blk_mq_unique_tag() would not work here, because it just puts the hwq# in 
> > the upper 16 bits, which need not fall in the max-tasks range.
> > 
> > Perhaps the current MQ model is to cater to a queue pair 
> > (submit/completion) kind of hardware model; nevertheless I would like to 
> > know how other hardware variants can make use of it. 
> > 
> Heh. Welcome to the club.
> 
> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
> HBAs. ATM the only 'real' solution to this problem is indeed to have a
> static split of the entire tag space by the number of hardware queues.
> With the mentioned tag-starvation problem.
> 
> If we were to continue with the tag-to-hardware-ID mapping, we would
> need to implement a dynamic tag space mapping onto hardware queues.
> My idea for that would be to map not the entire tag space, but rather the
> individual bit words, onto the hardware queues. Then we could make the
> mapping dynamic, where the individual words are mapped onto the queues
> only as needed.
> However, the _one_ big problem we're facing here is timeouts.
> With the 1:1 mapping between tags and hardware IDs we can only re-use
> the tag once the timeout is _definitely_ resolved. But this means the
> command will stay active, and we cannot complete it via blk_mq_complete()
> until the timeout itself has been resolved.
> With FC this essentially means until the corresponding XIDs are safe to
> re-use, i.e. after all ABRT/RRQ etc. processing has been completed.
> Hence we totally lose the ability to return the command itself with
> -ETIMEDOUT and continue with I/O processing even though the original XID
> is still being held by firmware.
> 
> In the light of this I wonder if it wouldn't be better to completely
> decouple block-layer tags and hardware IDs, and have an efficient
> algorithm mapping the block-layer tags onto hardware IDs.
> That should avoid the arbitrary tag starvation problem, and would allow
> us to handle timeouts efficiently.
> Of course, we don't _have_ such an efficient algorithm; maybe it's time
> to have a generic one within the kernel, as quite a few drivers would
> _love_ to just use the generic implementation here.
> (qla2xxx, lpfc, fcoe, mpt3sas etc all suffer from the same problem)
> 
> > 2. mq vs non-mq performance gain.
> > 
> > This is more like a poll, I guess. I was wondering what performance gains 
> > folks are observing with SCSI MQ on. I saw Christoph H.'s slide deck that 
> > has one slide that shows a 200k IOPS gain.
> > 
> > From my testing, though, I was not lucky enough to observe that big of a 
> > change. In fact, the difference was not even noticeable(*). For example, a 
> > 512-byte random read test in both cases gave me in the vicinity of 2M IOPS. 
> > When I say both cases, I mean one with scsi_mod's use_blk_mq set to 0 
> > and another with 1 (the LLD is reloaded when it is done). I only used one 
> > NUMA node for this run. The test was run on an x86_64 setup.
> > 
> You _really_ should have listened to my talk at VAULT.

Would you have a slide deck / minutes that could be shared?

> For 'legacy' HBAs there indeed is not much of a performance gain to be
> had; the max gain is indeed for heavy parallel I/O.

I have multiple devices (I-T nexuses) in my setup, so there are definitely 
parallel I/Os.

> And there even is a scheduler issue when running with a single
> submission thread; there I've measured a performance _drop_ by up to
> 50%. Which, as Jens claims, 

Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Bart Van Assche
On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
> On 04/03/2017 08:37 AM, Arun Easi wrote:
> > If the above is true, then for an LLD to get tag# within its max-tasks 
> > range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
> > in the I/O path, use the tag and hwq# to arrive at an index# to use. This, 
> > though, leads to a poor use of tag resources -- a queue reaching its 
> > capacity while the LLD can still take more.
>
> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
> HBAs. ATM the only 'real' solution to this problem is indeed to have a
> static split of the entire tag space by the number of hardware queues.
> With the mentioned tag-starvation problem.

Hello Arun and Hannes,

Apparently the current blk_mq_alloc_tag_set() implementation is well suited
for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers.
How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to
allocate a single set of tags for all hardware queues, and also adding a flag
to struct scsi_host_template such that SCSI LLDs can enable this behavior?
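
Purely as a sketch of that idea: the flag name and the opt-in bit below are
invented for illustration and do not exist in the kernel; only struct
blk_mq_tag_set and its ->flags field are existing definitions.

#include <linux/blk-mq.h>

/* Hypothetical flag: request one tag space shared by all hardware queues. */
#define BLK_MQ_F_SHARED_HOST_TAGS	(1 << 7)	/* invented, illustrative */

/*
 * Hypothetical midlayer hook: if the LLD set a (likewise invented)
 * "shared_host_tags" bit in its scsi_host_template, mark the tag_set so
 * that __blk_mq_alloc_rq_maps() allocates a single tags[] array and lets
 * every hardware queue share it, instead of nr_hw_queues separate spaces.
 */
static void scsi_mq_apply_shared_tags(struct blk_mq_tag_set *set,
				      bool shared_host_tags)
{
	if (shared_host_tags)
		set->flags |= BLK_MQ_F_SHARED_HOST_TAGS;
}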

Bart.

Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Hannes Reinecke
On 04/03/2017 08:37 AM, Arun Easi wrote:
> Hi Folks,
> 
> I would like to seek your input on a few topics on SCSI / block 
> multi-queue.
> 
> 1. Tag# generation.
> 
> The context is with SCSI MQ on. My question is, what should an LLD do to 
> get request tag values in the range 0 through can_queue - 1 across *all* 
> of the queues. In our QLogic 41XXX series of adapters, we have a per 
> session submit queue, a shared task memory (shared across all queues) and 
> N completion queues (separate MSI-X vectors). We report N as the 
> nr_hw_queues. I would like to, if possible, use the block layer tags to 
> index into the above shared task memory area.
> 
> From looking at the scsi/block source, it appears that when an LLD reports 
> a value, say #C, in can_queue (via scsi_host_template), that value is used 
> as the max depth when the corresponding block layer queues are created. So, 
> while SCSI restricts the number of commands to the LLD at #C, the request tag 
> generated across any of the queues can range from 0..#C-1. Please correct 
> me if I got this wrong.
> 
> If the above is true, then for an LLD to get tag# within its max-tasks 
> range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
> in the I/O path, use the tag and hwq# to arrive at an index# to use. This, 
> though, leads to a poor use of tag resources -- a queue reaching its 
> capacity while the LLD can still take more.
> 
Yep.

> blk_mq_unique_tag() would not work here, because it just puts the hwq# in 
> the upper 16 bits, which need not fall in the max-tasks range.
> 
> Perhaps the current MQ model is to cater to a queue pair 
> (submit/completion) kind of hardware model; nevertheless I would like to 
> know how other hardware variants can make use of it. 
> 
Heh. Welcome to the club.

Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
HBAs. ATM the only 'real' solution to this problem is indeed to have a
static split of the entire tag space by the number of hardware queues.
With the mentioned tag-starvation problem.
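
To make the static split concrete, here is a minimal sketch using the
existing blk_mq_unique_tag() helpers; max_tasks and nr_hw_queues are
placeholders for driver-specific values, and can_queue is assumed to have
been reported as max_tasks / nr_hw_queues.

#include <linux/blk-mq.h>
#include <scsi/scsi_cmnd.h>

static u32 lld_task_index(struct scsi_cmnd *cmd, u32 max_tasks,
			  u32 nr_hw_queues)
{
	u32 unique = blk_mq_unique_tag(cmd->request);
	u16 hwq = blk_mq_unique_tag_to_hwq(unique);	/* upper 16 bits */
	u16 tag = blk_mq_unique_tag_to_tag(unique);	/* 0..can_queue-1 */

	/* each hardware queue owns a fixed slice of the shared task memory */
	return hwq * (max_tasks / nr_hw_queues) + tag;
}

The starvation is visible in the arithmetic: each hardware queue only ever
sees max_tasks / nr_hw_queues tags, even while the other queues sit idle.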

If we were to continue with the tag-to-hardware-ID mapping, we would
need to implement a dynamic tag space mapping onto hardware queues.
My idea for that would be to map not the entire tag space, but rather the
individual bit words, onto the hardware queues. Then we could make the
mapping dynamic, where the individual words are mapped onto the queues
only as needed.
However, the _one_ big problem we're facing here is timeouts.
With the 1:1 mapping between tags and hardware IDs we can only re-use
the tag once the timeout is _definitely_ resolved. But this means the
command will stay active, and we cannot complete it via blk_mq_complete()
until the timeout itself has been resolved.
With FC this essentially means until the corresponding XIDs are safe to
re-use, i.e. after all ABRT/RRQ etc. processing has been completed.
Hence we totally lose the ability to return the command itself with
-ETIMEDOUT and continue with I/O processing even though the original XID
is still being held by firmware.

In the light of this I wonder if it wouldn't be better to completely
decouple block-layer tags and hardware IDs, and have an efficient
algorithm mapping the block-layer tags onto hardware IDs.
That should avoid the arbitrary tag starvation problem, and would allow
us to handle timeouts efficiently.
Of course, we don't _have_ such an efficient algorithm; maybe it's time
to have a generic one within the kernel, as quite a few drivers would
_love_ to just use the generic implementation here.
(qla2xxx, lpfc, fcoe, mpt3sas etc all suffer from the same problem)
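
A rough sketch of that decoupling, with every name below invented for
illustration: the LLD keeps its own hardware-ID allocator plus a lookup
table, so a block-layer tag can be completed and recycled on timeout while
the hardware ID (XID) stays reserved until abort processing has finished.

#include <linux/sbitmap.h>
#include <linux/smp.h>
#include <scsi/scsi_cmnd.h>

struct lld_host {
	struct sbitmap_queue	hw_ids;			/* pool of HW IDs/XIDs */
	struct scsi_cmnd	**hw_id_to_cmd;		/* [max_tasks] lookup */
};

static int lld_attach_hw_id(struct lld_host *h, struct scsi_cmnd *cmd)
{
	int id = __sbitmap_queue_get(&h->hw_ids);

	if (id < 0)
		return -EBUSY;			/* out of hardware IDs */
	h->hw_id_to_cmd[id] = cmd;
	return id;				/* placed into the IOCB/XID */
}

/* Called only once firmware has definitely let go of the ID. */
static void lld_release_hw_id(struct lld_host *h, int id)
{
	h->hw_id_to_cmd[id] = NULL;
	sbitmap_queue_clear(&h->hw_ids, id, raw_smp_processor_id());
}

The point is the two lifetimes: the block tag is freed when the request is
completed, while the HW ID is freed only here, after ABRT/RRQ processing.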

> 2. mq vs non-mq performance gain.
> 
> This is more like a poll, I guess. I was wondering what performance gains 
> folks are observing with SCSI MQ on. I saw Christoph H.'s slide deck that 
> has one slide that shows a 200k IOPS gain.
> 
> From my testing, though, I was not lucky enough to observe that big of a 
> change. In fact, the difference was not even noticeable(*). For example, a 
> 512-byte random read test in both cases gave me in the vicinity of 2M IOPS. 
> When I say both cases, I mean one with scsi_mod's use_blk_mq set to 0 
> and another with 1 (the LLD is reloaded when it is done). I only used one 
> NUMA node for this run. The test was run on an x86_64 setup.
> 
You _really_ should have listened to my talk at VAULT.
For 'legacy' HBAs there indeed is not much of a performance gain to be
had; the max gain is indeed for heavy parallel I/O.
And there even is a scheduler issue when running with a single
submission thread; there I've measured a performance _drop_ of up to
50%, which, as Jens claims, really looks like a block-layer issue rather
than a generic problem.


> * See item 3 for special handling.
> 
> 3. add_random slowness.
> 
> One thing I observed with MQ on and off was this block layer tunable, 
> add_random, which, as I understand it, tunes the disk's entropy contribution. 
> With non-MQ it is turned on, and with MQ it is turned off by default.
> 
> This got noticed