Re: [PATCH V4 0/4] SCSI: fix selection of reply(hw) queue

2018-03-08 Thread Ming Lei
On Fri, Mar 09, 2018 at 08:00:52AM +0100, Hannes Reinecke wrote:
> On 03/09/2018 04:32 AM, Ming Lei wrote:
> > Hi All,
> > 
> > These patches fix reply queue (virt-queue on virtio-scsi) selection in hpsa,
> > megaraid_sas and virtio-scsi; this issue can easily cause an IO hang.
> > 
> > The issue is triggered by 84676c1f21e8 ("genirq/affinity: assign vectors
> > to all possible CPUs"). After 84676c1f21e8, it is easy to end up with one msix
> > vector mapped only to offline CPUs. If the reply queue is selected from
> > all allocated msix vectors (reply queues) in a round-robin way, the selected
> > reply queue may not have any online CPU mapped, and an IO hang results.
> > 
> > Both hpsa and megaraid_sas use a host-wide tagset, so we can't convert the
> > reply queue to a blk_mq hw queue directly, otherwise IO performance would be
> > degraded badly according to Kashyap's test. Instead, this patchset sets up one
> > mapping table for selecting the reply queue, an approach already used by
> > mpt3sas.
> > 
> > For virtio-scsi, the virt-queue really is a hw queue from the blk-mq point of
> > view, so we introduce 'force_blk_mq' to fix this issue because: 1) virtio-blk
> > has been used for years in blk-mq mode; 2) we have discussed recently
> > that scsi_mq will be enabled by default soon.
> > 
> > gitweb:
> > https://github.com/ming1/linux/tree/v4.16-rc-select-reply-queue-fix-V4
> > 
> > V4:
> > - split from the previous patchset
> > - handle virtio-scsi by force_blk_mq
> > 
> > Ming Lei (4):
> >   scsi: hpsa: fix selection of reply queue
> >   scsi: megaraid_sas: fix selection of reply queue
> >   scsi: introduce force_blk_mq
> >   scsi: virtio_scsi: fix IO hang caused by irq vector automatic affinity
> > 
> >  drivers/scsi/hosts.c|  1 +
> >  drivers/scsi/hpsa.c | 73 +
> >  drivers/scsi/hpsa.h |  1 +
> >  drivers/scsi/megaraid/megaraid_sas.h|  2 +-
> >  drivers/scsi/megaraid/megaraid_sas_base.c   | 34 +-
> >  drivers/scsi/megaraid/megaraid_sas_fusion.c | 12 ++---
> >  drivers/scsi/virtio_scsi.c  | 59 ++-
> >  include/scsi/scsi_host.h|  3 ++
> >  8 files changed, 100 insertions(+), 85 deletions(-)
> > 
> Well ... while this looks good in principle, what happens on cpu hotplug?
> Don't we have to redo the map then?

Each entry in the table maps one CPU id to a hw queue index, and the size of
the table is 'nr_cpu_ids', so there is no need to redo the map on CPU hotplug,
just like the usage of set->mq_map in blk-mq.
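
For illustration only (the structure and function names below are made up;
the real implementation is the reply_map/hpsa_setup_reply_map() code in
patch 1/4 of this series), the idea is roughly:

	/* one entry per possible CPU, so hotplug never needs a rebuild */
	struct example_ctlr {
		struct pci_dev *pdev;
		unsigned int nr_reply_queues;
		u32 reply_map[NR_CPUS];		/* nr_cpu_ids entries in practice */
	};

	static void example_setup_reply_map(struct example_ctlr *h)
	{
		const struct cpumask *mask;
		unsigned int queue, cpu;

		for (queue = 0; queue < h->nr_reply_queues; queue++) {
			/* affinity mask of the MSI-X vector backing this reply queue */
			mask = pci_irq_get_affinity(h->pdev, queue);
			if (!mask)
				goto fallback;
			for_each_cpu(cpu, mask)
				h->reply_map[cpu] = queue;
		}
		return;

	fallback:
		for_each_possible_cpu(cpu)
			h->reply_map[cpu] = 0;
	}

	/* submission path: index by the current CPU, no locking needed */
	static inline u32 example_pick_reply_queue(struct example_ctlr *h)
	{
		return h->reply_map[raw_smp_processor_id()];
	}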

Thanks,
Ming


Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-08 Thread Ming Lei
On Fri, Mar 09, 2018 at 09:00:08AM +0200, Artem Bityutskiy wrote:
> On Fri, 2018-03-09 at 09:24 +0800, Ming Lei wrote:
> > Hi Thomas,
> > 
> > On Fri, Mar 09, 2018 at 12:20:09AM +0100, Thomas Gleixner wrote:
> > > On Thu, 8 Mar 2018, Ming Lei wrote:
> > > > Actually, it isn't a real fix, the real one is in the following
> > > > two:
> > > > 
> > > > 0c20244d458e scsi: megaraid_sas: fix selection of reply queue
> > > > ed6d043be8cd scsi: hpsa: fix selection of reply queue
> > > 
> > > Where are these commits? Neither Linus tree not -next know anything
> > > about
> > > them
> > 
> > Both aren't merged yet, but they should land V4.16, IMO.
> 
> Is it a secret where they are? If not, could you please give me a
> pointer and I'll give them a test.

  https://marc.info/?l=linux-block&m=152056636717380&w=2

Thanks,
Ming


Re: [PATCH V4 0/4] SCSI: fix selection of reply(hw) queue

2018-03-08 Thread Hannes Reinecke
On 03/09/2018 04:32 AM, Ming Lei wrote:
> Hi All,
> 
> These patches fix reply queue (virt-queue on virtio-scsi) selection in hpsa,
> megaraid_sas and virtio-scsi; this issue can easily cause an IO hang.
> 
> The issue is triggered by 84676c1f21e8 ("genirq/affinity: assign vectors
> to all possible CPUs"). After 84676c1f21e8, it is easy to end up with one msix
> vector mapped only to offline CPUs. If the reply queue is selected from
> all allocated msix vectors (reply queues) in a round-robin way, the selected
> reply queue may not have any online CPU mapped, and an IO hang results.
> 
> Both hpsa and megaraid_sas use a host-wide tagset, so we can't convert the
> reply queue to a blk_mq hw queue directly, otherwise IO performance would be
> degraded badly according to Kashyap's test. Instead, this patchset sets up one
> mapping table for selecting the reply queue, an approach already used by
> mpt3sas.
> 
> For virtio-scsi, the virt-queue really is a hw queue from the blk-mq point of
> view, so we introduce 'force_blk_mq' to fix this issue because: 1) virtio-blk
> has been used for years in blk-mq mode; 2) we have discussed recently
> that scsi_mq will be enabled by default soon.
> 
> gitweb:
>   https://github.com/ming1/linux/tree/v4.16-rc-select-reply-queue-fix-V4
> 
> V4:
>   - split from the previous patchset
>   - handle virtio-scsi by force_blk_mq
> 
> Ming Lei (4):
>   scsi: hpsa: fix selection of reply queue
>   scsi: megaraid_sas: fix selection of reply queue
>   scsi: introduce force_blk_mq
>   scsi: virtio_scsi: fix IO hang caused by irq vector automatic affinity
> 
>  drivers/scsi/hosts.c|  1 +
>  drivers/scsi/hpsa.c | 73 +
>  drivers/scsi/hpsa.h |  1 +
>  drivers/scsi/megaraid/megaraid_sas.h|  2 +-
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 34 +-
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 12 ++---
>  drivers/scsi/virtio_scsi.c  | 59 ++-
>  include/scsi/scsi_host.h|  3 ++
>  8 files changed, 100 insertions(+), 85 deletions(-)
> 
Well ... while this looks good in principle, what happens on cpu hotplug?
Don't we have to redo the map then?

Cheers,

Hannes

-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-08 Thread Artem Bityutskiy
On Fri, 2018-03-09 at 09:24 +0800, Ming Lei wrote:
> Hi Thomas,
> 
> On Fri, Mar 09, 2018 at 12:20:09AM +0100, Thomas Gleixner wrote:
> > On Thu, 8 Mar 2018, Ming Lei wrote:
> > > Actually, it isn't a real fix, the real one is in the following
> > > two:
> > > 
> > >   0c20244d458e scsi: megaraid_sas: fix selection of reply queue
> > >   ed6d043be8cd scsi: hpsa: fix selection of reply queue
> > 
> > Where are these commits? Neither Linus tree not -next know anything
> > about
> > them
> 
> Both aren't merged yet, but they should land V4.16, IMO.

Is it a secret where they are? If not, could you please give me a
pointer and I'll give them a test.


RE: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via .host_tagset

2018-03-08 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Thursday, March 8, 2018 4:54 PM
> To: Kashyap Desai
> Cc: Jens Axboe; linux-block@vger.kernel.org; Christoph Hellwig; Mike
Snitzer;
> linux-s...@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Laurence Oberman
> Subject: Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance
via
> .host_tagset
>
> On Thu, Mar 08, 2018 at 07:06:25PM +0800, Ming Lei wrote:
> > On Thu, Mar 08, 2018 at 03:34:31PM +0530, Kashyap Desai wrote:
> > > > -Original Message-
> > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > Sent: Thursday, March 8, 2018 6:46 AM
> > > > To: Kashyap Desai
> > > > Cc: Jens Axboe; linux-block@vger.kernel.org; Christoph Hellwig;
> > > > Mike
> > > Snitzer;
> > > > linux-s...@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar
> > > > Sandoval; Martin K . Petersen; James Bottomley; Christoph Hellwig;
> > > > Don Brace;
> > > Peter
> > > > Rivera; Laurence Oberman
> > > > Subject: Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq
> > > > performance
> > > via
> > > > .host_tagset
> > > >
> > > > On Wed, Mar 07, 2018 at 10:58:34PM +0530, Kashyap Desai wrote:
> > > > > > >
> > > > > > > Also one observation using V3 series patch. I am seeing
> > > > > > > below Affinity mapping whereas I have only 72 logical CPUs.
> > > > > > > It means we are really not going to use all reply queues.
> > > > > > > e.a If I bind fio jobs on CPU 18-20, I am seeing only one
> > > > > > > reply queue is used and that may lead to performance drop as
well.
> > > > > >
> > > > > > If the mapping is in such shape, I guess it should be quite
> > > > > > difficult to
> > > > > figure out
> > > > > > one perfect way to solve this situation because one reply
> > > > > > queue has to
> > > > > handle
> > > > > > IOs submitted from 4~5 CPUs at average.
> > > > >
> > > > > 4.15.0-rc1 kernel has below mapping - I am not sure which commit
> > > > > id in
> > > "
> > > > > linux_4.16-rc-host-tags-v3.2" is changing the mapping of IRQ to
CPU.
> > > > > It
> > > >
> > > > I guess the mapping you posted is read from
/proc/irq/126/smp_affinity.
> > > >
> > > > If yes, no any patch in linux_4.16-rc-host-tags-v3.2 should change
> > > > IRQ
> > > affinity
> > > > code, which is done in irq_create_affinity_masks(), as you saw, no
> > > > any
> > > patch
> > > > in linux_4.16-rc-host-tags-v3.2 touches that code.
> > > >
> > > > Could you simply apply the patches in linux_4.16-rc-host-tags-v3.2
> > > against
> > > > 4.15-rc1 kernel and see any difference?
> > > >
> > > > > will be really good if we can fall back to below mapping once
again.
> > > > > Current repo linux_4.16-rc-host-tags-v3.2 is giving lots of
> > > > > random mapping of CPU - MSIx. And that will be problematic in
> > > > > performance
> > > run.
> > > > >
> > > > > As I posted earlier, latest repo will only allow us to use *18*
> > > > > reply
> > > >
> > > > Looks not see this report before, could you share us how you
> > > > conclude
> > > that?
> > > > The only patch changing reply queue is the following one:
> > > >
> > > > https://marc.info/?l=linux-block&m=151972611911593&w=2
> > > >
> > > > But not see any issue in this patch yet, can you recover to 72
> > > > reply
> > > queues
> > > > after reverting the patch in above link?
> > > Ming -
> > >
> > > While testing, my system went bad. I debug further and understood
> > > that affinity mapping was changed due to below commit -
> > > 84676c1f21e8ff54befe985f4f14dc1edc10046b
> > >
> > > [PATCH] genirq/affinity: assign vectors to all possible CPUs
> > >
> > > Because of above change, we end up using very less reply queue. Many
> > > reply queues on my setup was mapped to offline/not-available CPUs.
> > > This may be primary contributing to odd performance impact and it
> > > may not be truly due to V3/V4 patch series.
> >
> > Seems a good news, :-)
> >
> > >
> > > I am planning to check your V3 and V4 series after removing above
> > > commit ID (for performance impact.).
> >
> > You can run your test on a server in which all CPUs are kept as online
> > for avoiding this issue.
> >
> > Or you can apply the following patchset for avoiding this issue:
> >
> > https://marc.info/?l=linux-block&m=152050646332092&w=2
>
> If you want to do this way, all patches have been put into the following
> tree(V4):
>
>   https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v4

I tested the above V4 commits. Now the IRQ-CPU mapping has at least one online
CPU per vector, as explained in the patch "genirq/affinity: irq vector spread
among online CPUs as far as possible".
The new IRQ-CPU mapping looks better.

Below is the irq/cpu mapping (I have 72 online logical CPUs, 0-71).
Kernel version:
Linux rhel7.3 4.16.0-rc4+ #1 SMP Thu Mar 8 10:51:56 EST 2018 x86_64 x86_64
x86_64 GNU/Linux
PCI name is 86:00.0, dump its irq affinity:
irq 218, cpu list 0,72,74,76,78
irq 219, cpu 

Re: [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests

2018-03-08 Thread Bart Van Assche
On Thu, 2018-03-01 at 15:58 +, Stephen  Bates wrote:
> > Any plans adding the capability to nvme-rdma? Should be
> > straight-forward... In theory, the use-case would be rdma backend
> > fabric behind. Shouldn't be hard to test either...
> 
> Nice idea Sagi. Yes we have been starting to look at that. Though again we
> would probably want to impose the "attached to the same PCIe switch" rule
> which might be less common to satisfy in initiator systems. 
> 
> Down the road I would also like to discuss the best way to use this P2P
> framework to facilitate copies between NVMe namespaces (on both PCIe and
> fabric attached namespaces) without having to expose the CMB up to user
> space. Wasn't something like that done in the SCSI world at some point
> Martin?

Are you perhaps referring to the following patch series: "Copy Offload"
(https://www.spinics.net/lists/linux-scsi/msg74680.html /
https://lwn.net/Articles/592094/)? I will contact Martin off-list, and in
case he doesn't have the time himself to revive that patch series, I will
free up some time to work on this.

Bart.






[PATCH V4 3/4] scsi: introduce force_blk_mq

2018-03-08 Thread Ming Lei
From the SCSI driver's point of view, it is a bit troublesome to support both
blk-mq and non-blk-mq at the same time, especially when drivers need to
support multiple hw queues.

This patch introduces 'force_blk_mq' in scsi_host_template so that drivers
can provide blk-mq-only support, allowing the driver code to avoid the
trouble of supporting both paths.

Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Mike Snitzer 
Cc: Laurence Oberman 
Reviewed-by: Hannes Reinecke 
Signed-off-by: Ming Lei 
---
 drivers/scsi/hosts.c | 1 +
 include/scsi/scsi_host.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 57bf43e34863..10f04b089392 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -477,6 +477,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
	shost->dma_boundary = 0xffffffff;
 
shost->use_blk_mq = scsi_use_blk_mq;
+   shost->use_blk_mq = scsi_use_blk_mq || !!shost->hostt->force_blk_mq;
 
	device_initialize(&shost->shost_gendev);
	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 1a1df0d21ee3..6c6366f0bd15 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -452,6 +452,9 @@ struct scsi_host_template {
/* True if the controller does not support WRITE SAME */
unsigned no_write_same:1;
 
+   /* True if the low-level driver supports blk-mq only */
+   unsigned force_blk_mq:1;
+
/*
 * Countdown for host blocking with no commands outstanding.
 */
-- 
2.9.5



[PATCH V4 4/4] scsi: virtio_scsi: fix IO hang caused by irq vector automatic affinity

2018-03-08 Thread Ming Lei
Now that 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
has been merged into v4.16-rc, it is easy for some irq vectors to be assigned
only offline CPUs, and this can't be avoided even though the allocation is
improved.

For example, on an 8-core VM where CPUs 4~7 are not present/offline and
virtio-scsi has 4 queues, the assigned irq affinity can take the following shape:

irq 36, cpu list 0-7
irq 37, cpu list 0-7
irq 38, cpu list 0-7
irq 39, cpu list 0-1
irq 40, cpu list 4,6
irq 41, cpu list 2-3
irq 42, cpu list 5,7

Then an IO hang is triggered in the non-SCSI_MQ case.

Given storage IO always follows a client/server model, there is no such issue
with SCSI_MQ (blk-mq), because no IO can be submitted to a hw queue that isn't
mapped to any online CPU.

Fix this issue by forcing the use of blk-mq.

BTW, I have been using virtio-scsi (scsi_mq) for several years, and it has
been quite stable, so this shouldn't add extra risk.

Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Cc: Laurence Oberman 
Reviewed-by: Hannes Reinecke 
Acked-by: Paolo Bonzini 
Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
Signed-off-by: Ming Lei 
---
 drivers/scsi/virtio_scsi.c | 59 +++---
 1 file changed, 3 insertions(+), 56 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 7c28e8d4955a..54e3a0f6844c 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -91,9 +91,6 @@ struct virtio_scsi_vq {
 struct virtio_scsi_target_state {
seqcount_t tgt_seq;
 
-   /* Count of outstanding requests. */
-   atomic_t reqs;
-
/* Currently active virtqueue for requests sent to this target. */
struct virtio_scsi_vq *req_vq;
 };
@@ -152,8 +149,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
*vscsi, void *buf)
struct virtio_scsi_cmd *cmd = buf;
struct scsi_cmnd *sc = cmd->sc;
	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
 
	dev_dbg(&sc->device->sdev_gendev,
"cmd %p response %u status %#02x sense_len %u\n",
@@ -210,8 +205,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
*vscsi, void *buf)
}
 
sc->scsi_done(sc);
-
-	atomic_dec(&tgt->reqs);
 }
 
 static void virtscsi_vq_done(struct virtio_scsi *vscsi,
@@ -580,10 +573,7 @@ static int virtscsi_queuecommand_single(struct Scsi_Host 
*sh,
struct scsi_cmnd *sc)
 {
struct virtio_scsi *vscsi = shost_priv(sh);
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
 
-	atomic_inc(&tgt->reqs);
	return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
 }
 
@@ -596,55 +586,11 @@ static struct virtio_scsi_vq *virtscsi_pick_vq_mq(struct 
virtio_scsi *vscsi,
	return &vscsi->req_vqs[hwq];
 }
 
-static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
-					       struct virtio_scsi_target_state *tgt)
-{
-   struct virtio_scsi_vq *vq;
-   unsigned long flags;
-   u32 queue_num;
-
-   local_irq_save(flags);
-	if (atomic_inc_return(&tgt->reqs) > 1) {
-		unsigned long seq;
-
-		do {
-			seq = read_seqcount_begin(&tgt->tgt_seq);
-			vq = tgt->req_vq;
-		} while (read_seqcount_retry(&tgt->tgt_seq, seq));
-	} else {
-		/* no writes can be concurrent because of atomic_t */
-		write_seqcount_begin(&tgt->tgt_seq);
-
-		/* keep previous req_vq if a reader just arrived */
-		if (unlikely(atomic_read(&tgt->reqs) > 1)) {
-			vq = tgt->req_vq;
-			goto unlock;
-		}
-
-		queue_num = smp_processor_id();
-		while (unlikely(queue_num >= vscsi->num_queues))
-			queue_num -= vscsi->num_queues;
-		tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
- unlock:
-		write_seqcount_end(&tgt->tgt_seq);
-   }
-   local_irq_restore(flags);
-
-   return vq;
-}
-
 static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
   struct scsi_cmnd *sc)
 {
struct virtio_scsi *vscsi = shost_priv(sh);
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
-   struct virtio_scsi_vq 

[PATCH V4 1/4] scsi: hpsa: fix selection of reply queue

2018-03-08 Thread Ming Lei
From 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs"),
one msix vector can be created without any online CPU mapped to it, so a
command's completion may never be notified.

This patch sets up a mapping between CPU and reply queue according to the irq
affinity info retrieved by pci_irq_get_affinity(), and uses this mapping table
to choose the reply queue when queuing a command.

The chosen reply queue is then guaranteed to be active, which fixes the IO hang
caused by using an inactive reply queue that has no online CPU mapped.

Cc: Hannes Reinecke 
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Laurence Oberman 
Cc: Meelis Roos 
Cc: Artem Bityutskiy 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Tested-by: Don Brace 
Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
Signed-off-by: Ming Lei 
---
 drivers/scsi/hpsa.c | 73 +++--
 drivers/scsi/hpsa.h |  1 +
 2 files changed, 55 insertions(+), 19 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 5293e6827ce5..3a9eca163db8 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -1045,11 +1045,7 @@ static void set_performant_mode(struct ctlr_info *h, 
struct CommandList *c,
c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
if (unlikely(!h->msix_vectors))
return;
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   c->Header.ReplyQueue =
-   raw_smp_processor_id() % h->nreply_queues;
-   else
-   c->Header.ReplyQueue = reply_queue % h->nreply_queues;
+   c->Header.ReplyQueue = reply_queue;
}
 }
 
@@ -1063,10 +1059,7 @@ static void set_ioaccel1_performant_mode(struct 
ctlr_info *h,
 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->ReplyQueue = reply_queue % h->nreply_queues;
+   cp->ReplyQueue = reply_queue;
/*
 * Set the bits in the address sent down to include:
 *  - performant mode bit (bit 0)
@@ -1087,10 +1080,7 @@ static void set_ioaccel2_tmf_performant_mode(struct 
ctlr_info *h,
/* Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % h->nreply_queues;
+   cp->reply_queue = reply_queue;
/* Set the bits in the address sent down to include:
 *  - performant mode bit not used in ioaccel mode 2
 *  - pull count (bits 0-3)
@@ -1109,10 +1099,7 @@ static void set_ioaccel2_performant_mode(struct 
ctlr_info *h,
 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % h->nreply_queues;
+   cp->reply_queue = reply_queue;
/*
 * Set the bits in the address sent down to include:
 *  - performant mode bit not used in ioaccel mode 2
@@ -1157,6 +1144,8 @@ static void __enqueue_cmd_and_start_io(struct ctlr_info 
*h,
 {
dial_down_lockup_detection_during_fw_flash(h, c);
	atomic_inc(&h->commands_outstanding);
+
+   reply_queue = h->reply_map[raw_smp_processor_id()];
switch (c->cmd_type) {
case CMD_IOACCEL1:
set_ioaccel1_performant_mode(h, c, reply_queue);
@@ -7376,6 +7365,26 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info 
*h)
h->msix_vectors = 0;
 }
 
+static void hpsa_setup_reply_map(struct ctlr_info *h)
+{
+   const struct cpumask *mask;
+   unsigned int queue, cpu;
+
+   for (queue = 0; queue < h->msix_vectors; queue++) {
+   mask = pci_irq_get_affinity(h->pdev, queue);
+   if (!mask)
+   goto fallback;
+
+   for_each_cpu(cpu, mask)
+   h->reply_map[cpu] = queue;
+   }
+   return;
+
+fallback:
+   for_each_possible_cpu(cpu)
+   h->reply_map[cpu] = 0;
+}
+
 /* If 

[PATCH V4 0/4] SCSI: fix selection of reply(hw) queue

2018-03-08 Thread Ming Lei
Hi All,

These patches fix reply queue (virt-queue on virtio-scsi) selection in hpsa,
megaraid_sas and virtio-scsi; this issue can easily cause an IO hang.

The issue is triggered by 84676c1f21e8 ("genirq/affinity: assign vectors
to all possible CPUs"). After 84676c1f21e8, it is easy to end up with one msix
vector mapped only to offline CPUs. If the reply queue is selected from
all allocated msix vectors (reply queues) in a round-robin way, the selected
reply queue may not have any online CPU mapped, and an IO hang results.

Both hpsa and megaraid_sas use a host-wide tagset, so we can't convert the
reply queue to a blk_mq hw queue directly, otherwise IO performance would be
degraded badly according to Kashyap's test. Instead, this patchset sets up one
mapping table for selecting the reply queue, an approach already used by
mpt3sas.

For virtio-scsi, the virt-queue really is a hw queue from the blk-mq point of
view, so we introduce 'force_blk_mq' to fix this issue because: 1) virtio-blk
has been used for years in blk-mq mode; 2) we have discussed recently
that scsi_mq will be enabled by default soon.

gitweb:
https://github.com/ming1/linux/tree/v4.16-rc-select-reply-queue-fix-V4

V4:
- split from the previous patchset
- handle virtio-scsi by force_blk_mq

Ming Lei (4):
  scsi: hpsa: fix selection of reply queue
  scsi: megaraid_sas: fix selection of reply queue
  scsi: introduce force_blk_mq
  scsi: virtio_scsi: fix IO hang caused by irq vector automatic affinity

 drivers/scsi/hosts.c|  1 +
 drivers/scsi/hpsa.c | 73 +
 drivers/scsi/hpsa.h |  1 +
 drivers/scsi/megaraid/megaraid_sas.h|  2 +-
 drivers/scsi/megaraid/megaraid_sas_base.c   | 34 +-
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 12 ++---
 drivers/scsi/virtio_scsi.c  | 59 ++-
 include/scsi/scsi_host.h|  3 ++
 8 files changed, 100 insertions(+), 85 deletions(-)

-- 
2.9.5



Re: [PATCH v4] blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()

2018-03-08 Thread Joseph Qi
Hello Tejun,
Could you please help review this version?

Thanks,
Joseph

On 18/3/6 11:53, Joseph Qi wrote:
> We've triggered a WARNING in blk_throtl_bio() when throttling writeback
> io, which complains that blkg->refcnt is already 0 when calling blkg_get(),
> and then the kernel crashes with an invalid page request.
> After investigating this issue, we've found it is caused by a race
> between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
> below:
> 
> writeback kworker   cgroup_rmdir
>   cgroup_destroy_locked
> kill_css
>   css_killed_ref_fn
> css_killed_work_fn
>   offline_css
> blkcg_css_offline
>   blkcg_bio_issue_check
> rcu_read_lock
> blkg_lookup
>   spin_trylock(q->queue_lock)
>   blkg_destroy
>   spin_unlock(q->queue_lock)
> blk_throtl_bio
> spin_lock_irq(q->queue_lock)
> ...
> spin_unlock_irq(q->queue_lock)
>   rcu_read_unlock
> 
> Since rcu can only prevent the blkg from being released while it is in use,
> blkg->refcnt can be decreased to 0 during blkg_destroy(), which schedules the
> blkg release.
> Then calling blkg_get() in blk_throtl_bio() triggers the WARNING.
> And the corresponding blkg_put() will schedule the blkg release again,
> which results in a double free.
> This race was introduced by commit ae1188963611 ("blkcg: consolidate blkg
> creation in blkcg_bio_issue_check()"). Before that commit, the code would
> look up first and then look up/create again with queue_lock held. Since
> reviving that logic would be a bit drastic, fix it instead by only offlining
> the pd during blkcg_css_offline(), and moving the rest of the destruction
> (especially blkg_put()) into blkcg_css_free(), which should be the right way
> as discussed.
> 
> Fixes: ae1188963611 ("blkcg: consolidate blkg creation in 
> blkcg_bio_issue_check()")
> Reported-by: Jiufei Xue 
> Cc: sta...@vger.kernel.org #4.3+
> Signed-off-by: Joseph Qi 
> ---
>  block/blk-cgroup.c | 57 +++---
>  include/linux/blk-cgroup.h |  2 ++
>  2 files changed, 46 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index c2033a2..450cefd 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -307,11 +307,27 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>   }
>  }
>  
> +static void blkg_pd_offline(struct blkcg_gq *blkg)
> +{
> + int i;
> +
> + lockdep_assert_held(blkg->q->queue_lock);
> +	lockdep_assert_held(&blkg->blkcg->lock);
> +
> + for (i = 0; i < BLKCG_MAX_POLS; i++) {
> + struct blkcg_policy *pol = blkcg_policy[i];
> +
> + if (blkg->pd[i] && !blkg->pd_offline[i] && pol->pd_offline_fn) {
> + pol->pd_offline_fn(blkg->pd[i]);
> + blkg->pd_offline[i] = true;
> + }
> + }
> +}
> +
>  static void blkg_destroy(struct blkcg_gq *blkg)
>  {
>   struct blkcg *blkcg = blkg->blkcg;
>   struct blkcg_gq *parent = blkg->parent;
> - int i;
>  
>   lockdep_assert_held(blkg->q->queue_lock);
>  	lockdep_assert_held(&blkcg->lock);
>  
>  	WARN_ON_ONCE(list_empty(&blkg->q_node));
>  	WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
>  
> - for (i = 0; i < BLKCG_MAX_POLS; i++) {
> - struct blkcg_policy *pol = blkcg_policy[i];
> -
> - if (blkg->pd[i] && pol->pd_offline_fn)
> - pol->pd_offline_fn(blkg->pd[i]);
> - }
> -
>   if (parent) {
>  		blkg_rwstat_add_aux(&parent->stat_bytes, &blkg->stat_bytes);
>  		blkg_rwstat_add_aux(&parent->stat_ios, &blkg->stat_ios);
> @@ -369,6 +378,7 @@ static void blkg_destroy_all(struct request_queue *q)
>   struct blkcg *blkcg = blkg->blkcg;
>  
>  		spin_lock(&blkcg->lock);
> + blkg_pd_offline(blkg);
>   blkg_destroy(blkg);
>  		spin_unlock(&blkcg->lock);
>   }
> @@ -1004,16 +1014,15 @@ static int blkcg_print_stat(struct seq_file *sf, void 
> *v)
>  static void blkcg_css_offline(struct cgroup_subsys_state *css)
>  {
>   struct blkcg *blkcg = css_to_blkcg(css);
> + struct blkcg_gq *blkg;
>  
>  	spin_lock_irq(&blkcg->lock);
>  
> - while (!hlist_empty(>blkg_list)) {
> - struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first,
> - struct blkcg_gq, blkcg_node);
> +	hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
>   struct request_queue *q = blkg->q;
>  
>   if (spin_trylock(q->queue_lock)) {
> -  

[PATCH] block: Suppress kernel-doc warnings triggered by blk-zoned.c

2018-03-08 Thread Bart Van Assche
Avoid that building with W=1 causes the kernel-doc tool to complain
about undocumented function arguments for the blk-zoned.c source file.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Damien Le Moal 
---
 block/blk-zoned.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index acb7252c7e81..08e84ef2bc05 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -296,7 +296,7 @@ int blkdev_reset_zones(struct block_device *bdev,
 }
 EXPORT_SYMBOL_GPL(blkdev_reset_zones);
 
-/**
+/*
  * BLKREPORTZONE ioctl processing.
  * Called from blkdev_ioctl.
  */
@@ -355,7 +355,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, 
fmode_t mode,
return ret;
 }
 
-/**
+/*
  * BLKRESETZONE ioctl processing.
  * Called from blkdev_ioctl.
  */
-- 
2.16.2



[PATCH] blk-mq-debugfs: Show more request state information

2018-03-08 Thread Bart Van Assche
Since commit 634f9e4631a8 ("blk-mq: remove REQ_ATOM_COMPLETE usages
from blk-mq") blk_rq_is_complete() only reports whether or not a
request has completed for legacy queues. Hence modify the
blk-mq-debugfs code such that it shows the blk-mq request state
again.

Fixes: 634f9e4631a8 ("blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq")
Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
---
 block/blk-mq-debugfs.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index bd21d5b9f65f..162e09c02ae7 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -350,6 +350,13 @@ static const char *const rqf_name[] = {
 };
 #undef RQF_NAME
 
+static const char *blk_mq_rq_state_name[4] = {
+   [MQ_RQ_IDLE]= "idle",
+   [MQ_RQ_IN_FLIGHT]   = "in_flight",
+   [MQ_RQ_COMPLETE]= "complete",
+   [3] = "(?)",
+};
+
 int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq)
 {
const struct blk_mq_ops *const mq_ops = rq->q->mq_ops;
@@ -366,7 +373,7 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, struct 
request *rq)
seq_puts(m, ", .rq_flags=");
blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
   ARRAY_SIZE(rqf_name));
-   seq_printf(m, ", complete=%d", blk_rq_is_complete(rq));
+   seq_printf(m, ", .state=%s", blk_mq_rq_state_name[blk_mq_rq_state(rq)]);
seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
   rq->internal_tag);
if (mq_ops->show_rq)
-- 
2.16.2



Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-08 Thread Thomas Gleixner
On Thu, 8 Mar 2018, Ming Lei wrote:
> Actually, it isn't a real fix, the real one is in the following two:
> 
>   0c20244d458e scsi: megaraid_sas: fix selection of reply queue
>   ed6d043be8cd scsi: hpsa: fix selection of reply queue

Where are these commits? Neither the Linus tree nor -next knows anything about
them.

> This patchset can't guarantee that every IRQ vector is assigned at least one
> online CPU; for example, in a quad-socket system with only one processor
> present, some vectors are still assigned only offline CPUs. That is a valid
> case, but it may still cause an IO hang if drivers (hpsa, megaraid_sas)
> select the reply queue in the current way.

So my understanding is that these irq patches are enhancements and not bug
fixes. I'll queue them for 4.17 then.

Thanks,

tglx


Re: [PATCH v2 00/11] Make all concurrent queue flag manipulations safe

2018-03-08 Thread Jens Axboe
On Wed, Mar 07 2018, Bart Van Assche wrote:
> Hello Jens,
> 
> As you probably know there is considerable confusion in the block
> layer core and in block drivers about how to protect queue flag
> changes against concurrent modifications. Some code protects these
> changes with the queue lock, other code uses atomic operations and
> some code does not protect queue flag changes against concurrent
> changes at all. Hence this patch series that protects all queue flag
> changes consistently with the queue lock and that removes functions
> that are not safe in a concurrent context from the public block layer
> header files.
> 
> Please consider these patches for kernel v4.17.
> 
> Note: it may be a good idea to postpone patch 11 until after the
> kernel v4.17 merge window to avoid merge conflicts.

Applied for 4.17. Not too worried about a conflict with that patch, so
applied 11/11 as well.

-- 
Jens Axboe



Re: [PATCH] block: bio_check_eod() needs to consider partition

2018-03-08 Thread Jens Axboe
On Thu, Mar 08 2018, Christoph Hellwig wrote:
> bio_check_eod() should check partiton size not the whole disk if
> bio->bi_partno is non-zero.
> 
> Based on an earlier patch from Jiufei Xue.

This doesn't apply, what did you generate it against?

-- 
Jens Axboe



Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-08 Thread Laurence Oberman
On Thu, 2018-03-08 at 21:42 +0800, Ming Lei wrote:
> On Wed, Mar 07, 2018 at 09:11:37AM -0500, Laurence Oberman wrote:
> > On Tue, 2018-03-06 at 14:24 -0500, Martin K. Petersen wrote:
> > > Ming,
> > > 
> > > > Given both Don and Laurence have verified that patch 1 and
> > > > patch 2
> > > > does fix IO hang, could you consider to merge the two first?
> > > 
> > > Oh, and I would still need a formal Acked-by: from Don and
> > > Tested-by:
> > > from Laurence.
> > > 
> > > Also, for 4.16/scsi-fixes I would prefer verification to be done
> > > with
> > > just patch 1/8 and none of the subsequent changes in place. Just
> > > to
> > > make
> > > sure we're testing the right thing.
> > > 
> > > Thanks!
> > > 
> > 
> > Hello Martin
> > 
> > I tested just Patch 1/8 from the V3 series.
> > No issues running workload and no issues booting on the DL380G7.
> > Don can you ack this so we can at least get this one in.
> > 
> > Against: 4.16.0-rc4.v31of8+ on an x86_64
> > 
> > Tested-by: Laurence Oberman 
> 
> Hi Laurence,
> 
> Thanks for your test!
> 
> Could you test patch 2 too since you have megaraid_sas controller?
> 
> Looks it is better to split the fix patches from the current
> patchset,
> since these fixes should be for V4.16.
> 
> Thanks
> Ming

Hi Ming, I see a V4 now, so I am going to wait until you split these and
then I will test both HPSA and megaraid_sas once Kashyap agrees.

When I see a V4 show up with the split, I will pull it and act on it.

Thanks
Laurence


Re: [PATCH v7 2/2] Return bytes transferred for partial direct I/O

2018-03-08 Thread Goldwyn Rodrigues


On 03/07/2018 06:53 PM, Darrick J. Wong wrote:
> On Thu, Feb 08, 2018 at 12:59:48PM -0600, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues 
>>
>> In case direct I/O encounters an error midway, it returns the error.
>> Instead it should be returning the number of bytes transferred so far.
>>
>> Test case for filesystems (with ENOSPC):
>> 1. Create an almost full filesystem
>> 2. Create a file, say /mnt/lastfile, until the filesystem is full.
>> 3. Direct write() with count > sizeof /mnt/lastfile.
>>
>> Result: write() returns -ENOSPC. However, file content has data written
>> in step 3.
>>
>> Added a sysctl entry: dio_short_writes which is on by default. This is
>> to support applications which expect either and error or the bytes submitted
>> as a return value for the write calls.
>>
>> This fixes fstest generic/472.
>>
>> Signed-off-by: Goldwyn Rodrigues 
>> ---
>>  Documentation/sysctl/fs.txt | 14 ++
>>  fs/block_dev.c  |  2 +-
>>  fs/direct-io.c  |  7 +--
>>  fs/iomap.c  | 23 ---
>>  include/linux/fs.h  |  1 +
>>  kernel/sysctl.c |  9 +
>>  6 files changed, 42 insertions(+), 14 deletions(-)
>>
>> Changes since v1:
>>  - incorporated iomap and block devices
>>
>> Changes since v2:
>>  - realized that file size was not increasing when performing a (partial)
>>direct I/O because end_io function was receiving the error instead of
>>size. Fixed.
>>
>> Changes since v3:
>>  - [hch] initialize transferred with dio->size and use transferred instead
>>of dio->size.
>>
>> Changes since v4:
>>  - Refreshed to v4.14
>>
>> Changes since v5:
>>  - Added /proc/sys/fs/dio_short_writes (default 1) to guard older 
>> applications
>>which expect write(fd, buf, count) returns either count or error.
>>
>> Changes since v6:
>>  - Corrected documentation
>>  - Re-ordered patch
>>
>> diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
>> index 6c00c1e2743f..21582f675985 100644
>> --- a/Documentation/sysctl/fs.txt
>> +++ b/Documentation/sysctl/fs.txt
>> @@ -22,6 +22,7 @@ Currently, these files are in /proc/sys/fs:
>>  - aio-max-nr
>>  - aio-nr
>>  - dentry-state
>> +- dio_short_writes
>>  - dquot-max
>>  - dquot-nr
>>  - file-max
>> @@ -76,6 +77,19 @@ dcache isn't pruned yet.
>>  
>>  ==
>>  
>> +dio_short_writes:
>> +
>> +In case Direct I/O encounters a transient error, it returns
>> +the error code, even if it has performed part of the write.
>> +This flag, if on (default), will return the number of bytes written
>> +so far, as the write(2) semantics are. However, some older applications
>> +still consider a direct write as an error if all of the I/O
>> +submitted is not complete. I.e. write(file, count, buf) != count.
>> +This option can be disabled on systems in order to support
>> +existing applications which do not expect short writes.
>> +
>> +==
>> +
>>  dquot-max & dquot-nr:
>>  
>>  The file dquot-max shows the maximum number of cached disk
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 4a181fcb5175..49d94360bb51 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -409,7 +409,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
>> *iter, int nr_pages)
>>  
>>  if (!ret)
>>  ret = blk_status_to_errno(dio->bio.bi_status);
>> -if (likely(!ret))
>> +if (likely(dio->size))
>>  ret = dio->size;
>>  
>> 	bio_put(&dio->bio);
>> diff --git a/fs/direct-io.c b/fs/direct-io.c
>> index 3aafb3343a65..9bd15be64c25 100644
>> --- a/fs/direct-io.c
>> +++ b/fs/direct-io.c
>> @@ -151,6 +151,7 @@ struct dio {
>>  } cacheline_aligned_in_smp;
>>  
>>  static struct kmem_cache *dio_cache __read_mostly;
>> +unsigned int sysctl_dio_short_writes = 1;
>>  
>>  /*
>>   * How many pages are in the queue?
>> @@ -262,7 +263,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t 
>> ret, unsigned int flags)
>>  ret = dio->page_errors;
>>  if (ret == 0)
>>  ret = dio->io_error;
>> -if (ret == 0)
>> +if (!sysctl_dio_short_writes && (ret == 0))
>>  ret = transferred;
>>  
>>  if (dio->end_io) {
>> @@ -310,7 +311,9 @@ static ssize_t dio_complete(struct dio *dio, ssize_t 
>> ret, unsigned int flags)
>>  }
>>  
>>  kmem_cache_free(dio_cache, dio);
>> -return ret;
>> +if (!sysctl_dio_short_writes)
>> +return ret;
>> +return transferred ? transferred : ret;
>>  }
>>  
>>  static void dio_aio_complete_work(struct work_struct *work)
>> diff --git a/fs/iomap.c b/fs/iomap.c
>> index 47d29ccffaef..a8d6908dc0de 100644
>> --- a/fs/iomap.c
>> +++ b/fs/iomap.c
>> @@ -716,23 +716,24 @@ static ssize_t iomap_dio_complete(struct iomap_dio 
>> *dio)
>>  struct kiocb *iocb = dio->iocb;
>>  struct inode *inode = 

Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-08 Thread Bart Van Assche
On Thu, 2018-03-08 at 09:41 +0100, Hannes Reinecke wrote:
> IE the _entire_ request set is allocated as _one_ array, making it quite
> hard to handle from the lower-level CPU caches.
> Also the 'node' indicator doesn't really help us here, as the requests
> have to be access by all CPUs in the shared tag case.
> 
> Would it be possible move tags->rqs to become a _double_ pointer?
> Then we would have only a shared lookup table, but the requests
> themselves can be allocated per node, depending on the CPU map.
> _And_ it should be easier on the CPU cache ...

That is one possible solution. Another solution is to follow the approach
from sbitmap: allocate a single array that is slightly larger than needed
and use the elements in such a way that no two CPUs use the same cache
line.
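
As a minimal sketch of that idea (made-up names, and only one simple way to
realize it; this is not the sbitmap or blk-mq code): pad each element so that
entries indexed by different CPUs never share a cache line, which makes the
single array slightly larger than strictly needed:

	#include <linux/cache.h>
	#include <linux/slab.h>

	struct example_rq_slot {
		struct request *rq;
	} ____cacheline_aligned_in_smp;	/* each slot occupies a full cache line */

	/* one shared lookup table; the padding is what costs the extra space */
	static struct example_rq_slot *example_alloc_rq_table(unsigned int nr_tags)
	{
		return kcalloc(nr_tags, sizeof(struct example_rq_slot), GFP_KERNEL);
	}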

Bart.






Re: [PATCH] block: bio_check_eod() needs to consider partition

2018-03-08 Thread Christoph Hellwig
On Thu, Mar 08, 2018 at 04:17:19PM +0800, Jiufei Xue wrote:
> Hi Christoph,
> 
> On 2018/3/8 下午3:46, Christoph Hellwig wrote:
> > bio_check_eod() should check partiton size not the whole disk if
> > bio->bi_partno is non-zero.
> > 
> I think the check should be done twice if the bio->bi_partno is not zero,
> one for the partition, and another for the whole disk after remap which
> is add in the commit 5ddfe9691c91
> ("md: check bio address after mapping through partitions").

I disagree.  See my other mail - if it fits inside the partition the
partition rescan code already made sure it also fits inside the device.


Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-08 Thread Ming Lei
On Wed, Mar 07, 2018 at 09:11:37AM -0500, Laurence Oberman wrote:
> On Tue, 2018-03-06 at 14:24 -0500, Martin K. Petersen wrote:
> > Ming,
> > 
> > > Given both Don and Laurence have verified that patch 1 and patch 2
> > > does fix IO hang, could you consider to merge the two first?
> > 
> > Oh, and I would still need a formal Acked-by: from Don and Tested-by:
> > from Laurence.
> > 
> > Also, for 4.16/scsi-fixes I would prefer verification to be done with
> > just patch 1/8 and none of the subsequent changes in place. Just to
> > make
> > sure we're testing the right thing.
> > 
> > Thanks!
> > 
> 
> Hello Martin
> 
> I tested just Patch 1/8 from the V3 series.
> No issues running workload and no issues booting on the DL380G7.
> Don can you ack this so we can at least get this one in.
> 
> Against: 4.16.0-rc4.v31of8+ on an x86_64
> 
> Tested-by: Laurence Oberman 

Hi Laurence,

Thanks for your test!

Could you test patch 2 too since you have megaraid_sas controller?

Looks it is better to split the fix patches from the current patchset,
since these fixes should be for V4.16.

Thanks
Ming


Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-08 Thread Ming Lei
On Thu, Mar 08, 2018 at 03:18:33PM +0200, Artem Bityutskiy wrote:
> On Thu, 2018-03-08 at 18:53 +0800, Ming Lei wrote:
> > Hi,
> > 
> > This patchset tries to spread among online CPUs as far as possible, so
> > that we can avoid to allocate too less irq vectors with online CPUs
> > mapped.
> > 
> > For example, in a 8cores system, 4 cpu cores(4~7) are offline/non present,
> > on a device with 4 queues:
> > 
> > 1) before this patchset
> > irq 39, cpu list 0-2
> > irq 40, cpu list 3-4,6
> > irq 41, cpu list 5
> > irq 42, cpu list 7
> > 
> > 2) after this patchset
> > irq 39, cpu list 0,4
> > irq 40, cpu list 1,6
> > irq 41, cpu list 2,5
> > irq 42, cpu list 3,7
> > 
> > Without this patchset, only two vectors(39, 40) can be active, but there
> > can be 4 active irq vectors after applying this patchset.
> 
> Tested-by: Artem Bityutskiy 
> Link: https://lkml.kernel.org/r/1519311270.2535.53.ca...@intel.com

Hi Artem,

Thanks for your test!

> 
> Ming,
> 
> this patchset fixes the v4.16-rcX regression that I reported few weeks
> ago. I applied it and verified that Dell R640 server that I mentioned
> in the bug report boots up and the disk works.
> 
> So this is not just an improvement, it also includes a bugfix. 

Actually, it isn't a real fix, the real one is in the following two:

0c20244d458e scsi: megaraid_sas: fix selection of reply queue
ed6d043be8cd scsi: hpsa: fix selection of reply queue

This patchset can't guarantee that every IRQ vector is assigned at least one
online CPU; for example, in a quad-socket system with only one processor
present, some vectors are still assigned only offline CPUs. That is a valid
case, but it may still cause an IO hang if drivers (hpsa, megaraid_sas)
select the reply queue in the current way.

Thanks,
Ming


Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-08 Thread Artem Bityutskiy
On Thu, 2018-03-08 at 15:18 +0200, Artem Bityutskiy wrote:
> Tested-by: Artem Bityutskiy 
> Link: https://lkml.kernel.org/r/1519311270.2535.53.ca...@intel.com

And for completeness:
Linux-Regression-ID: lr#15a115


Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-08 Thread Artem Bityutskiy
On Thu, 2018-03-08 at 18:53 +0800, Ming Lei wrote:
> Hi,
> 
> This patchset tries to spread among online CPUs as far as possible, so
> that we can avoid to allocate too less irq vectors with online CPUs
> mapped.
> 
> For example, in a 8cores system, 4 cpu cores(4~7) are offline/non present,
> on a device with 4 queues:
> 
> 1) before this patchset
>   irq 39, cpu list 0-2
>   irq 40, cpu list 3-4,6
>   irq 41, cpu list 5
>   irq 42, cpu list 7
> 
> 2) after this patchset
>   irq 39, cpu list 0,4
>   irq 40, cpu list 1,6
>   irq 41, cpu list 2,5
>   irq 42, cpu list 3,7
> 
> Without this patchset, only two vectors(39, 40) can be active, but there
> can be 4 active irq vectors after applying this patchset.

Tested-by: Artem Bityutskiy 
Link: https://lkml.kernel.org/r/1519311270.2535.53.ca...@intel.com

Ming,

this patchset fixes the v4.16-rcX regression that I reported a few weeks
ago. I applied it and verified that the Dell R640 server that I mentioned
in the bug report boots up and the disk works.

So this is not just an improvement, it also includes a bugfix. 

Thanks!



Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via .host_tagset

2018-03-08 Thread Ming Lei
On Thu, Mar 08, 2018 at 07:06:25PM +0800, Ming Lei wrote:
> On Thu, Mar 08, 2018 at 03:34:31PM +0530, Kashyap Desai wrote:
> > > -Original Message-
> > > From: Ming Lei [mailto:ming@redhat.com]
> > > Sent: Thursday, March 8, 2018 6:46 AM
> > > To: Kashyap Desai
> > > Cc: Jens Axboe; linux-block@vger.kernel.org; Christoph Hellwig; Mike
> > Snitzer;
> > > linux-s...@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar Sandoval;
> > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > Peter
> > > Rivera; Laurence Oberman
> > > Subject: Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance
> > via
> > > .host_tagset
> > >
> > > On Wed, Mar 07, 2018 at 10:58:34PM +0530, Kashyap Desai wrote:
> > > > > >
> > > > > > Also one observation using V3 series patch. I am seeing below
> > > > > > Affinity mapping whereas I have only 72 logical CPUs.  It means we
> > > > > > are really not going to use all reply queues.
> > > > > > e.a If I bind fio jobs on CPU 18-20, I am seeing only one reply
> > > > > > queue is used and that may lead to performance drop as well.
> > > > >
> > > > > If the mapping is in such shape, I guess it should be quite
> > > > > difficult to
> > > > figure out
> > > > > one perfect way to solve this situation because one reply queue has
> > > > > to
> > > > handle
> > > > > IOs submitted from 4~5 CPUs at average.
> > > >
> > > > 4.15.0-rc1 kernel has below mapping - I am not sure which commit id in
> > "
> > > > linux_4.16-rc-host-tags-v3.2" is changing the mapping of IRQ to CPU.
> > > > It
> > >
> > > I guess the mapping you posted is read from /proc/irq/126/smp_affinity.
> > >
> > > If yes, no any patch in linux_4.16-rc-host-tags-v3.2 should change IRQ
> > affinity
> > > code, which is done in irq_create_affinity_masks(), as you saw, no any
> > patch
> > > in linux_4.16-rc-host-tags-v3.2 touches that code.
> > >
> > > Could you simply apply the patches in linux_4.16-rc-host-tags-v3.2
> > against
> > > 4.15-rc1 kernel and see any difference?
> > >
> > > > will be really good if we can fall back to below mapping once again.
> > > > Current repo linux_4.16-rc-host-tags-v3.2 is giving lots of random
> > > > mapping of CPU - MSIx. And that will be problematic in performance
> > run.
> > > >
> > > > As I posted earlier, latest repo will only allow us to use *18* reply
> > >
> > > Looks not see this report before, could you share us how you conclude
> > that?
> > > The only patch changing reply queue is the following one:
> > >
> > >   https://marc.info/?l=linux-block&m=151972611911593&w=2
> > >
> > > But not see any issue in this patch yet, can you recover to 72 reply
> > queues
> > > after reverting the patch in above link?
> > Ming -
> > 
> > While testing, my system went bad. I debug further and understood that
> > affinity mapping was changed due to below commit -
> > 84676c1f21e8ff54befe985f4f14dc1edc10046b
> > 
> > [PATCH] genirq/affinity: assign vectors to all possible CPUs
> > 
> > Because of above change, we end up using very less reply queue. Many reply
> > queues on my setup was mapped to offline/not-available CPUs. This may be
> > primary contributing to odd performance impact and it may not be truly due
> > to V3/V4 patch series.
> 
> Seems a good news, :-)
> 
> > 
> > I am planning to check your V3 and V4 series after removing above commit
> > ID (for performance impact.).
> 
> You can run your test on a server in which all CPUs are kept as online
> for avoiding this issue.
> 
> Or you can apply the following patchset for avoiding this issue:
> 
>   https://marc.info/?l=linux-block=152050646332092=2

If you want to do this way, all patches have been put into the following
tree(V4):

https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v4

#in reverse order
genirq/affinity: irq vector spread among online CPUs as far as possible
genirq/affinity: support to do irq vectors spread starting from any vector
genirq/affinity: move actual irq vector spread into one helper
genirq/affinity: rename *node_to_possible_cpumask as *node_to_cpumask
scsi: megaraid: improve scsi_mq performance via .host_tagset
scsi: hpsa: improve scsi_mq performance via .host_tagset
block: null_blk: introduce module parameter of 'g_host_tags'
scsi: Add template flag 'host_tagset'
blk-mq: introduce BLK_MQ_F_HOST_TAGS
blk-mq: introduce 'start_tag' field to 'struct blk_mq_tags'
scsi: avoid to hold host_busy for scsi_mq
scsi: read host_busy via scsi_host_busy()
scsi: introduce scsi_host_busy()
scsi: virtio_scsi: fix IO hang caused by irq vector automatic affinity
scsi: introduce force_blk_mq
scsi: megaraid_sas: fix selection of reply queue
scsi: hpsa: fix selection of reply queue


Thanks,
Ming


Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via .host_tagset

2018-03-08 Thread Ming Lei
On Thu, Mar 08, 2018 at 03:34:31PM +0530, Kashyap Desai wrote:
> > -Original Message-
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Thursday, March 8, 2018 6:46 AM
> > To: Kashyap Desai
> > Cc: Jens Axboe; linux-block@vger.kernel.org; Christoph Hellwig; Mike
> Snitzer;
> > linux-s...@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar Sandoval;
> > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
> > Rivera; Laurence Oberman
> > Subject: Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance
> via
> > .host_tagset
> >
> > On Wed, Mar 07, 2018 at 10:58:34PM +0530, Kashyap Desai wrote:
> > > > >
> > > > > Also one observation using V3 series patch. I am seeing below
> > > > > Affinity mapping whereas I have only 72 logical CPUs.  It means we
> > > > > are really not going to use all reply queues.
> > > > > e.a If I bind fio jobs on CPU 18-20, I am seeing only one reply
> > > > > queue is used and that may lead to performance drop as well.
> > > >
> > > > If the mapping is in such shape, I guess it should be quite
> > > > difficult to
> > > figure out
> > > > one perfect way to solve this situation because one reply queue has
> > > > to
> > > handle
> > > > IOs submitted from 4~5 CPUs at average.
> > >
> > > 4.15.0-rc1 kernel has below mapping - I am not sure which commit id in
> "
> > > linux_4.16-rc-host-tags-v3.2" is changing the mapping of IRQ to CPU.
> > > It
> >
> > I guess the mapping you posted is read from /proc/irq/126/smp_affinity.
> >
> > If yes, no any patch in linux_4.16-rc-host-tags-v3.2 should change IRQ
> affinity
> > code, which is done in irq_create_affinity_masks(), as you saw, no any
> patch
> > in linux_4.16-rc-host-tags-v3.2 touches that code.
> >
> > Could you simply apply the patches in linux_4.16-rc-host-tags-v3.2
> against
> > 4.15-rc1 kernel and see any difference?
> >
> > > will be really good if we can fall back to below mapping once again.
> > > Current repo linux_4.16-rc-host-tags-v3.2 is giving lots of random
> > > mapping of CPU - MSIx. And that will be problematic in performance
> run.
> > >
> > > As I posted earlier, latest repo will only allow us to use *18* reply
> >
> > Looks not see this report before, could you share us how you conclude
> that?
> > The only patch changing reply queue is the following one:
> >
> > https://marc.info/?l=linux-block&m=151972611911593&w=2
> >
> > But not see any issue in this patch yet, can you recover to 72 reply
> queues
> > after reverting the patch in above link?
> Ming -
> 
> While testing, my system went bad. I debug further and understood that
> affinity mapping was changed due to below commit -
> 84676c1f21e8ff54befe985f4f14dc1edc10046b
> 
> [PATCH] genirq/affinity: assign vectors to all possible CPUs
> 
> Because of above change, we end up using very less reply queue. Many reply
> queues on my setup was mapped to offline/not-available CPUs. This may be
> primary contributing to odd performance impact and it may not be truly due
> to V3/V4 patch series.

Seems a good news, :-)

> 
> I am planning to check your V3 and V4 series after removing above commit
> ID (for performance impact.).

You can run your test on a server in which all CPUs are kept as online
for avoiding this issue.

Or you can apply the following patchset for avoiding this issue:

    https://marc.info/?l=linux-block&m=152050646332092&w=2

> 
> It is good if we spread possible CPUs (instead of online cpus) to all irq
> vectors  considering -  We should have at least *one* online CPU mapped to
> the vector.

Right, that is exactly what the above patchset does.
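
As a very rough sketch of that two-stage idea (example_spread() is a
hypothetical helper; the real code lives in kernel/irq/affinity.c in
irq_create_affinity_masks()/irq_build_affinity_masks()): vectors are first
spread across online CPUs, then the remaining possible-but-offline CPUs are
spread over the same vectors, so each vector keeps at least one online CPU
whenever the topology allows it:

	static int example_build_affinity_masks(int nvecs, struct cpumask *masks)
	{
		cpumask_var_t rest;
		int assigned;

		if (!zalloc_cpumask_var(&rest, GFP_KERNEL))
			return -ENOMEM;

		/* stage 1: spread vectors among currently online CPUs first */
		assigned = example_spread(nvecs, cpu_online_mask, masks);

		/* stage 2: spread the remaining possible (offline) CPUs over the
		 * same vectors, so later CPU hotplug needs no remapping
		 */
		cpumask_andnot(rest, cpu_possible_mask, cpu_online_mask);
		example_spread(nvecs, rest, masks);

		free_cpumask_var(rest);
		return assigned;
	}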

Thanks,
Ming


Re: [PATCH V3 7/8] scsi: hpsa: improve scsi_mq performance via .host_tagset

2018-03-08 Thread Ming Lei
On Thu, Mar 08, 2018 at 08:54:43AM +0100, Christoph Hellwig wrote:
> > +   /* 256 tags should be high enough to saturate device */
> > +   int max_queues = DIV_ROUND_UP(h->scsi_host->can_queue, 256);
> > +
> > +   /* per NUMA node hw queue */
> > +   h->scsi_host->nr_hw_queues = min_t(int, nr_node_ids, max_queues);
> 
> I don't think this magic should be in a driver.  The per-node hw_queue
> selection seems like something we'd better do in the core code.

The thing is that driver code may need to know whether multiple queues are
used; the driver can then partition its own resources into multiple hw queues
and improve its .queuecommand and .complete_command paths. That seems to be
what megaraid_sas should do next.

> 
> Also the whole idea to use nr_hw_queues for just partitioning tag
> space on hardware that doesn't really support multiple hardware queues
> seems more than odd.

The per-node hw queue is used together with BLK_MQ_F_HOST_TAGS, which is
really for improving the single-queue case (single tagset). If the
driver/device supports real multiple hw queues, this approach isn't needed.

Thanks,
Ming


[PATCH V3 2/4] genirq/affinity: move actual irq vector spread into one helper

2018-03-08 Thread Ming Lei
No functional change; this just prepares for converting to a 2-stage
irq vector spread.

Cc: Thomas Gleixner 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 97 +--
 1 file changed, 55 insertions(+), 42 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 4b1c4763212d..e119e86bed48 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,50 +94,19 @@ static int get_nodes_in_cpumask(cpumask_var_t 
*node_to_cpumask,
return nodes;
 }
 
-/**
- * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
- * @nvecs: The total number of vectors
- * @affd:  Description of the affinity requirements
- *
- * Returns the masks pointer or NULL if allocation failed.
- */
-struct cpumask *
-irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
+static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
+   cpumask_var_t *node_to_cpumask,
+   const struct cpumask *cpu_mask,
+   struct cpumask *nmsk,
+   struct cpumask *masks)
 {
-   int n, nodes, cpus_per_vec, extra_vecs, curvec;
int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int last_affv = affv + affd->pre_vectors;
+   int curvec = affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
-   struct cpumask *masks;
-   cpumask_var_t nmsk, *node_to_cpumask;
-
-   /*
-* If there aren't any vectors left after applying the pre/post
-* vectors don't bother with assigning affinity.
-*/
-   if (!affv)
-   return NULL;
-
-   if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
-   return NULL;
-
-   masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
-   if (!masks)
-   goto out;
+   int n, nodes, cpus_per_vec, extra_vecs;
 
-   node_to_cpumask = alloc_node_to_cpumask();
-   if (!node_to_cpumask)
-   goto out;
-
-   /* Fill out vectors at the beginning that don't need affinity */
-   for (curvec = 0; curvec < affd->pre_vectors; curvec++)
-   cpumask_copy(masks + curvec, irq_default_affinity);
-
-   /* Stabilize the cpumasks */
-   get_online_cpus();
-   build_node_to_cpumask(node_to_cpumask);
-   nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_possible_mask,
-    &nodemsk);
+   nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
/*
 * If the number of nodes in the mask is greater than or equal the
@@ -150,7 +119,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
if (++curvec == last_affv)
break;
}
-   goto done;
+   goto out;
}
 
for_each_node_mask(n, nodemsk) {
@@ -160,7 +129,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
/* Get the cpus on this node which are in the mask */
-   cpumask_and(nmsk, cpu_possible_mask, node_to_cpumask[n]);
+   cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
 
/* Calculate the number of cpus per vector */
ncpus = cpumask_weight(nmsk);
@@ -186,7 +155,51 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
--nodes;
}
 
-done:
+out:
+   return curvec - affd->pre_vectors;
+}
+
+/**
+ * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
+ * @nvecs: The total number of vectors
+ * @affd:  Description of the affinity requirements
+ *
+ * Returns the masks pointer or NULL if allocation failed.
+ */
+struct cpumask *
+irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
+{
+   int curvec;
+   struct cpumask *masks;
+   cpumask_var_t nmsk, *node_to_cpumask;
+
+   /*
+* If there aren't any vectors left after applying the pre/post
+* vectors don't bother with assigning affinity.
+*/
+   if (nvecs == affd->pre_vectors + affd->post_vectors)
+   return NULL;
+
+   if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+   return NULL;
+
+   masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
+   if (!masks)
+   goto out;
+
+   node_to_cpumask = alloc_node_to_cpumask();
+   if (!node_to_cpumask)
+   goto out;
+
+   /* Fill out vectors at the beginning that don't need affinity */
+   for (curvec = 0; curvec < affd->pre_vectors; curvec++)
+   cpumask_copy(masks + curvec, irq_default_affinity);
+
+   /* Stabilize the cpumasks */
+   

[PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-08 Thread Ming Lei
84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
may cause an irq vector to be assigned only to offline CPUs, and this kind of
assignment may leave far fewer irq vectors mapped to online CPUs, so
performance may get hurt.

For example, in an 8-core system with CPUs 0-3 online and 4-7 offline/not present,
see 'lscpu':

[ming@box]$lscpu
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):4
On-line CPU(s) list:   0-3
Thread(s) per core:1
Core(s) per socket:2
Socket(s): 2
NUMA node(s):  2
...
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s):
...

For example, one device has 4 queues:

1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0
irq 40, cpu list 1
irq 41, cpu list 2
irq 42, cpu list 3

2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0-2
irq 40, cpu list 3-4,6
irq 41, cpu list 5
irq 42, cpu list 7

3) after applying this patch against V4.15+:
irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

This patch tries to spread irq vectors among online CPUs as far as
possible by doing the spread in two stages.

The above assignment 3) isn't the optimal result from a NUMA point of view, but
it leaves more irq vectors with an online CPU mapped; given that in reality one
CPU should be enough to handle one irq vector, it is better to do it this way.

Cc: Thomas Gleixner 
Reviewed-by: Christoph Hellwig 
Reported-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 35 +--
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 616f040c5d02..253c5bf85d18 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -106,6 +106,9 @@ static int irq_build_affinity_masks(const struct 
irq_affinity *affd,
nodemask_t nodemsk = NODE_MASK_NONE;
int n, nodes, cpus_per_vec, extra_vecs, done = 0;
 
+   if (!cpumask_weight(cpu_mask))
+   return 0;
+
nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
/*
@@ -175,9 +178,9 @@ struct cpumask *
 irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 {
int affv = nvecs - affd->pre_vectors - affd->post_vectors;
-   int curvec;
+   int curvec, vecs_offline, vecs_online;
struct cpumask *masks;
-   cpumask_var_t nmsk, *node_to_cpumask;
+   cpumask_var_t nmsk, cpu_mask, *node_to_cpumask;
 
/*
 * If there aren't any vectors left after applying the pre/post
@@ -193,9 +196,12 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
if (!masks)
goto out;
 
+   if (!alloc_cpumask_var(&cpu_mask, GFP_KERNEL))
+   goto out;
+
node_to_cpumask = alloc_node_to_cpumask();
if (!node_to_cpumask)
-   goto out;
+   goto out_free_cpu_mask;
 
/* Fill out vectors at the beginning that don't need affinity */
for (curvec = 0; curvec < affd->pre_vectors; curvec++)
@@ -204,15 +210,32 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
-   curvec += irq_build_affinity_masks(affd, curvec, affv,
-  node_to_cpumask,
-  cpu_possible_mask, nmsk, masks);
+   /* spread on online CPUs starting from the vector of affd->pre_vectors */
+   vecs_online = irq_build_affinity_masks(affd, curvec, affv,
+  node_to_cpumask,
+  cpu_online_mask, nmsk, masks);
+
+   /* spread on offline CPUs starting from the next vector to be handled */
+   if (vecs_online >= affv)
+   curvec = affd->pre_vectors;
+   else
+   curvec = affd->pre_vectors + vecs_online;
+   cpumask_andnot(cpu_mask, cpu_possible_mask, cpu_online_mask);
+   vecs_offline = irq_build_affinity_masks(affd, curvec, affv,
+   node_to_cpumask,
+   cpu_mask, nmsk, masks);
put_online_cpus();
 
/* Fill out vectors at the end that don't need affinity */
+   if (vecs_online + vecs_offline >= affv)
+   curvec = affv + affd->pre_vectors;
+   else
+   curvec = affd->pre_vectors + vecs_online + vecs_offline;
for (; curvec < nvecs; curvec++)
cpumask_copy(masks + curvec, 

[PATCH V3 3/4] genirq/affinity: support to do irq vectors spread starting from any vector

2018-03-08 Thread Ming Lei
Now two parameters (start_vec, affv) are introduced to irq_build_affinity_masks(),
so this helper can build the affinity of each irq vector starting from
the vector at 'start_vec', and handle at most 'affv' vectors.

This is required to do the 2-stage irq vector spread among all
possible CPUs.

Cc: Thomas Gleixner 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index e119e86bed48..616f040c5d02 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,17 +94,17 @@ static int get_nodes_in_cpumask(cpumask_var_t 
*node_to_cpumask,
return nodes;
 }
 
-static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
+static int irq_build_affinity_masks(const struct irq_affinity *affd,
+   int start_vec, int affv,
cpumask_var_t *node_to_cpumask,
const struct cpumask *cpu_mask,
struct cpumask *nmsk,
struct cpumask *masks)
 {
-   int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int last_affv = affv + affd->pre_vectors;
-   int curvec = affd->pre_vectors;
+   int curvec = start_vec;
nodemask_t nodemsk = NODE_MASK_NONE;
-   int n, nodes, cpus_per_vec, extra_vecs;
+   int n, nodes, cpus_per_vec, extra_vecs, done = 0;
 
nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
@@ -116,8 +116,10 @@ static int irq_build_affinity_masks(int nvecs, const 
struct irq_affinity *affd,
for_each_node_mask(n, nodemsk) {
cpumask_copy(masks + curvec,
 node_to_cpumask[n]);
-   if (++curvec == last_affv)
+   if (++done == affv)
break;
+   if (++curvec == last_affv)
+   curvec = affd->pre_vectors;
}
goto out;
}
@@ -150,13 +152,16 @@ static int irq_build_affinity_masks(int nvecs, const 
struct irq_affinity *affd,
irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
}
 
-   if (curvec >= last_affv)
+   done += v;
+   if (done >= affv)
break;
+   if (curvec >= last_affv)
+   curvec = affd->pre_vectors;
--nodes;
}
 
 out:
-   return curvec - affd->pre_vectors;
+   return done;
 }
 
 /**
@@ -169,6 +174,7 @@ static int irq_build_affinity_masks(int nvecs, const struct 
irq_affinity *affd,
 struct cpumask *
 irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 {
+   int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int curvec;
struct cpumask *masks;
cpumask_var_t nmsk, *node_to_cpumask;
@@ -198,7 +204,8 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
-   curvec += irq_build_affinity_masks(nvecs, affd, node_to_cpumask,
+   curvec += irq_build_affinity_masks(affd, curvec, affv,
+  node_to_cpumask,
   cpu_possible_mask, nmsk, masks);
put_online_cpus();
 
-- 
2.9.5



[PATCH V3 1/4] genirq/affinity: rename *node_to_possible_cpumask as *node_to_cpumask

2018-03-08 Thread Ming Lei
The following patches will introduce a two-stage irq spread for improving
the irq spread on all possible CPUs.

No functional change.

Cc: Thomas Gleixner 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a37a3b4b6342..4b1c4763212d 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, 
struct cpumask *nmsk,
}
 }
 
-static cpumask_var_t *alloc_node_to_possible_cpumask(void)
+static cpumask_var_t *alloc_node_to_cpumask(void)
 {
cpumask_var_t *masks;
int node;
@@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_possible_cpumask(void)
return NULL;
 }
 
-static void free_node_to_possible_cpumask(cpumask_var_t *masks)
+static void free_node_to_cpumask(cpumask_var_t *masks)
 {
int node;
 
@@ -71,7 +71,7 @@ static void free_node_to_possible_cpumask(cpumask_var_t 
*masks)
kfree(masks);
 }
 
-static void build_node_to_possible_cpumask(cpumask_var_t *masks)
+static void build_node_to_cpumask(cpumask_var_t *masks)
 {
int cpu;
 
@@ -79,14 +79,14 @@ static void build_node_to_possible_cpumask(cpumask_var_t 
*masks)
cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
 }
 
-static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
const struct cpumask *mask, nodemask_t *nodemsk)
 {
int n, nodes = 0;
 
/* Calculate the number of nodes in the supplied affinity mask */
for_each_node(n) {
-   if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
+   if (cpumask_intersects(mask, node_to_cpumask[n])) {
node_set(n, *nodemsk);
nodes++;
}
@@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
int last_affv = affv + affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
struct cpumask *masks;
-   cpumask_var_t nmsk, *node_to_possible_cpumask;
+   cpumask_var_t nmsk, *node_to_cpumask;
 
/*
 * If there aren't any vectors left after applying the pre/post
@@ -125,8 +125,8 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
if (!masks)
goto out;
 
-   node_to_possible_cpumask = alloc_node_to_possible_cpumask();
-   if (!node_to_possible_cpumask)
+   node_to_cpumask = alloc_node_to_cpumask();
+   if (!node_to_cpumask)
goto out;
 
/* Fill out vectors at the beginning that don't need affinity */
@@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
 
/* Stabilize the cpumasks */
get_online_cpus();
-   build_node_to_possible_cpumask(node_to_possible_cpumask);
-   nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_possible_mask,
+   build_node_to_cpumask(node_to_cpumask);
+   nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_possible_mask,
    &nodemsk);
 
/*
@@ -146,7 +146,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
if (affv <= nodes) {
for_each_node_mask(n, nodemsk) {
cpumask_copy(masks + curvec,
-node_to_possible_cpumask[n]);
+node_to_cpumask[n]);
if (++curvec == last_affv)
break;
}
@@ -160,7 +160,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
/* Get the cpus on this node which are in the mask */
-   cpumask_and(nmsk, cpu_possible_mask, node_to_possible_cpumask[n]);
+   cpumask_and(nmsk, cpu_possible_mask, node_to_cpumask[n]);
 
/* Calculate the number of cpus per vector */
ncpus = cpumask_weight(nmsk);
@@ -192,7 +192,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
/* Fill out vectors at the end that don't need affinity */
for (; curvec < nvecs; curvec++)
cpumask_copy(masks + curvec, irq_default_affinity);
-   free_node_to_possible_cpumask(node_to_possible_cpumask);
+   free_node_to_cpumask(node_to_cpumask);
 out:
free_cpumask_var(nmsk);
return masks;
-- 
2.9.5



Re: [PATCH V2 3/5] genirq/affinity: move actual irq vector spread into one helper

2018-03-08 Thread Ming Lei
On Tue, Mar 06, 2018 at 12:28:32AM +0800, kbuild test robot wrote:
> Hi Ming,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on tip/irq/core]
> [also build test WARNING on v4.16-rc4 next-20180305]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Ming-Lei/genirq-affinity-irq-vector-spread-among-online-CPUs-as-far-as-possible/20180305-184912
> config: i386-randconfig-a1-201809 (attached as .config)
> compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386 
> 
> All warnings (new ones prefixed by >>):
> 
>kernel/irq/affinity.c: In function 'irq_create_affinity_masks':
> >> kernel/irq/affinity.c:201:50: warning: passing argument 3 of 
> >> 'irq_build_affinity_masks' from incompatible pointer type
>  curvec += irq_build_affinity_masks(nvecs, affd, node_to_cpumask,
>  ^
>kernel/irq/affinity.c:97:12: note: expected 'const struct cpumask (*)[1]' 
> but argument is of type 'struct cpumask (*)[1]'
> static int irq_build_affinity_masks(int nvecs, const struct irq_affinity 
> *affd,

It looks like this warning can only be triggered on ARCH=i386 with gcc-4.X.

I can't reproduce it when building on other ARCHs, and can't reproduce
it with gcc-6 either.


Thanks,
Ming


RE: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via .host_tagset

2018-03-08 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Thursday, March 8, 2018 6:46 AM
> To: Kashyap Desai
> Cc: Jens Axboe; linux-block@vger.kernel.org; Christoph Hellwig; Mike
Snitzer;
> linux-s...@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Laurence Oberman
> Subject: Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance
via
> .host_tagset
>
> On Wed, Mar 07, 2018 at 10:58:34PM +0530, Kashyap Desai wrote:
> > > >
> > > > Also one observation using V3 series patch. I am seeing below
> > > > Affinity mapping whereas I have only 72 logical CPUs.  It means we
> > > > are really not going to use all reply queues.
> > > > e.a If I bind fio jobs on CPU 18-20, I am seeing only one reply
> > > > queue is used and that may lead to performance drop as well.
> > >
> > > If the mapping is in such shape, I guess it should be quite
> > > difficult to
> > figure out
> > > one perfect way to solve this situation because one reply queue has
> > > to
> > handle
> > > IOs submitted from 4~5 CPUs at average.
> >
> > 4.15.0-rc1 kernel has below mapping - I am not sure which commit id in
"
> > linux_4.16-rc-host-tags-v3.2" is changing the mapping of IRQ to CPU.
> > It
>
> I guess the mapping you posted is read from /proc/irq/126/smp_affinity.
>
> If yes, no any patch in linux_4.16-rc-host-tags-v3.2 should change IRQ
affinity
> code, which is done in irq_create_affinity_masks(), as you saw, no any
patch
> in linux_4.16-rc-host-tags-v3.2 touches that code.
>
> Could you simply apply the patches in linux_4.16-rc-host-tags-v3.2
against
> 4.15-rc1 kernel and see any difference?
>
> > will be really good if we can fall back to below mapping once again.
> > Current repo linux_4.16-rc-host-tags-v3.2 is giving lots of random
> > mapping of CPU - MSIx. And that will be problematic in performance
run.
> >
> > As I posted earlier, latest repo will only allow us to use *18* reply
>
> Looks not see this report before, could you share us how you conclude
that?
> The only patch changing reply queue is the following one:
>
>   https://marc.info/?l=linux-block&m=151972611911593&w=2
>
> But not see any issue in this patch yet, can you recover to 72 reply
queues
> after reverting the patch in above link?
Ming -

While testing, my system went bad. I debugged further and understood that
the affinity mapping was changed due to the commit below -
84676c1f21e8ff54befe985f4f14dc1edc10046b

[PATCH] genirq/affinity: assign vectors to all possible CPUs

Because of the above change, we end up using very few reply queues. Many reply
queues on my setup were mapped to offline/not-available CPUs. This may be
the primary contributor to the odd performance impact, and it may not be truly due
to the V3/V4 patch series.

I am planning to check your V3 and V4 series after removing the above commit
(to check the performance impact).

It would be good if we spread possible CPUs (instead of online CPUs) across all irq
vectors, considering that we should have at least *one* online CPU mapped to
each vector.

>
> > queue instead of *72*.  Lots of performance related issue can be pop
> > up on different setup due to inconsistency in CPU - MSIx mapping. BTW,
> > changes in this area is intentional @" linux_4.16-rc-host-tags-v3.2".
?
>
> As you mentioned in the following link, you didn't see big performance
drop
> with linux_4.16-rc-host-tags-v3.2, right?
>
>   https://marc.info/?l=linux-block&m=151982993810092&w=2
>
>
> Thanks,
> Ming


Re: [PATCH V3 4/8] blk-mq: introduce BLK_MQ_F_HOST_TAGS

2018-03-08 Thread Ming Lei
On Thu, Mar 08, 2018 at 08:52:52AM +0100, Christoph Hellwig wrote:
> On Tue, Feb 27, 2018 at 06:07:46PM +0800, Ming Lei wrote:
> > This patch can support to partition host-wide tags to multiple hw queues,
> > so each hw queue related data structures(tags, hctx) can be accessed in
> > NUMA locality way, for example, the hw queue can be per NUMA node.
> > 
> > It is observed IOPS can be improved much in this way on null_blk test.
> 
> null_blk isn't too interesting, so some real hardware number would
> be very useful here.

An IOPS improvement of about 10~20% can be observed on scsi_debug too, which is
set up on a dual-socket system.

It needs an hpsa or megaraid_sas host with dozens of SSDs, which is
not easy for me to set up.

And Kashyap has been very cooperative in testing patches; it looks like V3 is
much better than before thanks to the per-node hw queue.

If the atomic operations on scsi_host->host_busy are removed and the
megaraid_sas IO path can be optimized a bit, we should get some improvement
from the per-node hw queue with BLK_MQ_F_HOST_TAGS on megaraid_sas.
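
For illustration only (not the actual SCSI mid-layer code), the cost being
referred to is a single shared counter that every queued and completed
command has to touch, so its cacheline bounces between all submitting CPUs:

static atomic_t example_host_busy = ATOMIC_INIT(0);

static void example_queue_command(void)
{
	atomic_inc(&example_host_busy);	/* cross-socket cacheline traffic */
	/* ... dispatch the command to the HBA ... */
}

static void example_complete_command(void)
{
	atomic_dec(&example_host_busy);	/* and again on completion */
}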

> 
> Also the documentation should be a lot less sparse.  When are we going
> to set this flag?  What help are we going to give driver authors to
> guide chosing the option?

OK, will do that in next version.

Thanks,
Ming


Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-08 Thread Ming Lei
On Thu, Mar 08, 2018 at 09:41:16AM +0100, Hannes Reinecke wrote:
> On 03/08/2018 09:15 AM, Ming Lei wrote:
> > On Thu, Mar 08, 2018 at 08:50:35AM +0100, Christoph Hellwig wrote:
> >>> +static void hpsa_setup_reply_map(struct ctlr_info *h)
> >>> +{
> >>> + const struct cpumask *mask;
> >>> + unsigned int queue, cpu;
> >>> +
> >>> + for (queue = 0; queue < h->msix_vectors; queue++) {
> >>> + mask = pci_irq_get_affinity(h->pdev, queue);
> >>> + if (!mask)
> >>> + goto fallback;
> >>> +
> >>> + for_each_cpu(cpu, mask)
> >>> + h->reply_map[cpu] = queue;
> >>> + }
> >>> + return;
> >>> +
> >>> +fallback:
> >>> + for_each_possible_cpu(cpu)
> >>> + h->reply_map[cpu] = 0;
> >>> +}
> >>
> >> It seems a little annoying that we have to duplicate this in the driver.
> >> Wouldn't this be solved by your force_blk_mq flag and relying on the
> >> hw_ctx id?
> > 
> > This issue can be solved by force_blk_mq, but may cause performance
> > regression for host-wide tagset drivers:
> > 
> > - If the whole tagset is partitioned across the hw queues, each hw queue's
> > depth may not be high enough, especially since SCSI's IO path may not be
> > efficient enough. Even though we keep each queue's depth at 256, which
> > should be high enough to exploit parallelism from the device's internal view,
> > we still can't get good performance.
> > 
> > - If the whole tagset is still shared among all hw queues, the shared
> > tags can be accessed from all CPUs, and IOPS is degraded.
> > 
> > Kashyap has tested the above two approaches; both hurt IOPS on
> > megaraid_sas.
> > 
> This is precisely the issue I have been worried about, too.
> 
> The problem is not so much the tagspace (which actually is quite small
> memory footprint-wise), but rather the _requests_ indexed by the tags.

But V1 is done in this way: one shared tag set is used and requests are
allocated for each hw queue with NUMA locality. Kashyap finally confirmed
that IOPS can be recovered to normal if iostats is set to 0 after V1 is
applied:

https://marc.info/?l=linux-scsi&m=151815231026789&w=2

That means the shared tag set does have a big effect on performance.

> 
> We have this:
> 
> struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
> unsigned int hctx_idx,
> unsigned int nr_tags,
> unsigned int reserved_tags)
> {
> struct blk_mq_tags *tags;
> int node;
> 
> node = blk_mq_hw_queue_to_node(set->mq_map, hctx_idx);
> if (node == NUMA_NO_NODE)
> node = set->numa_node;
> 
> tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
>  BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
> if (!tags)
> return NULL;
> 
> tags->rqs = kzalloc_node(nr_tags * sizeof(struct request *),
>   GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node);
> 
> 
> IE the _entire_ request set is allocated as _one_ array, making it quite
> hard to handle from the lower-level CPU caches.
> Also the 'node' indicator doesn't really help us here, as the requests
> have to be accessed by all CPUs in the shared tag case.
> 
> Would it be possible to move tags->rqs to become a _double_ pointer?
> Then we would have only a shared lookup table, but the requests
> themselves can be allocated per node, depending on the CPU map.
> _And_ it should be easier on the CPU cache ...

That is basically the same as the approach in V1, and even similar to V3, in
which the per-node hw queue is introduced; from Kashyap's test, the
performance isn't bad. I believe IOPS can eventually be improved if the
scsi_host->host_busy operation is removed from the IO path and the
megaraid_sas driver is improved, as I mentioned earlier.

Thanks,
Ming


Re: [PATCH v2 04/11] block: Protect queue flag changes with the queue lock

2018-03-08 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH v2 09/11] block: Use blk_queue_flag_*() in drivers instead of queue_flag_*()

2018-03-08 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-08 Thread Hannes Reinecke
On 03/08/2018 09:15 AM, Ming Lei wrote:
> On Thu, Mar 08, 2018 at 08:50:35AM +0100, Christoph Hellwig wrote:
>>> +static void hpsa_setup_reply_map(struct ctlr_info *h)
>>> +{
>>> +   const struct cpumask *mask;
>>> +   unsigned int queue, cpu;
>>> +
>>> +   for (queue = 0; queue < h->msix_vectors; queue++) {
>>> +   mask = pci_irq_get_affinity(h->pdev, queue);
>>> +   if (!mask)
>>> +   goto fallback;
>>> +
>>> +   for_each_cpu(cpu, mask)
>>> +   h->reply_map[cpu] = queue;
>>> +   }
>>> +   return;
>>> +
>>> +fallback:
>>> +   for_each_possible_cpu(cpu)
>>> +   h->reply_map[cpu] = 0;
>>> +}
>>
>> It seems a little annoying that we have to duplicate this in the driver.
>> Wouldn't this be solved by your force_blk_mq flag and relying on the
>> hw_ctx id?
> 
> This issue can be solved by force_blk_mq, but may cause performance
> regression for host-wide tagset drivers:
> 
> - If the whole tagset is partitioned across the hw queues, each hw queue's
> depth may not be high enough, especially since SCSI's IO path may not be
> efficient enough. Even though we keep each queue's depth at 256, which
> should be high enough to exploit parallelism from the device's internal view,
> we still can't get good performance.
> 
> - If the whole tagset is still shared among all hw queues, the shared
> tags can be accessed from all CPUs, and IOPS is degraded.
> 
> Kashyap has tested the above two approaches; both hurt IOPS on megaraid_sas.
> 
This is precisely the issue I have been worried about, too.

The problem is not so much the tagspace (which actually is quite small
memory footprint-wise), but rather the _requests_ indexed by the tags.

We have this:

struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
unsigned int hctx_idx,
unsigned int nr_tags,
unsigned int reserved_tags)
{
struct blk_mq_tags *tags;
int node;

node = blk_mq_hw_queue_to_node(set->mq_map, hctx_idx);
if (node == NUMA_NO_NODE)
node = set->numa_node;

tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
 BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
if (!tags)
return NULL;

tags->rqs = kzalloc_node(nr_tags * sizeof(struct request *),
  GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node);


IE the _entire_ request set is allocated as _one_ array, making it quite
hard to handle from the lower-level CPU caches.
Also the 'node' indicator doesn't really help us here, as the requests
have to be accessed by all CPUs in the shared tag case.

Would it be possible to move tags->rqs to become a _double_ pointer?
Then we would have only a shared lookup table, but the requests
themselves can be allocated per node, depending on the CPU map.
_And_ it should be easier on the CPU cache ...
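
Just to spell the idea out, a very rough sketch (assumed names, a fixed
per-node chunk size, error handling omitted, and ignoring the driver
payload (cmd_size) that blk-mq normally appends to each request):

static struct request **alloc_shared_rq_table(unsigned int nr_tags,
					      unsigned int rqs_per_node)
{
	struct request **rqs;
	unsigned int tag = 0;
	int node;

	/* shared lookup table, one pointer per tag */
	rqs = kcalloc(nr_tags, sizeof(*rqs), GFP_KERNEL);
	if (!rqs)
		return NULL;

	/* the requests themselves live in per-node arrays */
	for_each_online_node(node) {
		struct request *chunk;
		unsigned int i;

		chunk = kzalloc_node(rqs_per_node * sizeof(*chunk),
				     GFP_KERNEL, node);
		if (!chunk)
			break;	/* error handling omitted in this sketch */

		for (i = 0; i < rqs_per_node && tag < nr_tags; i++)
			rqs[tag++] = &chunk[i];
	}
	return rqs;
}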

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH] block: bio_check_eod() needs to consider partition

2018-03-08 Thread Jiufei Xue
Hi Christoph,

On 2018/3/8 3:46 PM, Christoph Hellwig wrote:
> bio_check_eod() should check partition size not the whole disk if
> bio->bi_partno is non-zero.
> 
I think the check should be done twice if bio->bi_partno is not zero:
once for the partition, and again for the whole disk after the remap, which
was added in commit 5ddfe9691c91
("md: check bio address after mapping through partitions").

Thanks,
Jiufei

> Based on an earlier patch from Jiufei Xue.
> 
> Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and 
> partitions index")
> Reported-by: Jiufei Xue 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/blk-core.c | 76 
> ++--
>  1 file changed, 30 insertions(+), 46 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3ba4326a63b5..ba07f970e011 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2009,7 +2009,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
> struct bio *bio)
>   return BLK_QC_T_NONE;
>  }
>  
> -static void handle_bad_sector(struct bio *bio)
> +static void handle_bad_sector(struct bio *bio, sector_t maxsector)
>  {
>   char b[BDEVNAME_SIZE];
>  
> @@ -2017,7 +2017,7 @@ static void handle_bad_sector(struct bio *bio)
>   printk(KERN_INFO "%s: rw=%d, want=%Lu, limit=%Lu\n",
>   bio_devname(bio, b), bio->bi_opf,
>   (unsigned long long)bio_end_sector(bio),
> - (long long)get_capacity(bio->bi_disk));
> + (long long)maxsector);
>  }
>  
>  #ifdef CONFIG_FAIL_MAKE_REQUEST
> @@ -2060,57 +2060,47 @@ static inline bool should_fail_request(struct 
> hd_struct *part,
>   */
>  static inline int blk_partition_remap(struct bio *bio)
>  {
> - struct hd_struct *p;
> - int ret = 0;
> + sector_t maxsector = get_capacity(bio->bi_disk);
> + int nr_sectors = bio_sectors(bio);
>  
>   /*
>* Zone reset does not include bi_size so bio_sectors() is always 0.
>* Include a test for the reset op code and perform the remap if needed.
>*/
> - if (!bio->bi_partno ||
> - (!bio_sectors(bio) && bio_op(bio) != REQ_OP_ZONE_RESET))
> + if (!nr_sectors && bio_op(bio) != REQ_OP_ZONE_RESET)
>   return 0;
>  
> - rcu_read_lock();
> - p = __disk_get_part(bio->bi_disk, bio->bi_partno);
> - if (likely(p && !should_fail_request(p, bio->bi_iter.bi_size))) {
> + if (bio->bi_partno) {
> + struct hd_struct *p;
> +
> + rcu_read_lock();
> + p = __disk_get_part(bio->bi_disk, bio->bi_partno);
> + if (unlikely(!p ||
> +  should_fail_request(p, bio->bi_iter.bi_size))) {
> + rcu_read_unlock();
> + pr_info("%s: failing request for partition %d\n",
> + __func__, bio->bi_partno);
> + return -EIO;
> + }
> +
>   bio->bi_iter.bi_sector += p->start_sect;
>   bio->bi_partno = 0;
>   trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p),
>   bio->bi_iter.bi_sector - p->start_sect);
> - } else {
> - printk("%s: fail for partition %d\n", __func__, bio->bi_partno);
> - ret = -EIO;
> + maxsector = part_nr_sects_read(p);
> + rcu_read_unlock();
>   }
> - rcu_read_unlock();
>  
> - return ret;
> -}
> -
> -/*
> - * Check whether this bio extends beyond the end of the device.
> - */
> -static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
> -{
> - sector_t maxsector;
> -
> - if (!nr_sectors)
> - return 0;
> -
> - /* Test device or partition size, when known. */
> - maxsector = get_capacity(bio->bi_disk);
> - if (maxsector) {
> - sector_t sector = bio->bi_iter.bi_sector;
> -
> - if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
> - /*
> -  * This may well happen - the kernel calls bread()
> -  * without checking the size of the device, e.g., when
> -  * mounting a device.
> -  */
> - handle_bad_sector(bio);
> - return 1;
> - }
> + /*
> +  * Check whether this bio extends beyond the end of the device or
> +  * partition.  This may well happen - the kernel calls bread() without
> +  * checking the size of the device, e.g., when mounting a file system.
> +  */
> + if (nr_sectors && maxsector &&
> + (nr_sectors > maxsector ||
> +  bio->bi_iter.bi_sector > maxsector - nr_sectors)) {
> + handle_bad_sector(bio, maxsector);
> + return -EIO;
>   }
>  
>   return 0;
> @@ -2126,9 +2116,6 @@ 

Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-08 Thread Ming Lei
On Thu, Mar 08, 2018 at 08:50:35AM +0100, Christoph Hellwig wrote:
> > +static void hpsa_setup_reply_map(struct ctlr_info *h)
> > +{
> > +   const struct cpumask *mask;
> > +   unsigned int queue, cpu;
> > +
> > +   for (queue = 0; queue < h->msix_vectors; queue++) {
> > +   mask = pci_irq_get_affinity(h->pdev, queue);
> > +   if (!mask)
> > +   goto fallback;
> > +
> > +   for_each_cpu(cpu, mask)
> > +   h->reply_map[cpu] = queue;
> > +   }
> > +   return;
> > +
> > +fallback:
> > +   for_each_possible_cpu(cpu)
> > +   h->reply_map[cpu] = 0;
> > +}
> 
> It seems a little annoying that we have to duplicate this in the driver.
> Wouldn't this be solved by your force_blk_mq flag and relying on the
> hw_ctx id?

This issue can be solved by force_blk_mq, but may cause performance
regression for host-wide tagset drivers:

- If the whole tagset is partitioned across the hw queues, each hw queue's
depth may not be high enough, especially since SCSI's IO path may not be
efficient enough. Even though we keep each queue's depth at 256, which
should be high enough to exploit parallelism from the device's internal view,
we still can't get good performance.

- If the whole tagset is still shared among all hw queues, the shared
tags can be accessed from all CPUs, and IOPS is degraded.

Kashyap has tested the above two approaches; both hurt IOPS on megaraid_sas.
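
For completeness, the reply_map built in the patch is meant to be consumed
on the submission side roughly like this (hypothetical helper name; in the
patch the lookup happens where the command's reply queue is picked):

static u32 hpsa_pick_reply_queue(struct ctlr_info *h)
{
	/* reply queue whose irq affinity covers the submitting CPU (queue 0 as fallback) */
	return h->reply_map[raw_smp_processor_id()];
}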


thanks,
Ming