Re: [PATCH v2] block: BFQ default for single queue devices

2018-11-02 Thread Oleksandr Natalenko

Hi.

On 16.10.2018 19:35, Jens Axboe wrote:

Do you have anything more recent? All of these predate the current
code (by a lot), and aren't even mq. I'm mostly just interested in a
plain fast NVMe device, and a big-box hardware RAID setup with
a ton of drives.

I do still think that this should be going through the distros, they
need to be the ones driving this, as they will ultimately be the
ones getting customer reports on regressions. The qual/test cycle
they do is useful for this. In mainline, if we make a change like
this, we'll figure out if it worked many releases down the line.


Some benchmarks here for a non-RAID setup, obtained with the S benchmark 
suite. They are from a Lenovo T460s with a Samsung MZNTY256HDHP-000L7 SSD. 
The v4.19 kernel is running with all recent BFQ patches applied.


# replayed gnome terminal startup throughput (MB/s)
# Workload        bfq      mq-deadline
  0r-raw_seq      13.2617  13.4867
  10r-raw_seq     512.507  539.95

# replayed gnome terminal startup time (seconds)
# Workload        bfq      mq-deadline
  0r-raw_seq      0.43     0.4
  10r-raw_seq     0.685    4.1625

# replayed lowriter startup throughput (MB/s)
# Workload        bfq      mq-deadline
  0r-raw_seq      9.985    10.375
  10r-raw_seq     516.62   539.61

# replayed lowriter startup time (seconds)
# Workload        bfq      mq-deadline
  0r-raw_seq      0.4      0.3875
  10r-raw_seq     0.535    2.3875

# replayed xterm startup throughput (MB/s)
# Workload        bfq      mq-deadline
  0r-raw_seq      5.93833  6.10834
  10r-raw_seq     524.447  539.991

# replayed xterm startup time (seconds)
# Workload        bfq      mq-deadline
  0r-raw_seq      0.23     0.23
  10r-raw_seq     0.38     1.56

# throughput (MB/s)
# Workload        bfq      mq-deadline
  10r-raw_rand    362.446  363.817
  10r-raw_seq     537.646  540.609
  1r-raw_seq      500.733  502.526

Throughput-wise, BFQ is on par with mq-deadline. Latency-wise, BFQ is 
much, much better: for example, with ten background readers the replayed 
xterm startup takes 0.38 s under BFQ versus 1.56 s under mq-deadline.


--
  Oleksandr Natalenko (post-factum)


Re: [GIT PULL] nvme fixes for 4.20

2018-11-02 Thread Jens Axboe
On 11/2/18 12:37 AM, Christoph Hellwig wrote:
> The following changes since commit a5185607787e030fcb0009194d3b12f8bcca59d6:
> 
>   block: brd: associate with queue until adding disk (2018-10-31 08:43:09 
> -0600)
> 
> are available in the Git repository at:
> 
>   git://git.infradead.org/nvme.git nvme-4.20
> 
> for you to fetch changes up to ae172db3b3f389c363ec7f3683b2cad41091580d:
> 
>   nvme-pci: fix conflicting p2p resource adds (2018-11-01 08:44:47 +0200)
> 
> 
> James Smart (1):
>   nvme-fc: fix request private initialization
> 
> Keith Busch (1):
>   nvme-pci: fix conflicting p2p resource adds
> 
>  drivers/nvme/host/fc.c  | 2 +-
>  drivers/nvme/host/pci.c | 5 -
>  2 files changed, 5 insertions(+), 2 deletions(-)

Applied these manually, since I had rebased for-linus yesterday to drop
a buggy patch from Ming. JFYI.

-- 
Jens Axboe



[PATCH 1/4] Revert "irq: add support for allocating (and affinitizing) sets of IRQs"

2018-11-02 Thread Ming Lei
This reverts commit 1d44f6f43e229ca06bf680aa7eb5ad380eaa5d72.
---
 drivers/pci/msi.c | 14 --
 include/linux/interrupt.h |  4 
 kernel/irq/affinity.c | 40 +---
 3 files changed, 9 insertions(+), 49 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 265ed3e4c920..af24ed50a245 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1036,13 +1036,6 @@ static int __pci_enable_msi_range(struct pci_dev *dev, 
int minvec, int maxvec,
if (maxvec < minvec)
return -ERANGE;
 
-   /*
-* If the caller is passing in sets, we can't support a range of
-* vectors. The caller needs to handle that.
-*/
-   if (affd && affd->nr_sets && minvec != maxvec)
-   return -EINVAL;
-
if (WARN_ON_ONCE(dev->msi_enabled))
return -EINVAL;
 
@@ -1094,13 +1087,6 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
if (maxvec < minvec)
return -ERANGE;
 
-   /*
-* If the caller is passing in sets, we can't support a range of
-* supported vectors. The caller needs to handle that.
-*/
-   if (affd && affd->nr_sets && minvec != maxvec)
-   return -EINVAL;
-
if (WARN_ON_ONCE(dev->msix_enabled))
return -EINVAL;
 
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index ca397ff40836..1d6711c28271 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -247,14 +247,10 @@ struct irq_affinity_notify {
  * the MSI(-X) vector space
  * @post_vectors:  Don't apply affinity to @post_vectors at end of
  * the MSI(-X) vector space
- * @nr_sets:   Length of passed in *sets array
- * @sets:  Number of affinitized sets
  */
 struct irq_affinity {
int pre_vectors;
int post_vectors;
-   int nr_sets;
-   int *sets;
 };
 
 #if defined(CONFIG_SMP)
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 2046a0f0f0f1..f4f29b9d90ee 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -180,7 +180,6 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
int curvec, usedvecs;
cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
struct cpumask *masks = NULL;
-   int i, nr_sets;
 
/*
 * If there aren't any vectors left after applying the pre/post
@@ -211,23 +210,10 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
 
-   /*
-* Spread on present CPUs starting from affd->pre_vectors. If we
-* have multiple sets, build each sets affinity mask separately.
-*/
-   nr_sets = affd->nr_sets;
-   if (!nr_sets)
-   nr_sets = 1;
-
-   for (i = 0, usedvecs = 0; i < nr_sets; i++) {
-   int this_vecs = affd->sets ? affd->sets[i] : affvecs;
-   int nr;
-
-   nr = irq_build_affinity_masks(affd, curvec, this_vecs,
- node_to_cpumask, cpu_present_mask,
- nmsk, masks + usedvecs);
-   usedvecs += nr;
-   }
+   /* Spread on present CPUs starting from affd->pre_vectors */
+   usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
+   node_to_cpumask, cpu_present_mask,
+   nmsk, masks);
 
/*
 * Spread on non present CPUs starting from the next vector to be
@@ -272,21 +258,13 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, 
const struct irq_affinity
 {
int resv = affd->pre_vectors + affd->post_vectors;
int vecs = maxvec - resv;
-   int set_vecs;
+   int ret;
 
if (resv > minvec)
return 0;
 
-   if (affd->nr_sets) {
-   int i;
-
-   for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
-   set_vecs += affd->sets[i];
-   } else {
-   get_online_cpus();
-   set_vecs = cpumask_weight(cpu_possible_mask);
-   put_online_cpus();
-   }
-
-   return resv + min(set_vecs, vecs);
+   get_online_cpus();
+   ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
+   put_online_cpus();
+   return ret;
 }
-- 
2.9.5



[GIT PULL] Final block merge window changes/fixes

2018-11-02 Thread Jens Axboe
Hi Linus,

The biggest part of this pull request is the revert of the blkcg cleanup
series. It had one fix earlier for a stacked device issue, but another
one was reported. Rather than play whack-a-mole with this, revert the
entire series and try again for the next kernel release.

Apart from that, only small fixes/changes. This pull request contains:

- Indentation fixup for mtip32xx (Colin Ian King)

- The blkcg cleanup series revert (Dennis Zhou)

- Two NVMe fixes. One fixing a regression in the nvme request
  initialization in this merge window, causing nvme-fc to not work. The
  other is a suspend/resume p2p resource issue (James, Keith)

- Fix sg discard merge, allowing us to merge in cases where we didn't
  before (Jianchao Wang)

- Call rq_qos_exit() after the queue is frozen, preventing a hang (Ming)

- Fix brd queue setup, fixing an oops if we fail setting up all devices
  (Ming)

Please pull!


  git://git.kernel.dk/linux-block.git tags/for-linus-20181102



Colin Ian King (1):
  mtip32xx: clean an indentation issue, remove extraneous tabs

Dennis Zhou (1):
  blkcg: revert blkcg cleanups series

James Smart (1):
  nvme-fc: fix request private initialization

Jianchao Wang (1):
  block: fix the DISCARD request merge

Keith Busch (1):
  nvme-pci: fix conflicting p2p resource adds

Ming Lei (2):
  block: call rq_qos_exit() after queue is frozen
  block: brd: associate with queue until adding disk

 Documentation/admin-guide/cgroup-v2.rst |   8 +-
 block/bfq-cgroup.c  |   4 +-
 block/bfq-iosched.c |   2 +-
 block/bio.c | 174 +---
 block/blk-cgroup.c  | 123 +++---
 block/blk-core.c|   4 +-
 block/blk-iolatency.c   |  26 -
 block/blk-merge.c   |  46 +++--
 block/blk-sysfs.c   |   2 -
 block/blk-throttle.c|  13 ++-
 block/bounce.c  |   4 +-
 block/cfq-iosched.c |   4 +-
 drivers/block/brd.c |  16 ++-
 drivers/block/loop.c|   5 +-
 drivers/block/mtip32xx/mtip32xx.c   |   4 +-
 drivers/md/raid0.c  |   2 +-
 drivers/nvme/host/fc.c  |   2 +-
 drivers/nvme/host/pci.c |   5 +-
 fs/buffer.c |  10 +-
 fs/ext4/page-io.c   |   2 +-
 include/linux/bio.h |  26 ++---
 include/linux/blk-cgroup.h  | 145 +-
 include/linux/blk_types.h   |   1 +
 include/linux/cgroup.h  |   2 -
 include/linux/writeback.h   |   5 +-
 kernel/cgroup/cgroup.c  |  48 ++---
 kernel/trace/blktrace.c |   4 +-
 mm/page_io.c|   2 +-
 28 files changed, 265 insertions(+), 424 deletions(-)

-- 
Jens Axboe



Re: recent issues with heavy delete's causing soft lockups

2018-11-02 Thread Thomas Fjellstrom
On Saturday, October 27, 2018 1:20:10 PM MDT Jens Axboe wrote:
> On Oct 27, 2018, at 12:40 PM, Thomas Fjellstrom  
wrote:
> > Hi
[snip explanation of problem]
> 
> Can you try 4.19? A patch went in since 4.18 that fixes a starvation issue
> around requeue conditions, which SATA is the one to most often hit.

Gave it a shot with the vanilla kernel from git linux-stable/v4.19. It was a 
bit of a pain, as the amdgpu driver seems to be broken for my R9 390 on many 
kernels, including 4.19. I had to reconfigure to the radeon driver, which I must 
say seems to work a lot better than it used to.

At any rate, it doesn't seem to have helped a lot so far. I did end up adding 
"scsi_mod.use_blk_mq=0 dm_mod.use_blk_mq=0" to the default kernel boot command 
line in GRUB. It seems to have helped a little, but I haven't tested fully 
with a full delete of the build directory; I haven't had time to sit and wait 
the 40+ minutes it takes to rebuild the entire thing. And I'm low enough on 
disk space that I can't easily make a copy of the 109GB build folder; I've got 
about 25GB free out of 780GB. I'll try to test some more soon.

> Jens


-- 
Thomas Fjellstrom
tho...@fjellstrom.ca





Re: [GIT PULL] Final block merge window changes/fixes

2018-11-02 Thread Linus Torvalds
On Fri, Nov 2, 2018 at 10:08 AM Jens Axboe  wrote:
>
> The biggest part of this pull request is the revert of the blkcg cleanup
> series. It had one fix earlier for a stacked device issue, but another
> one was reported. Rather than play whack-a-mole with this, revert the
> entire series and try again for the next kernel release.
>
> Apart from that, only small fixes/changes.

Pulled,

  Linus


Re: recent issues with heavy delete's causing soft lockups

2018-11-02 Thread Thomas Fjellstrom
On Saturday, October 27, 2018 1:20:10 PM MDT Jens Axboe wrote:
> On Oct 27, 2018, at 12:40 PM, Thomas Fjellstrom  
[snip]
> 
> Can you try 4.19? A patch went in since 4.18 that fixes a starvation issue
> around requeue conditions, which SATA is the one to most often hit.
> 
> Jens

I just had to do a clean, and I have the mq kernel options I mentioned in my 
previous mail enabled (mq should be disabled), and it appears to still be 
causing issues. The current I/O scheduler appears to be cfq, and that "make 
clean" took about 4 minutes; a lot of that time was spent with Plasma, IntelliJ, 
and Chrome all starved of I/O.

I did switch to a terminal and checked iostat -d 1, and it showed very little 
actual I/O for the time I was looking at it.

I have no idea what's going on.

-- 
Thomas Fjellstrom
tho...@fjellstrom.ca





Re: recent issues with heavy delete's causing soft lockups

2018-11-02 Thread Jens Axboe
On 11/2/18 2:32 PM, Thomas Fjellstrom wrote:
> On Saturday, October 27, 2018 1:20:10 PM MDT Jens Axboe wrote:
>> On Oct 27, 2018, at 12:40 PM, Thomas Fjellstrom  
> [snip]
>>
>> Can you try 4.19? A patch went in since 4.18 that fixes a starvation issue
>> around requeue conditions, which SATA is the one to most often hit.
>>
>> Jens
> 
> I just had to do a clean, and I have the mq kernel options I mentioned in my 
> previous mail enabled. (mq should be disabled) and it appears to still be 
> causing issues. current io scheduler appears to be cfq, and it took that 
> "make 
> clean" about 4 minutes, a lot of that time was spent with plasma, intelij, 
> and 
> chrome all starved of IO. 
> 
> I did switch to a terminal and checked iostat -d 1, and it showed very little 
> actual io for the time I was looking at it.
> 
> I have no idea what's going on.

If you're using cfq, then it's not using mq at all. Maybe do something a la:

# perf record -ag -- sleep 10

while the slowdown is happening, and then do perf report -g --no-children and
see if that yields anything interesting. It sounds like time is being spent
elsewhere and you aren't actually waiting on I/O.

-- 
Jens Axboe



[PATCH V4 0/5] blk-mq: refactor and fix on issue request directly

2018-11-02 Thread Jianchao Wang
Hi Jens

This patch set refactors the code for issuing requests directly and
fixes some defects.

The 1st patch makes __blk_mq_issue_directly able to accept a NULL cookie
pointer.

The 2nd patch refactors the code for issuing requests directly.
It merges blk_mq_try_issue_directly and __blk_mq_try_issue_directly
into one function and makes it handle the return value of .queue_rq itself.

The 3rd patch lets the requests be inserted into hctx->dispatch when
the queue is stopped or quiesced and bypass is true.

The 4th patch makes blk_mq_sched_insert_requests issue requests directly
with 'bypass' false, so it no longer needs to handle the non-issued
requests itself.

The 5th patch ensures the hctx is run on its mapped CPU in the
issue-directly path.

V4:
 - split the original patch 1 into two patches (now the 1st and 2nd patch)
 - rename mq_decision to mq_issue_decision
 - comment changes

V3:
 - Correct the code for the case where bypass_insert is true and an io
   scheduler is attached. The request still needs to be issued in that
   case. (1/4)
 - Refactor the code to make it clearer. blk_mq_make_request is introduced
   to decide whether to insert, end, or just return, based on the return
   value of .queue_rq and bypass_insert. (1/4)
 - Add the 2nd patch. It introduces a new decision result which indicates
   the request should be inserted with blk_mq_request_bypass_insert.
 - Modify the code to adapt to the new patch 1.

V2:
 - Add 1st and 2nd patch
Jianchao Wang (5):
blk-mq: make __blk_mq_issue_directly be able to accept NULL cookie pointer
blk-mq: refactor the code of issue request directly
blk-mq: fix issue directly case when q is stopped or quiesced
blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests
blk-mq: ensure hctx to be ran on mapped cpu when issue directly

 block/blk-mq-sched.c |   8 ++-
 block/blk-mq.c   | 149 ++-
 2 files changed, 92 insertions(+), 65 deletions(-)

Thanks
Jianchao


[PATCH V4 5/5] blk-mq: ensure hctx to be ran on mapped cpu when issue directly

2018-11-02 Thread Jianchao Wang
When a request is issued directly and the task is migrated off the
original CPU where it allocated the request, the hctx could be run on
a CPU to which it is not mapped. To fix this, insert the request if
BLK_MQ_F_BLOCKING is set; otherwise, check whether the current CPU is
mapped to the hctx and invoke __blk_mq_issue_directly with preemption
disabled.

Signed-off-by: Jianchao Wang 
---
 block/blk-mq.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bf8b144..4450eb6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1771,6 +1771,17 @@ static blk_status_t blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
enum mq_issue_decision dec;
int srcu_idx;
 
+   if (hctx->flags & BLK_MQ_F_BLOCKING) {
+   force = true;
+   goto out;
+   }
+
+   if (!cpumask_test_cpu(get_cpu(), hctx->cpumask)) {
+   put_cpu();
+   force = true;
+   goto out;
+   }
+
hctx_lock(hctx, &srcu_idx);
 
/*
@@ -1801,7 +1812,8 @@ static blk_status_t blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
 out_unlock:
hctx_unlock(hctx, srcu_idx);
-
+   put_cpu();
+out:
dec = blk_mq_make_dicision(ret, bypass, force);
switch(dec) {
case MQ_ISSUE_INSERT_QUEUE:
-- 
2.7.4



[PATCH V4 2/5] blk-mq: refactor the code of issue request directly

2018-11-02 Thread Jianchao Wang
Merge blk_mq_try_issue_directly and __blk_mq_try_issue_directly
into one interface which is able to handle the return value from the
.queue_rq callback. To make the code clearer, introduce a new enum,
mq_issue_decision, and a helper, blk_mq_make_decision, to decide how
to handle the non-issued requests.

Signed-off-by: Jianchao Wang 
---
 block/blk-mq.c | 104 +
 1 file changed, 61 insertions(+), 43 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index af5b591..962fdfc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1729,78 +1729,96 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
return ret;
 }
 
-static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+enum mq_issue_decision {
+   MQ_ISSUE_INSERT_QUEUE,
+   MQ_ISSUE_END_REQUEST,
+   MQ_ISSUE_DO_NOTHING,
+};
+
+static inline enum mq_issue_decision
+   blk_mq_make_dicision(blk_status_t ret, bool bypass)
+{
+   enum mq_issue_decision dec;
+
+   switch(ret) {
+   case BLK_STS_OK:
+   dec = MQ_ISSUE_DO_NOTHING;
+   break;
+   case BLK_STS_DEV_RESOURCE:
+   case BLK_STS_RESOURCE:
+   dec = bypass ? MQ_ISSUE_DO_NOTHING : MQ_ISSUE_INSERT_QUEUE;
+   break;
+   default:
+   dec = bypass ? MQ_ISSUE_DO_NOTHING : MQ_ISSUE_END_REQUEST;
+   break;
+   }
+
+   return dec;
+}
+
+static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
blk_qc_t *cookie,
-   bool bypass_insert)
+   bool bypass)
 {
struct request_queue *q = rq->q;
bool run_queue = true;
+   blk_status_t ret = BLK_STS_RESOURCE;
+   enum mq_issue_decision dec;
+   int srcu_idx;
+
+   hctx_lock(hctx, &srcu_idx);
 
/*
-* RCU or SRCU read lock is needed before checking quiesced flag.
+* hctx_lock is needed before checking quiesced flag.
 *
-* When queue is stopped or quiesced, ignore 'bypass_insert' from
-* blk_mq_request_issue_directly(), and return BLK_STS_OK to caller,
-* and avoid driver to try to dispatch again.
+* When queue is stopped or quiesced, ignore 'bypass', insert
+* and return BLK_STS_OK to caller, and avoid driver to try to
+* dispatch again.
 */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
run_queue = false;
-   bypass_insert = false;
-   goto insert;
+   bypass = false;
+   goto out_unlock;
}
 
-   if (q->elevator && !bypass_insert)
-   goto insert;
+   if (q->elevator && !bypass)
+   goto out_unlock;
 
if (!blk_mq_get_dispatch_budget(hctx))
-   goto insert;
+   goto out_unlock;
 
if (!blk_mq_get_driver_tag(rq)) {
blk_mq_put_dispatch_budget(hctx);
-   goto insert;
+   goto out_unlock;
}
 
-   return __blk_mq_issue_directly(hctx, rq, cookie);
-insert:
-   if (bypass_insert)
-   return BLK_STS_RESOURCE;
+   ret = __blk_mq_issue_directly(hctx, rq, cookie);
 
-   blk_mq_sched_insert_request(rq, false, run_queue, false);
-   return BLK_STS_OK;
-}
-
-static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq, blk_qc_t *cookie)
-{
-   blk_status_t ret;
-   int srcu_idx;
-
-   might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
-
-   hctx_lock(hctx, &srcu_idx);
+out_unlock:
+   hctx_unlock(hctx, srcu_idx);
 
-   ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
-   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
-   blk_mq_sched_insert_request(rq, false, true, false);
-   else if (ret != BLK_STS_OK)
+   dec = blk_mq_make_dicision(ret, bypass);
+   switch(dec) {
+   case MQ_ISSUE_INSERT_QUEUE:
+   blk_mq_sched_insert_request(rq, false, run_queue, false);
+   break;
+   case MQ_ISSUE_END_REQUEST:
blk_mq_end_request(rq, ret);
+   break;
+   default:
+   return ret;
+   }
 
-   hctx_unlock(hctx, srcu_idx);
+   return BLK_STS_OK;
 }
 
 blk_status_t blk_mq_request_issue_directly(struct request *rq)
 {
-   blk_status_t ret;
-   int srcu_idx;
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
 
-   hctx_lock(hctx, &srcu_idx);
-   ret = __blk_mq_try_issue_directly(hctx, rq, NULL, true);
-   hctx_unlock(hctx, srcu_idx);
-
-   return ret;
+   return blk_mq_try_issue_directly(hctx, rq, NULL, true);
 }
 
 void blk_mq_try

[PATCH V4 1/5] blk-mq: make __blk_mq_issue_directly be able to accept NULL cookie pointer

2018-11-02 Thread Jianchao Wang
Make __blk_mq_issue_directly be able to accept a NULL cookie pointer
and remove the dummy unused_cookie in blk_mq_request_issue_directly.

Signed-off-by: Jianchao Wang 
---
 block/blk-mq.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index dcf10e3..af5b591 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1700,8 +1700,6 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
blk_qc_t new_cookie;
blk_status_t ret;
 
-   new_cookie = request_to_qc_t(hctx, rq);
-
/*
 * For OK queue, we are done. For error, caller may kill it.
 * Any other error (busy), just add it to our list as we
@@ -1711,19 +1709,23 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
switch (ret) {
case BLK_STS_OK:
blk_mq_update_dispatch_busy(hctx, false);
-   *cookie = new_cookie;
+   new_cookie = request_to_qc_t(hctx, rq);
break;
case BLK_STS_RESOURCE:
case BLK_STS_DEV_RESOURCE:
blk_mq_update_dispatch_busy(hctx, true);
__blk_mq_requeue_request(rq);
+   new_cookie = BLK_QC_T_NONE;
break;
default:
blk_mq_update_dispatch_busy(hctx, false);
-   *cookie = BLK_QC_T_NONE;
+   new_cookie = BLK_QC_T_NONE;
break;
}
 
+   if (cookie)
+   *cookie = new_cookie;
+
return ret;
 }
 
@@ -1791,12 +1793,11 @@ blk_status_t blk_mq_request_issue_directly(struct 
request *rq)
 {
blk_status_t ret;
int srcu_idx;
-   blk_qc_t unused_cookie;
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
 
hctx_lock(hctx, &srcu_idx);
-   ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true);
+   ret = __blk_mq_try_issue_directly(hctx, rq, NULL, true);
hctx_unlock(hctx, srcu_idx);
 
return ret;
-- 
2.7.4



[PATCH V4 3/5] blk-mq: fix issue directly case when q is stopped or quiesced

2018-11-02 Thread Jianchao Wang
When trying to issue a request directly, if the queue is stopped or
quiesced, 'bypass' will be ignored and BLK_STS_OK returned to the
caller so that it does not dispatch the request again. The request
will then be inserted with blk_mq_sched_insert_request. This is not
correct for the dm-rq case, where we should avoid passing through the
underlying path's io scheduler.

To fix it, add a new mq_issue_decision entry, MQ_ISSUE_INSERT_DISPATCH,
for the above case, where the request needs to be inserted forcibly,
and use blk_mq_request_bypass_insert to insert the request into
hctx->dispatch directly.

Signed-off-by: Jianchao Wang 
---
 block/blk-mq.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 962fdfc..a0b9b6c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1731,12 +1731,13 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
 enum mq_issue_decision {
MQ_ISSUE_INSERT_QUEUE,
+   MQ_ISSUE_INSERT_DISPATCH,
MQ_ISSUE_END_REQUEST,
MQ_ISSUE_DO_NOTHING,
 };
 
 static inline enum mq_issue_decision
-   blk_mq_make_dicision(blk_status_t ret, bool bypass)
+   blk_mq_make_dicision(blk_status_t ret, bool bypass, bool force)
 {
enum mq_issue_decision dec;
 
@@ -1746,7 +1747,10 @@ static inline enum mq_issue_decision
break;
case BLK_STS_DEV_RESOURCE:
case BLK_STS_RESOURCE:
-   dec = bypass ? MQ_ISSUE_DO_NOTHING : MQ_ISSUE_INSERT_QUEUE;
+   if (force)
+   dec = bypass ? MQ_ISSUE_INSERT_DISPATCH : 
MQ_ISSUE_INSERT_QUEUE;
+   else
+   dec = bypass ? MQ_ISSUE_DO_NOTHING : 
MQ_ISSUE_INSERT_QUEUE;
break;
default:
dec = bypass ? MQ_ISSUE_DO_NOTHING : MQ_ISSUE_END_REQUEST;
@@ -1762,7 +1766,7 @@ static blk_status_t blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
bool bypass)
 {
struct request_queue *q = rq->q;
-   bool run_queue = true;
+   bool run_queue = true, force = false;
blk_status_t ret = BLK_STS_RESOURCE;
enum mq_issue_decision dec;
int srcu_idx;
@@ -1778,7 +1782,7 @@ static blk_status_t blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
run_queue = false;
-   bypass = false;
+   force = true;
goto out_unlock;
}
 
@@ -1798,11 +1802,14 @@ static blk_status_t blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 out_unlock:
hctx_unlock(hctx, srcu_idx);
 
-   dec = blk_mq_make_dicision(ret, bypass);
+   dec = blk_mq_make_dicision(ret, bypass, force);
switch(dec) {
case MQ_ISSUE_INSERT_QUEUE:
blk_mq_sched_insert_request(rq, false, run_queue, false);
break;
+   case MQ_ISSUE_INSERT_DISPATCH:
+   blk_mq_request_bypass_insert(rq, run_queue);
+   break;
case MQ_ISSUE_END_REQUEST:
blk_mq_end_request(rq, ret);
break;
-- 
2.7.4



[PATCH V4 4/5] blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests

2018-11-02 Thread Jianchao Wang
It is not necessary to issue requests directly with bypass 'true'
in blk_mq_sched_insert_requests and then insert the non-issued requests
itself. Just set bypass to 'false' and let blk_mq_try_issue_directly
handle them entirely.

Signed-off-by: Jianchao Wang 
---
 block/blk-mq-sched.c |  8 +++-
 block/blk-mq.c   | 11 +--
 2 files changed, 4 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 29bfe80..23cd97e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -411,12 +411,10 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
 * busy in case of 'none' scheduler, and this way may save
 * us one extra enqueue & dequeue to sw queue.
 */
-   if (!hctx->dispatch_busy && !e && !run_queue_async) {
+   if (!hctx->dispatch_busy && !e && !run_queue_async)
blk_mq_try_issue_list_directly(hctx, list);
-   if (list_empty(list))
-   return;
-   }
-   blk_mq_insert_requests(hctx, ctx, list);
+   else
+   blk_mq_insert_requests(hctx, ctx, list);
}
 
blk_mq_run_hw_queue(hctx, run_queue_async);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a0b9b6c..bf8b144 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1832,20 +1832,11 @@ void blk_mq_try_issue_list_directly(struct 
blk_mq_hw_ctx *hctx,
struct list_head *list)
 {
while (!list_empty(list)) {
-   blk_status_t ret;
struct request *rq = list_first_entry(list, struct request,
queuelist);
 
list_del_init(&rq->queuelist);
-   ret = blk_mq_request_issue_directly(rq);
-   if (ret != BLK_STS_OK) {
-   if (ret == BLK_STS_RESOURCE ||
-   ret == BLK_STS_DEV_RESOURCE) {
-   list_add(&rq->queuelist, list);
-   break;
-   }
-   blk_mq_end_request(rq, ret);
-   }
+   blk_mq_try_issue_directly(hctx, rq, NULL, false);
}
 }
 
-- 
2.7.4



Re: [PATCH 13/16] irq: add support for allocating (and affinitizing) sets of IRQs

2018-11-02 Thread Ming Lei
On Tue, Oct 30, 2018 at 12:32:49PM -0600, Jens Axboe wrote:
> A driver may have a need to allocate multiple sets of MSI/MSI-X
> interrupts, and have them appropriately affinitized. Add support for
> defining a number of sets in the irq_affinity structure, of varying
> sizes, and get each set affinitized correctly across the machine.
> 
> Cc: Thomas Gleixner 
> Cc: linux-ker...@vger.kernel.org
> Reviewed-by: Hannes Reinecke 
> Reviewed-by: Ming Lei 
> Signed-off-by: Jens Axboe 
> ---
>  drivers/pci/msi.c | 14 ++
>  include/linux/interrupt.h |  4 
>  kernel/irq/affinity.c | 40 ++-
>  3 files changed, 49 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index af24ed50a245..e6c6e10b9ceb 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1036,6 +1036,13 @@ static int __pci_enable_msi_range(struct pci_dev *dev, 
> int minvec, int maxvec,
>   if (maxvec < minvec)
>   return -ERANGE;
>  
> + /*
> +  * If the caller is passing in sets, we can't support a range of
> +  * vectors. The caller needs to handle that.
> +  */
> + if (affd->nr_sets && minvec != maxvec)
> + return -EINVAL;
> +
>   if (WARN_ON_ONCE(dev->msi_enabled))
>   return -EINVAL;
>  
> @@ -1087,6 +1094,13 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>   if (maxvec < minvec)
>   return -ERANGE;
>  
> + /*
> +  * If the caller is passing in sets, we can't support a range of
> +  * supported vectors. The caller needs to handle that.
> +  */
> + if (affd->nr_sets && minvec != maxvec)
> + return -EINVAL;
> +
>   if (WARN_ON_ONCE(dev->msix_enabled))
>   return -EINVAL;
>  
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 1d6711c28271..ca397ff40836 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -247,10 +247,14 @@ struct irq_affinity_notify {
>   *   the MSI(-X) vector space
>   * @post_vectors:Don't apply affinity to @post_vectors at end of
>   *   the MSI(-X) vector space
> + * @nr_sets: Length of passed in *sets array
> + * @sets:Number of affinitized sets
>   */
>  struct irq_affinity {
>   int pre_vectors;
>   int post_vectors;
> + int nr_sets;
> + int *sets;
>  };
>  
>  #if defined(CONFIG_SMP)
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index f4f29b9d90ee..2046a0f0f0f1 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>   int curvec, usedvecs;
>   cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
>   struct cpumask *masks = NULL;
> + int i, nr_sets;
>  
>   /*
>* If there aren't any vectors left after applying the pre/post
> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>   get_online_cpus();
>   build_node_to_cpumask(node_to_cpumask);
>  
> - /* Spread on present CPUs starting from affd->pre_vectors */
> - usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
> - node_to_cpumask, cpu_present_mask,
> - nmsk, masks);
> + /*
> +  * Spread on present CPUs starting from affd->pre_vectors. If we
> +  * have multiple sets, build each sets affinity mask separately.
> +  */
> + nr_sets = affd->nr_sets;
> + if (!nr_sets)
> + nr_sets = 1;
> +
> + for (i = 0, usedvecs = 0; i < nr_sets; i++) {
> + int this_vecs = affd->sets ? affd->sets[i] : affvecs;
> + int nr;
> +
> + nr = irq_build_affinity_masks(affd, curvec, this_vecs,
> +   node_to_cpumask, cpu_present_mask,
> +   nmsk, masks + usedvecs);

The last parameter of the above function should have been 'masks',
because irq_build_affinity_masks() always treats 'masks' as the base
address of the array.

> + usedvecs += nr;
> + }

Thinking about it further, one big problem in this patch is that each set of
IRQs should be spread over all possible CPUs, which is normally done via the
2-stage spread.

However, this patch only spreads each set of IRQs over present CPUs; this
may not work in the case of physical CPU hotplug.
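
For clarity, the "2-stage spread" means that vectors are first distributed
over present CPUs, and the remaining possible-but-offline CPUs are then
folded onto the same vectors, so that a later CPU hotplug still finds a
mapping. A toy userspace sketch of the idea (plain arrays instead of
cpumasks, no NUMA awareness; only an illustration, not the kernel code):

#include <stdio.h>

#define NR_VECS 3
#define NR_CPUS 6

int main(void)
{
	/* Toy topology: CPUs 0-2 are present, CPUs 3-5 are possible but offline. */
	int present[NR_CPUS] = { 1, 1, 1, 0, 0, 0 };
	int vec_of_cpu[NR_CPUS];
	int curvec = 0;

	/* Stage 1: spread the vectors over present CPUs. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!present[cpu])
			continue;
		vec_of_cpu[cpu] = curvec;
		curvec = (curvec + 1) % NR_VECS;
	}

	/* Stage 2: assign the non-present but possible CPUs to the same
	 * vectors, continuing from where stage 1 stopped. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (present[cpu])
			continue;
		vec_of_cpu[cpu] = curvec;
		curvec = (curvec + 1) % NR_VECS;
	}

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		printf("CPU %d -> vector %d\n", cpu, vec_of_cpu[cpu]);
	return 0;
}

With sets, this whole two-stage pass needs to run once per set rather than
once across all vectors.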

Thanks,
Ming


[PATCH 0/4] irq: fix support for allocating sets of IRQs

2018-11-02 Thread Ming Lei
Hi Jens,

As I mentioned, there are at least two issues in the patch
'irq: add support for allocating (and affinitizing) sets of IRQs':

1) it is wrong to pass 'masks + usedvecs' to irq_build_affinity_masks()

2) we should spread all possible CPUs in a 2-stage way on each set of IRQs

The fix isn't trivial, and I introduce two extra patches as preparation
so that the implementation can be cleaner.

The patchset is against the mq-maps branch of the block tree; feel free to
integrate it into the whole multiple-queue-maps patchset.

Thanks,
Ming


Jens Axboe (1):
  irq: add support for allocating (and affinitizing) sets of IRQs

Ming Lei (3):
  Revert "irq: add support for allocating (and affinitizing) sets of
IRQs"
  irq: move 2-stage irq spread into one helper
  irq: pass first vector to __irq_build_affinity_masks

 kernel/irq/affinity.c | 119 +++---
 1 file changed, 75 insertions(+), 44 deletions(-)

Cc: Thomas Gleixner 
Cc: linux-ker...@vger.kernel.org
Cc: Hannes Reinecke 
Cc: Ming Lei 
Cc: Keith Busch 
Cc: Sagi Grimberg 

-- 
2.9.5



[PATCH 3/4] irq: pass first vector to __irq_build_affinity_masks

2018-11-02 Thread Ming Lei
No functional change; prepare for the following patch to support
allocating (and affinitizing) sets of IRQs, in which each set of IRQs
needs the whole 2-stage spread and the 1st vector should point to the
1st one in the set.

Cc: Thomas Gleixner 
Cc: linux-ker...@vger.kernel.org
Cc: Hannes Reinecke 
Cc: Ming Lei 
Cc: Keith Busch 
Cc: Sagi Grimberg 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a16b601604aa..9c74f21ab10e 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -95,14 +95,14 @@ static int get_nodes_in_cpumask(cpumask_var_t 
*node_to_cpumask,
 }
 
 static int __irq_build_affinity_masks(const struct irq_affinity *affd,
-   int startvec, int numvecs,
+   int startvec, int numvecs, int firstvec,
cpumask_var_t *node_to_cpumask,
const struct cpumask *cpu_mask,
struct cpumask *nmsk,
struct cpumask *masks)
 {
int n, nodes, cpus_per_vec, extra_vecs, done = 0;
-   int last_affv = affd->pre_vectors + numvecs;
+   int last_affv = firstvec + numvecs;
int curvec = startvec;
nodemask_t nodemsk = NODE_MASK_NONE;
 
@@ -121,7 +121,7 @@ static int __irq_build_affinity_masks(const struct 
irq_affinity *affd,
if (++done == numvecs)
break;
if (++curvec == last_affv)
-   curvec = affd->pre_vectors;
+   curvec = firstvec;
}
goto out;
}
@@ -130,7 +130,7 @@ static int __irq_build_affinity_masks(const struct 
irq_affinity *affd,
int ncpus, v, vecs_to_assign, vecs_per_node;
 
/* Spread the vectors per node */
-   vecs_per_node = (numvecs - (curvec - affd->pre_vectors)) / 
nodes;
+   vecs_per_node = (numvecs - (curvec - firstvec)) / nodes;
 
/* Get the cpus on this node which are in the mask */
cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
@@ -158,7 +158,7 @@ static int __irq_build_affinity_masks(const struct 
irq_affinity *affd,
if (done >= numvecs)
break;
if (curvec >= last_affv)
-   curvec = affd->pre_vectors;
+   curvec = firstvec;
--nodes;
}
 
@@ -191,8 +191,8 @@ static int irq_build_affinity_masks(const struct 
irq_affinity *affd,
 
/* Spread on present CPUs starting from affd->pre_vectors */
usedvecs = __irq_build_affinity_masks(affd, curvec, numvecs,
-   node_to_cpumask, cpu_present_mask,
-   nmsk, masks);
+   affd->pre_vectors, node_to_cpumask,
+   cpu_present_mask, nmsk, masks);
 
/*
 * Spread on non present CPUs starting from the next vector to be
@@ -206,8 +206,8 @@ static int irq_build_affinity_masks(const struct 
irq_affinity *affd,
curvec = affd->pre_vectors + usedvecs;
cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
usedvecs += __irq_build_affinity_masks(affd, curvec, numvecs,
-node_to_cpumask, npresmsk,
-nmsk, masks);
+   affd->pre_vectors, node_to_cpumask, 
npresmsk,
+   nmsk, masks);
put_online_cpus();
 
free_cpumask_var(npresmsk);
-- 
2.9.5



[PATCH 2/4] irq: move 2-stage irq spread into one helper

2018-11-02 Thread Ming Lei
No functional change; prepare for the following patch to support
allocating (and affinitizing) sets of IRQs.

Cc: Thomas Gleixner 
Cc: linux-ker...@vger.kernel.org
Cc: Hannes Reinecke 
Cc: Ming Lei 
Cc: Keith Busch 
Cc: Sagi Grimberg 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 92 +++
 1 file changed, 56 insertions(+), 36 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index f4f29b9d90ee..a16b601604aa 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,7 +94,7 @@ static int get_nodes_in_cpumask(cpumask_var_t 
*node_to_cpumask,
return nodes;
 }
 
-static int irq_build_affinity_masks(const struct irq_affinity *affd,
+static int __irq_build_affinity_masks(const struct irq_affinity *affd,
int startvec, int numvecs,
cpumask_var_t *node_to_cpumask,
const struct cpumask *cpu_mask,
@@ -166,6 +166,58 @@ static int irq_build_affinity_masks(const struct 
irq_affinity *affd,
return done;
 }
 
+/*
+ * build affinity in two stages:
+ * 1) spread present CPU on these vectors
+ * 2) spread other possible CPUs on these vectors
+ */
+static int irq_build_affinity_masks(const struct irq_affinity *affd,
+   int startvec, int numvecs,
+   cpumask_var_t *node_to_cpumask,
+   struct cpumask *masks)
+{
+   int curvec = startvec, usedvecs = -1;
+   cpumask_var_t nmsk, npresmsk;
+
+   if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+   return usedvecs;
+
+   if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
+   goto fail;
+
+   /* Stabilize the cpumasks */
+   get_online_cpus();
+   build_node_to_cpumask(node_to_cpumask);
+
+   /* Spread on present CPUs starting from affd->pre_vectors */
+   usedvecs = __irq_build_affinity_masks(affd, curvec, numvecs,
+   node_to_cpumask, cpu_present_mask,
+   nmsk, masks);
+
+   /*
+* Spread on non present CPUs starting from the next vector to be
+* handled. If the spreading of present CPUs already exhausted the
+* vector space, assign the non present CPUs to the already spread
+* out vectors.
+*/
+   if (usedvecs >= numvecs)
+   curvec = affd->pre_vectors;
+   else
+   curvec = affd->pre_vectors + usedvecs;
+   cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
+   usedvecs += __irq_build_affinity_masks(affd, curvec, numvecs,
+node_to_cpumask, npresmsk,
+nmsk, masks);
+   put_online_cpus();
+
+   free_cpumask_var(npresmsk);
+
+ fail:
+   free_cpumask_var(nmsk);
+
+   return usedvecs;
+}
+
 /**
  * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
  * @nvecs: The total number of vectors
@@ -178,7 +230,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
 {
int affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
int curvec, usedvecs;
-   cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
+   cpumask_var_t *node_to_cpumask;
struct cpumask *masks = NULL;
 
/*
@@ -188,15 +240,9 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
if (nvecs == affd->pre_vectors + affd->post_vectors)
return NULL;
 
-   if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
-   return NULL;
-
-   if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
-   goto outcpumsk;
-
node_to_cpumask = alloc_node_to_cpumask();
if (!node_to_cpumask)
-   goto outnpresmsk;
+   return NULL;
 
masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
if (!masks)
@@ -206,30 +252,8 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
for (curvec = 0; curvec < affd->pre_vectors; curvec++)
cpumask_copy(masks + curvec, irq_default_affinity);
 
-   /* Stabilize the cpumasks */
-   get_online_cpus();
-   build_node_to_cpumask(node_to_cpumask);
-
-   /* Spread on present CPUs starting from affd->pre_vectors */
usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
-   node_to_cpumask, cpu_present_mask,
-   nmsk, masks);
-
-   /*
-* Spread on non present CPUs starting from the next vector to be
-* handled. If the spreading of present CPUs already exhausted the
-* vector space, assign the non present CPUs to the already spread
-* out vectors.
-*/
-   if (usedvecs >= affvecs)
- 

[PATCH 4/4] irq: add support for allocating (and affinitizing) sets of IRQs

2018-11-02 Thread Ming Lei
From: Jens Axboe 

A driver may have a need to allocate multiple sets of MSI/MSI-X
interrupts, and have them appropriately affinitized. Add support for
defining a number of sets in the irq_affinity structure, of varying
sizes, and get each set affinitized correctly across the machine.

Cc: Thomas Gleixner 
Cc: linux-ker...@vger.kernel.org
Reviewed-by: Hannes Reinecke 
Reviewed-by: Ming Lei 
Reviewed-by: Keith Busch 
Reviewed-by: Sagi Grimberg 
Signed-off-by: Jens Axboe 
Signed-off-by: Ming Lei 
---
 drivers/pci/msi.c | 14 ++
 include/linux/interrupt.h |  4 +++
 kernel/irq/affinity.c | 71 ++-
 3 files changed, 70 insertions(+), 19 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index af24ed50a245..265ed3e4c920 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1036,6 +1036,13 @@ static int __pci_enable_msi_range(struct pci_dev *dev, 
int minvec, int maxvec,
if (maxvec < minvec)
return -ERANGE;
 
+   /*
+* If the caller is passing in sets, we can't support a range of
+* vectors. The caller needs to handle that.
+*/
+   if (affd && affd->nr_sets && minvec != maxvec)
+   return -EINVAL;
+
if (WARN_ON_ONCE(dev->msi_enabled))
return -EINVAL;
 
@@ -1087,6 +1094,13 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
if (maxvec < minvec)
return -ERANGE;
 
+   /*
+* If the caller is passing in sets, we can't support a range of
+* supported vectors. The caller needs to handle that.
+*/
+   if (affd && affd->nr_sets && minvec != maxvec)
+   return -EINVAL;
+
if (WARN_ON_ONCE(dev->msix_enabled))
return -EINVAL;
 
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 1d6711c28271..ca397ff40836 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -247,10 +247,14 @@ struct irq_affinity_notify {
  * the MSI(-X) vector space
  * @post_vectors:  Don't apply affinity to @post_vectors at end of
  * the MSI(-X) vector space
+ * @nr_sets:   Length of passed in *sets array
+ * @sets:  Number of affinitized sets
  */
 struct irq_affinity {
int pre_vectors;
int post_vectors;
+   int nr_sets;
+   int *sets;
 };
 
 #if defined(CONFIG_SMP)
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 9c74f21ab10e..d49d3bff702c 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -172,26 +172,28 @@ static int __irq_build_affinity_masks(const struct 
irq_affinity *affd,
  * 2) spread other possible CPUs on these vectors
  */
 static int irq_build_affinity_masks(const struct irq_affinity *affd,
-   int startvec, int numvecs,
+   int startvec, int numvecs, int firstvec,
cpumask_var_t *node_to_cpumask,
struct cpumask *masks)
 {
-   int curvec = startvec, usedvecs = -1;
+   int curvec = startvec, nr_present, nr_others;
+   int ret = -ENOMEM;
cpumask_var_t nmsk, npresmsk;
 
if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
-   return usedvecs;
+   return ret;
 
if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
goto fail;
 
+   ret = 0;
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
 
/* Spread on present CPUs starting from affd->pre_vectors */
-   usedvecs = __irq_build_affinity_masks(affd, curvec, numvecs,
-   affd->pre_vectors, node_to_cpumask,
+   nr_present = __irq_build_affinity_masks(affd, curvec, numvecs,
+   firstvec, node_to_cpumask,
cpu_present_mask, nmsk, masks);
 
/*
@@ -200,22 +202,24 @@ static int irq_build_affinity_masks(const struct 
irq_affinity *affd,
 * vector space, assign the non present CPUs to the already spread
 * out vectors.
 */
-   if (usedvecs >= numvecs)
-   curvec = affd->pre_vectors;
+   if (nr_present >= numvecs)
+   curvec = firstvec;
else
-   curvec = affd->pre_vectors + usedvecs;
+   curvec = firstvec + nr_present;
cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
-   usedvecs += __irq_build_affinity_masks(affd, curvec, numvecs,
-   affd->pre_vectors, node_to_cpumask, 
npresmsk,
+   nr_others = __irq_build_affinity_masks(affd, curvec, numvecs,
+   firstvec, node_to_cpumask, npresmsk,
nmsk, masks);
put_online_cpus()

Re: [PATCH 13/16] irq: add support for allocating (and affinitizing) sets of IRQs

2018-11-02 Thread Keith Busch
On Fri, Nov 02, 2018 at 10:37:07PM +0800, Ming Lei wrote:
> On Tue, Oct 30, 2018 at 12:32:49PM -0600, Jens Axboe wrote:
> > A driver may have a need to allocate multiple sets of MSI/MSI-X
> > interrupts, and have them appropriately affinitized. Add support for
> > defining a number of sets in the irq_affinity structure, of varying
> > sizes, and get each set affinitized correctly across the machine.
> > 
> > Cc: Thomas Gleixner 
> > Cc: linux-ker...@vger.kernel.org
> > Reviewed-by: Hannes Reinecke 
> > Reviewed-by: Ming Lei 
> > Signed-off-by: Jens Axboe 
> > ---
> >  drivers/pci/msi.c | 14 ++
> >  include/linux/interrupt.h |  4 
> >  kernel/irq/affinity.c | 40 ++-
> >  3 files changed, 49 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index af24ed50a245..e6c6e10b9ceb 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -1036,6 +1036,13 @@ static int __pci_enable_msi_range(struct pci_dev 
> > *dev, int minvec, int maxvec,
> > if (maxvec < minvec)
> > return -ERANGE;
> >  
> > +   /*
> > +* If the caller is passing in sets, we can't support a range of
> > +* vectors. The caller needs to handle that.
> > +*/
> > +   if (affd->nr_sets && minvec != maxvec)
> > +   return -EINVAL;
> > +
> > if (WARN_ON_ONCE(dev->msi_enabled))
> > return -EINVAL;
> >  
> > @@ -1087,6 +1094,13 @@ static int __pci_enable_msix_range(struct pci_dev 
> > *dev,
> > if (maxvec < minvec)
> > return -ERANGE;
> >  
> > +   /*
> > +* If the caller is passing in sets, we can't support a range of
> > +* supported vectors. The caller needs to handle that.
> > +*/
> > +   if (affd->nr_sets && minvec != maxvec)
> > +   return -EINVAL;
> > +
> > if (WARN_ON_ONCE(dev->msix_enabled))
> > return -EINVAL;
> >  
> > diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> > index 1d6711c28271..ca397ff40836 100644
> > --- a/include/linux/interrupt.h
> > +++ b/include/linux/interrupt.h
> > @@ -247,10 +247,14 @@ struct irq_affinity_notify {
> >   * the MSI(-X) vector space
> >   * @post_vectors:  Don't apply affinity to @post_vectors at end of
> >   * the MSI(-X) vector space
> > + * @nr_sets:   Length of passed in *sets array
> > + * @sets:  Number of affinitized sets
> >   */
> >  struct irq_affinity {
> > int pre_vectors;
> > int post_vectors;
> > +   int nr_sets;
> > +   int *sets;
> >  };
> >  
> >  #if defined(CONFIG_SMP)
> > diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> > index f4f29b9d90ee..2046a0f0f0f1 100644
> > --- a/kernel/irq/affinity.c
> > +++ b/kernel/irq/affinity.c
> > @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct 
> > irq_affinity *affd)
> > int curvec, usedvecs;
> > cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
> > struct cpumask *masks = NULL;
> > +   int i, nr_sets;
> >  
> > /*
> >  * If there aren't any vectors left after applying the pre/post
> > @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct 
> > irq_affinity *affd)
> > get_online_cpus();
> > build_node_to_cpumask(node_to_cpumask);
> >  
> > -   /* Spread on present CPUs starting from affd->pre_vectors */
> > -   usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
> > -   node_to_cpumask, cpu_present_mask,
> > -   nmsk, masks);
> > +   /*
> > +* Spread on present CPUs starting from affd->pre_vectors. If we
> > +* have multiple sets, build each sets affinity mask separately.
> > +*/
> > +   nr_sets = affd->nr_sets;
> > +   if (!nr_sets)
> > +   nr_sets = 1;
> > +
> > +   for (i = 0, usedvecs = 0; i < nr_sets; i++) {
> > +   int this_vecs = affd->sets ? affd->sets[i] : affvecs;
> > +   int nr;
> > +
> > +   nr = irq_build_affinity_masks(affd, curvec, this_vecs,
> > + node_to_cpumask, cpu_present_mask,
> > + nmsk, masks + usedvecs);
> 
> The last parameter of the above function should have been 'masks',
> because irq_build_affinity_masks() always treats 'masks' as the base
> address of the array.

We have multiple "bases" when using sets, so we have to update which
base to use by advancing it accordingly. If you just use 'masks', then you're
going to overwrite the masks from the previous set.
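
A toy userspace sketch of that indexing (plain ints instead of cpumasks, and a
hypothetical spread_one_set() standing in for irq_build_affinity_masks(); only
an illustration, not the kernel code): each set fills its own slice of the
masks array, so the base pointer has to advance by the vectors already consumed.

#include <stdio.h>

/* Hypothetical stand-in for irq_build_affinity_masks(): fills 'count'
 * entries starting at 'base' and returns how many it filled. */
static int spread_one_set(int *base, int count, int label)
{
	for (int i = 0; i < count; i++)
		base[i] = label;
	return count;
}

int main(void)
{
	int masks[6] = { 0 };
	int sets[] = { 4, 2 };	/* e.g. 4 I/O-queue vectors and 2 poll-queue vectors */
	int usedvecs = 0;

	for (int i = 0; i < 2; i++) {
		/* 'masks + usedvecs' gives each set its own slice; passing plain
		 * 'masks' would overwrite the entries written for the previous set. */
		usedvecs += spread_one_set(masks + usedvecs, sets[i], i + 1);
	}

	for (int v = 0; v < 6; v++)
		printf("vector %d -> set %d\n", v, masks[v]);
	return 0;
}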


Re: [PATCH 0/2] loop: Better discard for block devices

2018-11-02 Thread Bart Van Assche
On Thu, 2018-11-01 at 15:44 -0700, Gwendal Grignou wrote:
> On Thu, Nov 1, 2018 at 11:15 AM Evan Green  wrote:
> > 
> > On Tue, Oct 30, 2018 at 4:50 PM Bart Van Assche  wrote:
> > > 
> > > On Tue, 2018-10-30 at 16:06 -0700, Evan Green wrote:
> > > > This series addresses some errors seen when using the loop
> > > > device directly backed by a block device. The first change plumbs
> > > > out the correct error message, and the second change prevents the
> > > > error from occurring in many cases.
> > > 
> > > Hi Evan,
> > > 
> > > Can you provide some information about the use case? Why do you think that
> > > it would be useful to support backing a loop device by a block device? Why
> > > to use the loop driver instead of dm-linear for this use case?
> > > 
> > 
> > Hi Bart,
> > In our case, the Chrome OS installer uses the loop device to map
> > slices of the disk that will ultimately represent partitions [1]. I
> > believe it has been doing install this way for a very long time, and
> > has been working well. It actually continues to work, but on block
> > devices that don't support discard operations, things are a tiny bit
> > bumpy. This series is meant to smooth out those bumps. As far as I
> > knew this was a supported scenario.
> > 
> > -Evan
> > [1] 
> > https://chromium.googlesource.com/chromiumos/platform/installer/+/master/chromeos-install
> 
> The code has moved to
> https://chromium.googlesource.com/chromiumos/platform2/+/master/installer/chromeos-install
> but the idea is the same. We create a loop device to abstract the
> persistent destination. The destination can be a block device or a
> file. The latter case is used for creating master images to be flashed
> onto a memory chip before soldering on the production line.
> It is handy when the final device is 4K-block-aligned but the builder
> is using a 512b-block-aligned device: we can mount a device over a file
> that will behave like the real device we will flash the image onto.

Hi Evan and Gwendal,

Since this is a new use case for the loop driver you may want to add a test
for this use case to the blktests project. Many block layer contributors run
these tests to verify their own block layer changes. Contributing a blktests
test for this new use case will make it easier for others to verify that
their changes do not break your use case.

Bart.


Re: [PATCH RFC] block, bfq: set default slice_idle to zero for non-rotational devices

2018-11-02 Thread Paolo Valente



> Il giorno 1 nov 2018, alle ore 22:06, Holger Hoffstätte 
>  ha scritto:
> 
> On 11/01/18 18:43, Konstantin Khlebnikov wrote:
>> With default 8ms idle slice BFQ is up to 10 times slower than CFQ
>> for massive random read workloads for common SATA SSD.
>> For now zero idle slice gives better out of box experience.
>> CFQ employs this since commit 41c0126b3f22 ("block: Make CFQ default
>> to IOPS mode on SSDs")
> 
> Well, that's interesting because 3 years ago I made the same suggestion
> and was told that BFQ's heuristics automagically make it not idle when
> rotational=0.

Yep, that automagic is probably 50% of the goodness of BFQ.

If one just sets slice_idle=0, then throughput is always maximum with
random I/O, but there is no control over I/O any longer.

At any rate, Konstantin, if you have some use case where BFQ fails,
I'll be very glad to analyze it, and hopefully improve BFQ.  Just one
request: use at least a 4.19.

Thanks,
Paolo

> Did you actually benchmark this? I just tried and don't
> get a noticeable performance difference with slice_idle=0 compared to
> deadline.
> 
> Discussion link:
> https://groups.google.com/forum/#!msg/bfq-iosched/iRMw2n3kYLY/6l9cIm3TBgAJ
> 
> curious..
> 
> Holger



Re: [PATCH v9] virtio_blk: add discard and write zeroes support

2018-11-02 Thread Daniel Verkamp
Hi Dongli,

Unfortunately, I am not aware of any in-progress implementation of
this feature for qemu. It hopefully should not be too difficult to
wire up in the qemu virtio-blk model, but I haven't looked into it in
detail.

Thanks,
-- Daniel
On Thu, Nov 1, 2018 at 4:42 PM Dongli Zhang  wrote:
>
> Hi Daniel,
>
> Other than crosvm, is there any version of qemu (e.g., repositories in
> development on GitHub) where I can try this feature?
>
> Thank you very much!
>
> Dongli Zhang
>
> On 11/02/2018 06:40 AM, Daniel Verkamp wrote:
> > From: Changpeng Liu 
> >
> > In commit 88c85538, "virtio-blk: add discard and write zeroes features
> > to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
> > block specification has been extended to add VIRTIO_BLK_T_DISCARD and
> > VIRTIO_BLK_T_WRITE_ZEROES commands.  This patch enables support for
> > discard and write zeroes in the virtio-blk driver when the device
> > advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
> > VIRTIO_BLK_F_WRITE_ZEROES.
> >
> > Signed-off-by: Changpeng Liu 
> > Signed-off-by: Daniel Verkamp 
> > ---
> > dverkamp: I've picked up this patch and made a few minor changes (as
> > listed below); most notably, I changed the kmalloc back to GFP_ATOMIC,
> > since it can be called from a context where sleeping is not allowed.
> > To prevent large allocations, I've also clamped the maximum number of
> > discard segments to 256; this results in a 4K allocation and should be
> > plenty of descriptors for most use cases.
> >
> > I also removed most of the description from the commit message, since it
> > was duplicating the comments from virtio_blk.h and quoting parts of the
> > spec without adding any extra information.  I have tested this iteration
> > of the patch using crosvm with modifications to enable the new features:
> > https://chromium.googlesource.com/chromiumos/platform/crosvm/
> >
> > v9 fixes a number of review issues; I didn't attempt to optimize the
> > single-element write zeroes case, so it still does an allocation per
> > request (I did not see any easy place to put the payload that would
> > avoid the allocation).
> >
> > CHANGELOG:
> > v9: [dverkamp] fix LE types in discard struct; cleanups from Ming Lei
> > v8: [dverkamp] replace shifts by 9 with SECTOR_SHIFT constant
> > v7: [dverkamp] use GFP_ATOMIC for allocation that may not sleep; clarify
> > descriptor flags field; comment wording cleanups.
> > v6: don't set T_OUT bit to discard and write zeroes commands.
> > v5: use new block layer API: blk_queue_flag_set.
> > v4: several optimizations based on MST's comments, remove bit field
> > usage for command descriptor.
> > v3: define the virtio-blk protocol to add discard and write zeroes
> > support, first version implementation based on proposed specification.
> > v2: add write zeroes command support.
> > v1: initial proposal implementation for discard command.
> > ---
> >  drivers/block/virtio_blk.c  | 83 -
> >  include/uapi/linux/virtio_blk.h | 54 +
> >  2 files changed, 135 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > index 086c6bb12baa..0f39efb4b3aa 100644
> > --- a/drivers/block/virtio_blk.c
> > +++ b/drivers/block/virtio_blk.c
> > @@ -18,6 +18,7 @@
> >
> >  #define PART_BITS 4
> >  #define VQ_NAME_LEN 16
> > +#define MAX_DISCARD_SEGMENTS 256u
> >
> >  static int major;
> >  static DEFINE_IDA(vd_index_ida);
> > @@ -172,10 +173,48 @@ static int virtblk_add_req(struct virtqueue *vq, 
> > struct virtblk_req *vbr,
> >   return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
> >  }
> >
> > +static int virtblk_setup_discard_write_zeroes(struct request *req, bool 
> > unmap)
> > +{
> > + unsigned short segments = blk_rq_nr_discard_segments(req);
> > + unsigned short n = 0;
> > + struct virtio_blk_discard_write_zeroes *range;
> > + struct bio *bio;
> > + u32 flags = 0;
> > +
> > + if (unmap)
> > + flags |= VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP;
> > +
> > + range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
> > + if (!range)
> > + return -ENOMEM;
> > +
> > + __rq_for_each_bio(bio, req) {
> > + u64 sector = bio->bi_iter.bi_sector;
> > + u32 num_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> > +
> > + range[n].flags = cpu_to_le32(flags);
> > + range[n].num_sectors = cpu_to_le32(num_sectors);
> > + range[n].sector = cpu_to_le64(sector);
> > + n++;
> > + }
> > +
> > + req->special_vec.bv_page = virt_to_page(range);
> > + req->special_vec.bv_offset = offset_in_page(range);
> > + req->special_vec.bv_len = sizeof(*range) * segments;
> > + req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> > +
> > + return 0;
> > +}
> > +
> >  static inline void virtblk_request_done(struct request *req)
> >  {
> >   struct vi

Re: INFO: task hung in lo_release

2018-11-02 Thread Dmitry Vyukov
On Wed, Jul 18, 2018 at 4:28 PM, Tetsuo Handa
 wrote:
> On 2018/07/18 21:46, syzbot wrote:
>> Showing all locks held in the system:
>> 1 lock held by khungtaskd/902:
>>  #0: 4f60bbd2 (rcu_read_lock){}, at: 
>> debug_show_all_locks+0xd0/0x428 kernel/locking/lockdep.c:4461
>> 1 lock held by rsyslogd/4455:
>>  #0: 86a2d206 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x1bb/0x200 
>> fs/file.c:766
>> 2 locks held by getty/4545:
>>  #0: ece833eb (&tty->ldisc_sem){}, at: ldsem_down_read+0x37/0x40 
>> drivers/tty/tty_ldsem.c:365
>>  #1: 536bed00 (&ldata->atomic_read_lock){+.+.}, at: 
>> n_tty_read+0x335/0x1ce0 drivers/tty/n_tty.c:2140
>> 2 locks held by getty/4546:
>>  #0: 180e8f60 (&tty->ldisc_sem){}, at: ldsem_down_read+0x37/0x40 
>> drivers/tty/tty_ldsem.c:365
>>  #1: 8efac671 (&ldata->atomic_read_lock){+.+.}, at: 
>> n_tty_read+0x335/0x1ce0 drivers/tty/n_tty.c:2140
>> 2 locks held by getty/4547:
>>  #0: ca308631 (&tty->ldisc_sem){}, at: ldsem_down_read+0x37/0x40 
>> drivers/tty/tty_ldsem.c:365
>>  #1: 7c05fef3 (&ldata->atomic_read_lock){+.+.}, at: 
>> n_tty_read+0x335/0x1ce0 drivers/tty/n_tty.c:2140
>> 2 locks held by getty/4548:
>>  #0: 9d93809c (&tty->ldisc_sem){}, at: ldsem_down_read+0x37/0x40 
>> drivers/tty/tty_ldsem.c:365
>>  #1: 4c489ffa (&ldata->atomic_read_lock){+.+.}, at: 
>> n_tty_read+0x335/0x1ce0 drivers/tty/n_tty.c:2140
>> 2 locks held by getty/4549:
>>  #0: ec3b322c (&tty->ldisc_sem){}, at: ldsem_down_read+0x37/0x40 
>> drivers/tty/tty_ldsem.c:365
>>  #1: 107aeb96 (&ldata->atomic_read_lock){+.+.}, at: 
>> n_tty_read+0x335/0x1ce0 drivers/tty/n_tty.c:2140
>> 2 locks held by getty/4550:
>>  #0: 6d1a7b96 (&tty->ldisc_sem){}, at: ldsem_down_read+0x37/0x40 
>> drivers/tty/tty_ldsem.c:365
>>  #1: 564c003d (&ldata->atomic_read_lock){+.+.}, at: 
>> n_tty_read+0x335/0x1ce0 drivers/tty/n_tty.c:2140
>> 2 locks held by getty/4551:
>>  #0: 3cba543a (&tty->ldisc_sem){}, at: ldsem_down_read+0x37/0x40 
>> drivers/tty/tty_ldsem.c:365
>>  #1: 149a289b (&ldata->atomic_read_lock){+.+.}, at: 
>> n_tty_read+0x335/0x1ce0 drivers/tty/n_tty.c:2140
>> 2 locks held by syz-executor6/4597:
>>  #0: 33676c6d (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0xc2/0x830 
>> fs/block_dev.c:1780
>>  #1: 127b5bfb (loop_index_mutex){+.+.}, at: lo_release+0x1f/0x1f0 
>> drivers/block/loop.c:1675
>> 2 locks held by blkid/18494:
>>  #0: 0efc6462 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x19b/0x13c0 
>> fs/block_dev.c:1463
>>  #1: 127b5bfb (loop_index_mutex){+.+.}, at: lo_open+0x1b/0xb0 
>> drivers/block/loop.c:1632
>> 1 lock held by syz-executor5/18515:
>>  #0: 127b5bfb (loop_index_mutex){+.+.}, at: 
>> loop_control_ioctl+0x91/0x540 drivers/block/loop.c:1999
>> 1 lock held by syz-executor1/18498:
>>  #0: 127b5bfb (loop_index_mutex){+.+.}, at: 
>> loop_control_ioctl+0x91/0x540 drivers/block/loop.c:1999
>> 1 lock held by syz-executor3/18521:
>>  #0: 127b5bfb (loop_index_mutex){+.+.}, at: 
>> loop_control_ioctl+0x91/0x540 drivers/block/loop.c:1999
>> 2 locks held by syz-executor3/18522:
>>  #0: 399ff791 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x19b/0x13c0 
>> fs/block_dev.c:1463
>>  #1: 127b5bfb (loop_index_mutex){+.+.}, at: lo_open+0x1b/0xb0 
>> drivers/block/loop.c:1632
>> 1 lock held by syz-executor4/18506:
>>  #0: 127b5bfb (loop_index_mutex){+.+.}, at: 
>> loop_control_ioctl+0x91/0x540 drivers/block/loop.c:1999
>> 1 lock held by syz-executor0/18508:
>> 1 lock held by syz-executor7/18507:
>>  #0: 127b5bfb (loop_index_mutex){+.+.}, at: 
>> loop_control_ioctl+0x91/0x540 drivers/block/loop.c:1999
>> 1 lock held by syz-executor2/18514:
>>  #0: 0efc6462 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x19b/0x13c0 
>> fs/block_dev.c:1463
>> 1 lock held by blkid/18513:
>>  #0: 33676c6d (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x19b/0x13c0 
>> fs/block_dev.c:1463
>> 1 lock held by blkid/18520:
>>  #0: 127b5bfb (loop_index_mutex){+.+.}, at: loop_probe+0x82/0x1d0 
>> drivers/block/loop.c:1979
>> 1 lock held by blkid/18524:
>>  #0: 399ff791 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x19b/0x13c0 
>> fs/block_dev.c:1463
>
> Dmitry, it is impossible to check what these lock holders are doing without
> a dump of these threads (they are not always TASK_UNINTERRUPTIBLE waiters;
> e.g. PID=18508 is TASK_RUNNING with a lock held).

I know. One day I will hopefully get to implementing dump collection...

> Jens, when can we start testing "[PATCH v3] block/loop: Serialize ioctl 
> operations." ?

Was that merged? If we have a potential fix, merging it may be the
simplest way to address it without debugging.
I see that this "[v4] block/loop: Serialize ioctl operations." still
has "State: New":
https://patchwork.kernel.org/patch/10612217/


Re: INFO: task hung in lo_release

2018-11-02 Thread Tetsuo Handa
On 2018/11/03 4:24, Dmitry Vyukov wrote:
>> Dmitry, it is impossible to check what these lock holders are doing without
>> a dump of these threads (they are not always TASK_UNINTERRUPTIBLE waiters;
>> e.g. PID=18508 is TASK_RUNNING with a lock held).
> 
> I know. One day I will hopefully get to implementing dump collection...
> 
>> Jens, when can we start testing "[PATCH v3] block/loop: Serialize ioctl 
>> operations." ?
> 
> Was that merged? If we have a potential fix, merging it may be the
> simplest way to address it without debugging.
> I see that this "[v4] block/loop: Serialize ioctl operations." still
> has "State: New":
> https://patchwork.kernel.org/patch/10612217/
> 

Jan Kara is going to resend 
https://marc.info/?i=20181010100415.26525-1-j...@suse.cz
after the merge window closes.


Re: [PATCH 13/16] irq: add support for allocating (and affinitizing) sets of IRQs

2018-11-02 Thread Ming Lei
On Fri, Nov 02, 2018 at 09:09:50AM -0600, Keith Busch wrote:
> On Fri, Nov 02, 2018 at 10:37:07PM +0800, Ming Lei wrote:
> > On Tue, Oct 30, 2018 at 12:32:49PM -0600, Jens Axboe wrote:
> > > A driver may have a need to allocate multiple sets of MSI/MSI-X
> > > interrupts, and have them appropriately affinitized. Add support for
> > > defining a number of sets in the irq_affinity structure, of varying
> > > sizes, and get each set affinitized correctly across the machine.
> > > 
> > > Cc: Thomas Gleixner 
> > > Cc: linux-ker...@vger.kernel.org
> > > Reviewed-by: Hannes Reinecke 
> > > Reviewed-by: Ming Lei 
> > > Signed-off-by: Jens Axboe 
> > > ---
> > >  drivers/pci/msi.c | 14 ++
> > >  include/linux/interrupt.h |  4 
> > >  kernel/irq/affinity.c | 40 ++-
> > >  3 files changed, 49 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > > index af24ed50a245..e6c6e10b9ceb 100644
> > > --- a/drivers/pci/msi.c
> > > +++ b/drivers/pci/msi.c
> > > @@ -1036,6 +1036,13 @@ static int __pci_enable_msi_range(struct pci_dev 
> > > *dev, int minvec, int maxvec,
> > >   if (maxvec < minvec)
> > >   return -ERANGE;
> > >  
> > > + /*
> > > +  * If the caller is passing in sets, we can't support a range of
> > > +  * vectors. The caller needs to handle that.
> > > +  */
> > > + if (affd->nr_sets && minvec != maxvec)
> > > + return -EINVAL;
> > > +
> > >   if (WARN_ON_ONCE(dev->msi_enabled))
> > >   return -EINVAL;
> > >  
> > > @@ -1087,6 +1094,13 @@ static int __pci_enable_msix_range(struct pci_dev 
> > > *dev,
> > >   if (maxvec < minvec)
> > >   return -ERANGE;
> > >  
> > > + /*
> > > +  * If the caller is passing in sets, we can't support a range of
> > > +  * supported vectors. The caller needs to handle that.
> > > +  */
> > > + if (affd->nr_sets && minvec != maxvec)
> > > + return -EINVAL;
> > > +
> > >   if (WARN_ON_ONCE(dev->msix_enabled))
> > >   return -EINVAL;
> > >  
> > > diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> > > index 1d6711c28271..ca397ff40836 100644
> > > --- a/include/linux/interrupt.h
> > > +++ b/include/linux/interrupt.h
> > > @@ -247,10 +247,14 @@ struct irq_affinity_notify {
> > >   *   the MSI(-X) vector space
> > >   * @post_vectors:Don't apply affinity to @post_vectors at end of
> > >   *   the MSI(-X) vector space
> > > + * @nr_sets: Length of passed in *sets array
> > > + * @sets:Number of affinitized sets
> > >   */
> > >  struct irq_affinity {
> > >   int pre_vectors;
> > >   int post_vectors;
> > > + int nr_sets;
> > > + int *sets;
> > >  };
> > >  
> > >  #if defined(CONFIG_SMP)
> > > diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> > > index f4f29b9d90ee..2046a0f0f0f1 100644
> > > --- a/kernel/irq/affinity.c
> > > +++ b/kernel/irq/affinity.c
> > > @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct 
> > > irq_affinity *affd)
> > >   int curvec, usedvecs;
> > >   cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
> > >   struct cpumask *masks = NULL;
> > > + int i, nr_sets;
> > >  
> > >   /*
> > >* If there aren't any vectors left after applying the pre/post
> > > @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct 
> > > irq_affinity *affd)
> > >   get_online_cpus();
> > >   build_node_to_cpumask(node_to_cpumask);
> > >  
> > > - /* Spread on present CPUs starting from affd->pre_vectors */
> > > - usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
> > > - node_to_cpumask, cpu_present_mask,
> > > - nmsk, masks);
> > > + /*
> > > +  * Spread on present CPUs starting from affd->pre_vectors. If we
> > > +  * have multiple sets, build each sets affinity mask separately.
> > > +  */
> > > + nr_sets = affd->nr_sets;
> > > + if (!nr_sets)
> > > + nr_sets = 1;
> > > +
> > > + for (i = 0, usedvecs = 0; i < nr_sets; i++) {
> > > + int this_vecs = affd->sets ? affd->sets[i] : affvecs;
> > > + int nr;
> > > +
> > > + nr = irq_build_affinity_masks(affd, curvec, this_vecs,
> > > +   node_to_cpumask, cpu_present_mask,
> > > +   nmsk, masks + usedvecs);
> > 
> > The last parameter of the above function should have been 'masks',
> > because irq_build_affinity_masks() always treats 'masks' as the base
> > address of the array.
> 
> We have multiple "bases" when using sets, so we have to advance the base
> accordingly for each set. If you just use 'masks', then you're going to
> overwrite your masks from the previous set.

For irq_build_affinity_masks(), the passed 'startvec' is always relative
to the absolute 1st element, so the passed 'masks' should always be the
absolute base too. Not to mention 'cu
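
To illustrate the indexing concern with a standalone sketch (build_set() and
spread_sets() below are toy stand-ins, not the kernel functions), assume the
helper writes its output at the absolute vector index, the way
irq_build_affinity_masks() writes to masks + curvec:

#include <linux/cpumask.h>

/*
 * Toy stand-in for the helper: it writes masks[startvec + i], i.e. it
 * indexes the output array with the absolute vector number.
 */
static int build_set(int startvec, int numvecs, struct cpumask *masks)
{
	int i;

	for (i = 0; i < numvecs; i++)
		cpumask_clear(&masks[startvec + i]);
	return numvecs;
}

/*
 * Caller loop: if both the absolute 'curvec' and the 'masks' pointer are
 * advanced per set, set i ends up writing to masks[usedvecs + curvec + i],
 * i.e. beyond the intended slots; only one of the two offsets should move.
 */
static void spread_sets(const int *sets, int nr_sets, int curvec,
			struct cpumask *masks)
{
	int i, usedvecs = 0;

	for (i = 0; i < nr_sets; i++) {
		int nr = build_set(curvec, sets[i], masks + usedvecs);

		curvec += nr;
		usedvecs += nr;
	}
}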