Re: [PATCH 04/13] block: Move bdi_unregister() to del_gendisk()

2017-02-22 Thread Jan Kara
On Tue 21-02-17 19:53:29, Bart Van Assche wrote:
> On 02/21/2017 09:55 AM, Jan Kara wrote:
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 47104f6a398b..9a901dcfdd5c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -580,8 +580,6 @@ void blk_cleanup_queue(struct request_queue *q)
> > q->queue_lock = &q->__queue_lock;
> > spin_unlock_irq(lock);
> >  
> > -   bdi_unregister(q->backing_dev_info);
> > -
> > /* @q is and will stay empty, shutdown and put */
> > blk_put_queue(q);
> >  }
> > diff --git a/block/genhd.c b/block/genhd.c
> > index f6c4d4400759..68c613edb93a 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -660,6 +660,13 @@ void del_gendisk(struct gendisk *disk)
> > disk->flags &= ~GENHD_FL_UP;
> >  
> > sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
> > +   /*
> > +* Unregister bdi before releasing device numbers (as they can get
> > +* reused and we'd get clashes in sysfs) but after bdev inodes are
> > +* unhashed and thus will be soon destroyed as bdev inode's reference
> > +* to wb_writeback can block bdi_unregister().
> > +*/
> > +   bdi_unregister(disk->queue->backing_dev_info);
> > blk_unregister_queue(disk);
> > blk_unregister_region(disk_devt(disk), disk->minors);
> 
> This change looks suspicious to me. There are drivers that create a
> block layer queue but neither call device_add_disk() nor del_gendisk(),
> e.g. drivers/scsi/st.c. Although bdi_init() will be called for the
> queues created by these drivers, this patch will cause the
> bdi_unregister() call to be skipped for these drivers.

Well, the thing is that bdi_unregister() is the counterpart to
bdi_register(). Unless you call bdi_register(), which happens only in
device_add_disk() (and some filesystems which create their private bdis),
there's no point in calling bdi_unregister(). The counterpart to bdi_init() is
bdi_exit(), and that always gets called once the bdi reference count drops to 0.

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH 0/4] blk-mq: cleanup on all kinds of kobjects

2017-02-22 Thread Ming Lei
This patchset cleans up the kobjects of request_queue.mq_kobj, the
sw queues' kobjects and the hw queues' kobjects.

The 1st patch initializes the kobjects of the request_queue and sw
queues in blk_mq_init_allocated_queue(), so we can avoid initializing
these kobjects several times; this patch also fixes one kernel
warning reported by Omar Sandoval.

The 2nd patch makes the lifetime consistent between the mq request
queue/sw queues and their kobjects.

The 3rd patch makes the lifetime consistent between hw queues and their
kobjects.

The last patch is a followup to the 3rd patch.

Thanks,
Ming

Ming Lei (4):
  blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue()
  blk-mq: make lifetime consistent between q/ctx and its kobject
  blk-mq: make lifetime consistent between hctx and its kobject
  blk-mq: free hctx->cpumask in release handler of hctx's kobject

 block/blk-mq-sysfs.c | 40 +---
 block/blk-mq.c   | 28 ++--
 block/blk-mq.h   |  2 ++
 3 files changed, 37 insertions(+), 33 deletions(-)

-- 
2.7.4



[PATCH 3/4] blk-mq: make lifetime consistent between hctx and its kobject

2017-02-22 Thread Ming Lei
This patch removes the kobject_put() on hctx in __blk_mq_unregister_dev(),
and tries to keep the lifetime consistent between an hctx and its kobject.

Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become
totally symmetrical, and the kobject's refcount drops to zero exactly
when the hctx is freed.

Signed-off-by: Ming Lei 
---
 block/blk-mq-sysfs.c | 15 ++-
 block/blk-mq.c   |  5 +
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 77fb238af2be..cb19ec16a7fc 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -17,6 +17,14 @@ static void blk_mq_sysfs_release(struct kobject *kobj)
 {
 }
 
+static void blk_mq_hw_sysfs_release(struct kobject *kobj)
+{
+   struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
+ kobj);
+   kfree(hctx->ctxs);
+   kfree(hctx);
+}
+
 struct blk_mq_ctx_sysfs_entry {
struct attribute attr;
ssize_t (*show)(struct blk_mq_ctx *, char *);
@@ -200,7 +208,7 @@ static struct kobj_type blk_mq_ctx_ktype = {
 static struct kobj_type blk_mq_hw_ktype = {
.sysfs_ops  = &blk_mq_hw_sysfs_ops,
.default_attrs  = default_hw_ctx_attrs,
-   .release= blk_mq_sysfs_release,
+   .release= blk_mq_hw_sysfs_release,
 };
 
 static void blk_mq_unregister_hctx(struct blk_mq_hw_ctx *hctx)
@@ -244,12 +252,9 @@ static void __blk_mq_unregister_dev(struct device *dev, 
struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
int i;
 
-   queue_for_each_hw_ctx(q, hctx, i) {
+   queue_for_each_hw_ctx(q, hctx, i)
blk_mq_unregister_hctx(hctx);
 
-   kobject_put(&hctx->kobj);
-   }
-
blk_mq_debugfs_unregister_hctxs(q);
 
kobject_uevent(&q->mq_kobj, KOBJ_REMOVE);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index bc45736d70ba..993bb4d4d909 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2152,8 +2152,7 @@ void blk_mq_release(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i) {
if (!hctx)
continue;
-   kfree(hctx->ctxs);
-   kfree(hctx);
+   kobject_put(&hctx->kobj);
}
 
q->mq_map = NULL;
@@ -2232,8 +2231,6 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set 
*set,
blk_mq_exit_hctx(q, set, hctx, j);
free_cpumask_var(hctx->cpumask);
kobject_put(&hctx->kobj);
-   kfree(hctx->ctxs);
-   kfree(hctx);
hctxs[j] = NULL;
 
}
-- 
2.7.4



[PATCH 1/4] blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue()

2017-02-22 Thread Ming Lei
Both q->mq_kobj and the sw queues' kobjects should be initialized only
once, instead of in each add_disk context.

This patch also removes the clearing of ctx in blk_mq_init_cpu_queues()
because the percpu allocator zeroes the allocated memory.

This patch fixes one issue[1] reported by Omar.

[1] kernel warning when doing unbind/bind on a scsi-mq device

[   19.347924] kobject (8800791ea0b8): tried to init an initialized object, 
something is seriously wrong.
[   19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 
4.10.0-rc7-00210-g53f39eeaa263 #34
[   19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.1-20161122_114906-anatol 04/01/2014
[   19.350920] Workqueue: events_unbound async_run_entry_fn
[   19.350920] Call Trace:
[   19.350920]  dump_stack+0x63/0x83
[   19.350920]  kobject_init+0x77/0x90
[   19.350920]  blk_mq_register_dev+0x40/0x130
[   19.350920]  blk_register_queue+0xb6/0x190
[   19.350920]  device_add_disk+0x1ec/0x4b0
[   19.350920]  sd_probe_async+0x10d/0x1c0 [sd_mod]
[   19.350920]  async_run_entry_fn+0x48/0x150
[   19.350920]  process_one_work+0x1d0/0x480
[   19.350920]  worker_thread+0x48/0x4e0
[   19.350920]  kthread+0x101/0x140
[   19.350920]  ? process_one_work+0x480/0x480
[   19.350920]  ? kthread_create_on_node+0x60/0x60
[   19.350920]  ret_from_fork+0x2c/0x40

Cc: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sysfs.c | 4 +---
 block/blk-mq.c   | 4 +++-
 block/blk-mq.h   | 1 +
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 295e69670c39..124305407c80 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -277,7 +277,7 @@ void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx)
kobject_init(&hctx->kobj, &blk_mq_hw_ktype);
 }
 
-static void blk_mq_sysfs_init(struct request_queue *q)
+void blk_mq_sysfs_init(struct request_queue *q)
 {
struct blk_mq_ctx *ctx;
int cpu;
@@ -297,8 +297,6 @@ int blk_mq_register_dev(struct device *dev, struct 
request_queue *q)
 
blk_mq_disable_hotplug();
 
-   blk_mq_sysfs_init(q);
-
ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
if (ret < 0)
goto out;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b29e7dc7b309..ade1d7cde37e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1941,7 +1941,6 @@ static void blk_mq_init_cpu_queues(struct request_queue 
*q,
struct blk_mq_ctx *__ctx = per_cpu_ptr(q->queue_ctx, i);
struct blk_mq_hw_ctx *hctx;
 
-   memset(__ctx, 0, sizeof(*__ctx));
__ctx->cpu = i;
spin_lock_init(&__ctx->lock);
INIT_LIST_HEAD(&__ctx->rq_list);
@@ -2248,6 +2247,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
if (!q->queue_ctx)
goto err_exit;
 
+   /* init q->mq_kobj and sw queues' kobjects */
+   blk_mq_sysfs_init(q);
+
q->queue_hw_ctx = kzalloc_node(nr_cpu_ids * sizeof(*(q->queue_hw_ctx)),
GFP_KERNEL, set->numa_node);
if (!q->queue_hw_ctx)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 24b2256186f3..445343425f6d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -77,6 +77,7 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct 
request_queue *q,
 /*
  * sysfs helpers
  */
+extern void blk_mq_sysfs_init(struct request_queue *q);
 extern int blk_mq_sysfs_register(struct request_queue *q);
 extern void blk_mq_sysfs_unregister(struct request_queue *q);
 extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
-- 
2.7.4



[PATCH 2/4] blk-mq: make lifetime consistent between q/ctx and its kobject

2017-02-22 Thread Ming Lei
Currently, from the kobject view, both q->mq_kobj and ctx->kobj can
be released during one cycle of blk_mq_register_dev() and
blk_mq_unregister_dev(). Actually, the sw queue's lifetime is the
same as its request queue's, which is covered by request_queue->kobj.

So we don't need to call kobject_put() for these two kinds of
kobject in __blk_mq_unregister_dev(); instead we do that
in the release handler of the request queue.

Signed-off-by: Ming Lei 
---
 block/blk-mq-sysfs.c | 20 +---
 block/blk-mq.c   |  7 ++-
 block/blk-mq.h   |  1 +
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 124305407c80..77fb238af2be 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -242,15 +242,11 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx 
*hctx)
 static void __blk_mq_unregister_dev(struct device *dev, struct request_queue 
*q)
 {
struct blk_mq_hw_ctx *hctx;
-   struct blk_mq_ctx *ctx;
-   int i, j;
+   int i;
 
queue_for_each_hw_ctx(q, hctx, i) {
blk_mq_unregister_hctx(hctx);
 
-   hctx_for_each_ctx(hctx, ctx, j)
-   kobject_put(&ctx->kobj);
-
kobject_put(&hctx->kobj);
}
 
@@ -258,8 +254,6 @@ static void __blk_mq_unregister_dev(struct device *dev, 
struct request_queue *q)
 
kobject_uevent(&q->mq_kobj, KOBJ_REMOVE);
kobject_del(&q->mq_kobj);
-   kobject_put(&q->mq_kobj);
-
kobject_put(&dev->kobj);
 
q->mq_sysfs_init_done = false;
@@ -277,6 +271,18 @@ void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx)
kobject_init(&hctx->kobj, &blk_mq_hw_ktype);
 }
 
+void blk_mq_sysfs_deinit(struct request_queue *q)
+{
+   struct blk_mq_ctx *ctx;
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   ctx = per_cpu_ptr(q->queue_ctx, cpu);
+   kobject_put(&ctx->kobj);
+   }
+   kobject_put(&q->mq_kobj);
+}
+
 void blk_mq_sysfs_init(struct request_queue *q)
 {
struct blk_mq_ctx *ctx;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ade1d7cde37e..bc45736d70ba 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2160,7 +2160,12 @@ void blk_mq_release(struct request_queue *q)
 
kfree(q->queue_hw_ctx);
 
-   /* ctx kobj stays in queue_ctx */
+   /*
+* release .mq_kobj and sw queue's kobject now because
+* both share lifetime with request queue.
+*/
+   blk_mq_sysfs_deinit(q);
+
free_percpu(q->queue_ctx);
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 445343425f6d..2cd73b762351 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -78,6 +78,7 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct 
request_queue *q,
  * sysfs helpers
  */
 extern void blk_mq_sysfs_init(struct request_queue *q);
+extern void blk_mq_sysfs_deinit(struct request_queue *q);
 extern int blk_mq_sysfs_register(struct request_queue *q);
 extern void blk_mq_sysfs_unregister(struct request_queue *q);
 extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
-- 
2.7.4



[PATCH 4/4] blk-mq: free hctx->cpumask in release handler of hctx's kobject

2017-02-22 Thread Ming Lei
hctx->cpumask is per-hctx and the two obviously share the same
lifetime, so this patch moves the freeing of hctx->cpumask into the
release handler of the hctx's kobject.

Signed-off-by: Ming Lei 
---
 block/blk-mq-sysfs.c |  1 +
 block/blk-mq.c   | 12 
 2 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index cb19ec16a7fc..d745ab81033a 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -21,6 +21,7 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 {
struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
  kobj);
+   free_cpumask_var(hctx->cpumask);
kfree(hctx->ctxs);
kfree(hctx);
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 993bb4d4d909..db97f53e99d0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1851,16 +1851,6 @@ static void blk_mq_exit_hw_queues(struct request_queue 
*q,
}
 }
 
-static void blk_mq_free_hw_queues(struct request_queue *q,
-   struct blk_mq_tag_set *set)
-{
-   struct blk_mq_hw_ctx *hctx;
-   unsigned int i;
-
-   queue_for_each_hw_ctx(q, hctx, i)
-   free_cpumask_var(hctx->cpumask);
-}
-
 static int blk_mq_init_hctx(struct request_queue *q,
struct blk_mq_tag_set *set,
struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
@@ -2229,7 +2219,6 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set 
*set,
if (hctx->tags)
blk_mq_free_map_and_requests(set, j);
blk_mq_exit_hctx(q, set, hctx, j);
-   free_cpumask_var(hctx->cpumask);
kobject_put(&hctx->kobj);
hctxs[j] = NULL;
 
@@ -2342,7 +2331,6 @@ void blk_mq_free_queue(struct request_queue *q)
blk_mq_del_queue_tag_set(q);
 
blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
-   blk_mq_free_hw_queues(q, set);
 }
 
 /* Basically redo blk_mq_init_queue with queue frozen */
-- 
2.7.4



Re: Manual driver binding and unbinding broken for SCSI

2017-02-22 Thread Ming Lei
On Wed, Feb 22, 2017 at 1:14 AM, Jan Kara  wrote:
> On Sun 19-02-17 18:19:58, Omar Sandoval wrote:
>> On Fri, Feb 17, 2017 at 04:43:56PM -0800, James Bottomley wrote:
>> > This seems to be related to a 0day test we got on the block tree,
>> > details here:
>> >
>> > http://marc.info/?t=14862406881
>> >
>> > I root caused the above to something not being released when it should
>> > be, so it looks like you have the same problem.  It seems to be a
>> > recent commit in the block tree, so could you bisect it since you have
>> > a nice reproducer?
>>
>> These appear to actually be two separate issues.
>>
>> The unbind followed by bind crash only happens with scsi-mq. It reproes
>> since at least 4.0.
>>
>> The unbind followed by a new device coming up crash happens both with
>> and without scsi-mq. The earliest version I was able to check for that
>> was 4.6, which did reproduce.
>>
>> I'll see if I can get some more info on these two issues separately.
>
> Actually, the second issue is only a warning right? And if I understand the
> issue correctly, it should be fixed by either Dan's patches in linux-block
> or my patch 4 in the series which matches your test results. So that is
> dealt with. I have no idea about the first issue though.

Looks like the 1st one is an old issue in blk-mq, and I have sent a patchset
to address it:

http://marc.info/?l=linux-kernel&m=148775847517071&w=2

Omar, feel free to give a test.

thanks,
Ming Lei


Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

2017-02-22 Thread Jan Kara
On Tue 21-02-17 10:19:28, Jens Axboe wrote:
> On 02/21/2017 10:09 AM, Jan Kara wrote:
> > Hello,
> > 
> > this is a second revision of the patch set to fix several different races 
> > and
> > issues I've found when testing device shutdown and reuse. The first three
> > patches are fixes to problems in my previous series fixing BDI lifetime 
> > issues.
> > Patch 4 fixes issues with reuse of BDI name with scsi devices. With it I 
> > cannot
> > reproduce the BDI name reuse issues using Omar's stress test using 
> > scsi_debug
> > so it can be used as a replacement of Dan's patches. Patches 5-11 fix oops 
> > that
> > is triggered by __blkdev_put() calling inode_detach_wb() too early (the 
> > problem
> > reported by Thiago). Patches 12 and 13 fix oops due to a bug in gendisk code
> > where get_gendisk() can return already freed gendisk structure (again 
> > triggered
> > by Omar's stress test).
> > 
> > People, please have a look at patches. They are mostly simple however the
> > interactions are rather complex so I may have missed something. Also I'm
> > happy for any additional testing these patches can get - I've stressed them
> > with Omar's script, tested memcg writeback, tested static (not udev managed)
> > device inodes.
> > 
> > Jens, I think at least patches 1-3 should go in together with my fixes you
> > already have in your tree (or shortly after them). It is up to you whether
> > you decide to delay my first fixes or pick these up quickly. Patch 4 is
> > (IMHO a cleaner) replacement of Dan's patches so consider whether you want
> > to use it instead of those patches.
> 
> I have applied 1-3 to my for-linus branch, which will go in after
> the initial pull request has been pulled by Linus. Consider fixing up
> #4 so it applies, I like it.

OK, attached is patch 4 rebased on top of Linus' tree from today which
already has linux-block changes pulled in. I've left put_disk_devt() in
blk_cleanup_queue() to maintain the logic in the original patch (now commit
0dba1314d4f8) that request_queue and gendisk each hold one devt reference.
The bdi_unregister() call that is moved to del_gendisk() by this patch is
now protected by the gendisk reference instead of the request_queue one
so it still maintains the property that devt reference protects bdi
registration-unregistration lifetime (as much as that is not needed anymore
after this patch).

I have also updated the comment in the code and the changelog - they were
somewhat stale after changes to the whole series Tejun suggested.

Honza
-- 
Jan Kara 
SUSE Labs, CR
From 9abe9565c83af6b653b159a7bf5b895aff638c65 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Wed, 8 Feb 2017 08:05:56 +0100
Subject: [PATCH] block: Move bdi_unregister() to del_gendisk()

Commit 6cd18e711dd8 "block: destroy bdi before blockdev is
unregistered." moved bdi unregistration (at that time through
bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because
it needs to happen before blk_unregister_region() call in del_gendisk()
for MD. SCSI though will free up the device number from sd_remove()
called through a maze of callbacks from device_del() in
__scsi_remove_device() before blk_cleanup_queue() and thus similar races
as described in 6cd18e711dd8 can happen for SCSI as well as reported by
Omar [1].

Moving bdi_unregister() to del_gendisk() works for MD and fixes the
problem for SCSI since del_gendisk() gets called from sd_remove() before
freeing the device number.

This also makes device_add_disk() (calling bdi_register_owner()) more
symmetric with del_gendisk().

[1] http://marc.info/?l=linux-block&m=148554717109098&w=2

Tested-by: Lekshmi Pillai 
Acked-by: Tejun Heo 
Signed-off-by: Jan Kara 
---
 block/blk-core.c | 1 -
 block/genhd.c| 5 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b9e857f4afe8..1086dac8724c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -578,7 +578,6 @@ void blk_cleanup_queue(struct request_queue *q)
 		q->queue_lock = &q->__queue_lock;
 	spin_unlock_irq(lock);
 
-	bdi_unregister(q->backing_dev_info);
 	put_disk_devt(q->disk_devt);
 
 	/* @q is and will stay empty, shutdown and put */
diff --git a/block/genhd.c b/block/genhd.c
index 2f444b87a5f2..b26a5ea115d0 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -681,6 +681,11 @@ void del_gendisk(struct gendisk *disk)
 	disk->flags &= ~GENHD_FL_UP;
 
 	sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
+	/*
+	 * Unregister bdi before releasing device numbers (as they can get
+	 * reused and we'd get clashes in sysfs).
+	 */
+	bdi_unregister(disk->queue->backing_dev_info);
 	blk_unregister_queue(disk);
 	blk_unregister_region(disk_devt(disk), disk->minors);
 
-- 
2.10.2



Re: [PATCH 06/16] mmc: core: replace waitqueue with worker

2017-02-22 Thread Adrian Hunter
On 09/02/17 17:33, Linus Walleij wrote:
> The waitqueue in the host context is there to signal back from
> mmc_request_done() through mmc_wait_data_done() that the hardware
> is done with a command, and when the wait is over, the core
> will typically submit the next asynchronous request that is pending
> just waiting for the hardware to be available.
> 
> This is in the way for letting the mmc_request_done() trigger the
> report up to the block layer that a block request is finished.
> 
> Re-jig this as a first step, removing the waitqueue and introducing
> a work that will run after a completed asynchronous request,
> finalizing that request, including retransmissions, and eventually
> reporting back with a completion and a status code to the
> asynchronous issue method.
> 
> This has the upside that we can remove the MMC_BLK_NEW_REQUEST
> status code and the "new_request" state in the request queue
> that is only there to make the state machine spin out
> the first time we send a request.
> 
> Introduce a workqueue in the host for handling just this, and
> then a work and completion in the asynchronous request to deal
> with this mechanism.
> 
> This is a central change that lets us do many other changes since
> we have broken the submit and complete code paths in two, and we
> can potentially remove the NULL flushing of the asynchronous
> pipeline and report block requests as finished directly from
> the worker.

This needs more thought.  The completion should go straight to the mmc block
driver from the ->done() callback.  And from there straight back to the
block layer if recovery is not needed.  We want to stop using
mmc_start_areq() altogether because we never want to wait - we always want
to issue (if possible) and return.

The core API to use is __mmc_start_req() but the block driver should
populate mrq->done with its own handler. i.e. change __mmc_start_req()

-   mrq->done = mmc_wait_done;
+   if (!mrq->done)
+   mrq->done = mmc_wait_done;

mrq->done() would complete the request (e.g. via blk_complete_request()) if
it has no errors (and doesn't need polling), and wake up the queue thread to
finish up everything else and start the next request.

For the blk-mq port, the queue thread should also be retained, partly
because it solves some synchronization problems, but mostly because, at this
stage, we anyway don't have solutions for all the different ways the driver
can block.
(as listed here https://marc.info/?l=linux-mmc&m=148336571720463&w=2 )



Re: sense handling improvements

2017-02-22 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

>> I can bring them in after Linus' initial block pull.

Christoph> Both the block and SCSI trees are now merged by Linus, and
Christoph> Jens didn't pick up patch one from this series yet - maybe
Christoph> it's best to send the whole series through the SCSI tree in
Christoph> this case.

Will do.


-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCHv4 4/4] block/sed: Embed function data into the function sequence

2017-02-22 Thread Jon Derrick
On 02/22/2017 12:13 AM, Christoph Hellwig wrote:
>> +if (!lock_held)
>> +mutex_lock(&dev->dev_lock);
> 
> No conditional locking, please.  I guess I caused this by asking you
> to remove __opal_lock_unlock, but it seems we need to keep it
> in the end.
> 
> Except for that the series looks fine to me.
> 
> Jens: given that 1-3 are the important fixes how about you pick those
> up ASAP?  They all also had my Reviewed-by for previous postings.
> 
Thanks

I'll respin just 4/4 shortly with __opal_lock_unlock


Re: blk_integrity_revalidate() clears BDI_CAP_STABLE_WRITES

2017-02-22 Thread Ilya Dryomov
On Wed, Feb 22, 2017 at 5:41 AM, Martin K. Petersen
 wrote:
>> "Ilya" == Ilya Dryomov  writes:
>
> Ilya,
>
> Ilya> could you please explain blk_integrity_revalidate() and
> Ilya> its GENHD_FL_UP check in particular?  We have the queue,
> Ilya> bi->profile can't be NULL after blk_integrity_register(), and
> Ilya> since the latter "must" be used for registering the profile with
> Ilya> the block layer, wouldn't the following be sufficient for
> Ilya> blk_integrity users?
>
> IIrc, the FL_UP check fixed a registration problem in the nvme driver.

Did it have something to do with a NULL disk->queue?

>
> The rationale behind revalidate was that we need to handle devices which
> lose the integrity capability at runtime (i.e. a integrity-enabled DM
> device is extended with a non-cable drive forcing the feature to be
> turned off). The clearing of the integrity profile is more important in
> that case than zapping the stable pages flag. But that was the original
> reason for not just ORing BDI_CAP_STABLE_WRITES.
>
> I don't have a huge problem with keeping stable pages on if a device
> suddenly stops being integrity capable. However, I'd like to understand
> your use case a bit better.

Well, blk_integrity_revalidate() doesn't clear the profile, it just
clears the stable pages flag.  Whoever calls blk_integrity_unregister()
to clear the profile can also clear the stable pages flag -- why not
let blk_integrity_unregister() clear the flag like I suggested?

>
> Ilya> The alternative seems to be to set up a bogus
> Ilya> blk_integrity_profile (nop_profile won't do -- this one would have
> Ilya> to be truly bogus w/ NULL *_fn) under BLK_DEV_INTEGRITY ifdefs and
> Ilya> hope that nothing breaks.
>
> Can you point me to the relevant code on your end?

This code doesn't exist yet -- it's just something I thought I'd have
to do if my patch or something along those lines isn't accepted.  There
is the nop_profile with stub *_fn callbacks, which is used by nvme and
nvdimm to make bio_integrity_enabled() return true, so that per-interval
buffers are allocated in bio_integrity_prep().  "I want some space for
per-interval metadata, but no integrity checksums" is use case there.

In the rbd case, we don't want that.  We just want to set the stable
pages flag and make sure it's not reset by blk_integrity code.  So, if
blk_integrity_revalidate() isn't fixed, rbd will need to set up a bogus
profile with NULL *_fn callbacks to make bio_integrity_enabled() return
false.  This is obviously fragile, because blk_get_integrity() will no
longer return NULL...

We are setting BDI_CAP_STABLE_WRITES in rbd_init_disk(), see commit
bae818ee1577 ("rbd: require stable pages if message data CRCs are
enabled").  The CRCs are generated in write_partial_message_data() and
verified in read_partial_msg_data().

I'm attaching the patch I had in mind.

Thanks,

Ilya
From 24a0cf8bc437a65b3986e9ab25cf2f05e6bbd5d0 Mon Sep 17 00:00:00 2001
From: Ilya Dryomov 
Date: Mon, 20 Feb 2017 18:50:50 +0100
Subject: [PATCH] block: get rid of blk_integrity_revalidate()

Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
introduced blk_integrity_revalidate(), which clears the stable pages
bit if no blk_integrity profile is registered.  It's called from
revalidate_disk() and rescan_partitions(), making it impossible to set
BDI_CAP_STABLE_WRITES for drivers that support partitions and don't use
blk_integrity.  One such driver is rbd, where the ceph messenger is
responsible for generating/verifying CRCs.

Since blk_integrity_{un,}register() "must" be used for (un)registering
the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
setting there.  This way drivers that call blk_integrity_register() and
use integrity infrastructure won't interfere with drivers that don't
but still want stable pages.

Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
Signed-off-by: Ilya Dryomov 
---
 block/blk-integrity.c | 19 ++-
 block/partition-generic.c |  1 -
 fs/block_dev.c|  1 -
 include/linux/genhd.h |  2 --
 4 files changed, 2 insertions(+), 21 deletions(-)

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index d69c5c79f98e..319f2e4f4a8b 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -417,7 +417,7 @@ void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template
 	bi->tuple_size = template->tuple_size;
 	bi->tag_size = template->tag_size;
 
-	blk_integrity_revalidate(disk);
+	disk->queue->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;
 }
 EXPORT_SYMBOL(blk_integrity_register);
 
@@ -430,26 +430,11 @@ EXPORT_SYMBOL(blk_integrity_register);
  */
 void blk_integrity_unregister(struct gendisk *disk)
 {
-	blk_integrity_revalidate(disk);
+	disk->queue->backing_dev_info.capabilities &= ~BDI_CAP_STABLE_WRITES;
 	memset(&disk->queue->integrity, 0, sizeof(struct blk_integrity));
 }
 EXPORT_SYMBOL(blk_integrity_unregister);
 
-vo

[PATCH] block/sed: Embed function data into the function sequence

2017-02-22 Thread Jon Derrick
By embedding the function data with the function sequence, we can
eliminate the external function data and the state variable code. It also
made some other small cleanups obvious.

Signed-off-by: Jon Derrick 
---
This is 4/4 of [1]:
Reverted opal_lock_unlock conditional locking and reintroduced
__opal_lock_unlock

[1]
http://lists.infradead.org/pipermail/linux-nvme/2017-February/008380.html

 block/sed-opal.c | 418 ++-
 1 file changed, 163 insertions(+), 255 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 4fc4d7b..8935575 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -34,7 +34,11 @@
 #define IO_BUFFER_LENGTH 2048
 #define MAX_TOKS 64
 
-typedef int (*opal_step)(struct opal_dev *dev);
+struct opal_step {
+   int (*fn)(struct opal_dev *dev, void *data);
+   void *data;
+};
+typedef int (cont_fn)(struct opal_dev *dev);
 
 enum opal_atom_width {
OPAL_WIDTH_TINY,
@@ -80,9 +84,7 @@ struct opal_dev {
void *data;
sec_send_recv *send_recv;
 
-   const opal_step *funcs;
-   void **func_data;
-   int state;
+   const struct opal_step *steps;
struct mutex dev_lock;
u16 comid;
u32 hsn;
@@ -213,8 +215,6 @@ struct opal_dev {
{ 0x00, 0x00, 0x00, 0x06, 0x00, 0x00, 0x08, 0x03 },
 };
 
-typedef int (cont_fn)(struct opal_dev *dev);
-
 static int end_opal_session_error(struct opal_dev *dev);
 
 struct opal_suspend_data {
@@ -375,18 +375,18 @@ static void check_geometry(struct opal_dev *dev, const 
void *data)
 
 static int next(struct opal_dev *dev)
 {
-   opal_step func;
-   int error = 0;
+   const struct opal_step *step;
+   int state = 0, error = 0;
 
do {
-   func = dev->funcs[dev->state];
-   if (!func)
+   step = &dev->steps[state];
+   if (!step->fn)
break;
 
-   error = func(dev);
+   error = step->fn(dev, step->data);
if (error) {
pr_err("Error on step function: %d with error %d: %s\n",
-  dev->state, error,
+  state, error,
   opal_error_to_human(error));
 
/* For each OPAL command we do a discovery0 then we
@@ -396,10 +396,10 @@ static int next(struct opal_dev *dev)
 * session. Therefore we shouldn't attempt to terminate
 * a session, as one has not yet been created.
 */
-   if (dev->state > 1)
+   if (state > 1)
return end_opal_session_error(dev);
}
-   dev->state++;
+   state++;
} while (!error);
 
return error;
@@ -483,7 +483,7 @@ static int opal_discovery0_end(struct opal_dev *dev)
return 0;
 }
 
-static int opal_discovery0(struct opal_dev *dev)
+static int opal_discovery0(struct opal_dev *dev, void *data)
 {
int ret;
 
@@ -1018,7 +1018,7 @@ static int finalize_and_send(struct opal_dev *dev, 
cont_fn cont)
return opal_send_recv(dev, cont);
 }
 
-static int gen_key(struct opal_dev *dev)
+static int gen_key(struct opal_dev *dev, void *data)
 {
const u8 *method;
u8 uid[OPAL_UID_LENGTH];
@@ -1072,15 +1072,14 @@ static int get_active_key_cont(struct opal_dev *dev)
return 0;
 }
 
-static int get_active_key(struct opal_dev *dev)
+static int get_active_key(struct opal_dev *dev, void *data)
 {
u8 uid[OPAL_UID_LENGTH];
int err = 0;
-   u8 *lr;
+   u8 *lr = data;
 
clear_opal_cmd(dev);
set_comid(dev, dev->comid);
-   lr = dev->func_data[dev->state];
 
err = build_locking_range(uid, sizeof(uid), *lr);
if (err)
@@ -1163,17 +1162,16 @@ static inline int enable_global_lr(struct opal_dev *dev, u8 *uid,
return err;
 }
 
-static int setup_locking_range(struct opal_dev *dev)
+static int setup_locking_range(struct opal_dev *dev, void *data)
 {
u8 uid[OPAL_UID_LENGTH];
-   struct opal_user_lr_setup *setup;
+   struct opal_user_lr_setup *setup = data;
u8 lr;
int err = 0;
 
clear_opal_cmd(dev);
set_comid(dev, dev->comid);
 
-   setup = dev->func_data[dev->state];
lr = setup->session.opal_key.lr;
err = build_locking_range(uid, sizeof(uid), lr);
if (err)
@@ -1286,20 +1284,19 @@ static int start_generic_opal_session(struct opal_dev *dev,
return finalize_and_send(dev, start_opal_session_cont);
 }
 
-static int start_anybodyASP_opal_session(struct opal_dev *dev)
+static int start_anybodyASP_opal_session(struct opal_dev *dev, void *data)
 {
return start_generic_opal_session(dev, OPAL_ANYBODY_UID,
  OPAL_ADMINSP_UID, NULL, 0);
 }
 
-static int start_SIDASP_opal_session(str

Re: [PATCHv4 4/4] block/sed: Embed function data into the function sequence

2017-02-22 Thread Jens Axboe
On 02/22/2017 12:13 AM, Christoph Hellwig wrote:
>> +if (!lock_held)
>> +mutex_lock(&dev->dev_lock);
> 
> No conditional locking, please.  I guess I caused this by asking you
> to remove __opal_lock_unlock, but it seems we'd need to keep it
> in the end.
> 
> Except for that the series looks fine to me.
> 
> Jens: given that 1-3 are the important fixes how about you pick those
> up ASAP?  They all also had my Reviewed-by for previous postings.

I picked up 1-3, and re-added your reviewed by. #4 should be sorted
before -rc1, though.

-- 
Jens Axboe



Re: [PATCHv4 4/4] block/sed: Embed function data into the function sequence

2017-02-22 Thread Scott Bauer
On Wed, Feb 22, 2017 at 09:10:31AM -0700, Jens Axboe wrote:
> On 02/22/2017 12:13 AM, Christoph Hellwig wrote:
> >> +  if (!lock_held)
> >> +  mutex_lock(&dev->dev_lock);
> > 
> > No conditional locking, please.  I guess I caused this by asking you
> > to remove __opal_lock_unlock, but it seems we'd need to keep it
> > in the end.
> > 
> > Except for that the series looks fine to me.
> > 
> > Jens: given that 1-3 are the important fixes how about you pick those
> > up ASAP?  They all also had my Reviewed-by for previous postings.
> 
> I picked up 1-3, and re-added your reviewed by. #4 should be sorted
> before -rc1, though.
> 

#4 is good to go as well. It was resent this morning as
"[PATCH] block/sed: Embed function data into the function sequence"
and contains the changes Christoph requested; I'll re-add my sign-off.
Once that gets in, I can rebase mine and get them out today too.


Re: [PATCH] block/sed: Embed function data into the function sequence

2017-02-22 Thread Scott Bauer
On Wed, Feb 22, 2017 at 07:55:13AM -0700, Jon Derrick wrote:
> By embedding the function data with the function sequence, we can
> eliminate the external function data and state variable code. It also
> made obvious some other small cleanups.
> 
> Signed-off-by: Jon Derrick 
Reviewed-by: Scott Bauer 

Christoph's review is here (I don't know if it's kosher for me to put it in my
message):
http://lists.infradead.org/pipermail/linux-nvme/2017-February/008405.html


Re: [PATCHv4 4/4] block/sed: Embed function data into the function sequence

2017-02-22 Thread Jens Axboe
On 02/22/2017 09:13 AM, Scott Bauer wrote:
> On Wed, Feb 22, 2017 at 09:10:31AM -0700, Jens Axboe wrote:
>> On 02/22/2017 12:13 AM, Christoph Hellwig wrote:
 +  if (!lock_held)
 +  mutex_lock(&dev->dev_lock);
>>>
>>> No conditional locking, please.  I guess I caused this by asking you
>>> to remove __opal_lock_unlock, but it seems we'd need to keep it
>>> in the end.
>>>
>>> Except for that the series looks fine to me.
>>>
>>> Jens: given that 1-3 are the important fixes how about you pick those
>>> up ASAP?  They all also had my Reviewed-by for previous postings.
>>
>> I picked up 1-3, and re-added your reviewed by. #4 should be sorted
>> before -rc1, though.
>>
> 
> #4 is good to go as well. It was resent this morning as
> "[PATCH] block/sed: Embed function data into the function sequence"
> and contains the changes Christoph requested; I'll re-add my sign-off.
> Once that gets in, I can rebase mine and get them out today too.

I see, I found it now. Guys, let's get this process streamlined a bit
more. This whole thing has been a flurry of patches and patch series,
posted by either you or Jon. Previous patch series was 1-4 patches
posted by you, and then patch #4 is replaced by a single patch from Jon,
posted outside of that thread. Honestly, I feel like this should have
been pushed to 4.12 instead, it clearly wasn't ready before the merge
window.

-- 
Jens Axboe



Re: [PATCHv4 4/4] block/sed: Embed function data into the function sequence

2017-02-22 Thread Keith Busch
On Wed, Feb 22, 2017 at 09:47:53AM -0700, Jens Axboe wrote:
> I see, I found it now. Guys, let's get this process streamlined a bit
> more. This whole thing has been a flurry of patches and patch series,
> posted by either you or Jon. Previous patch series was 1-4 patches
> posted by you, and then patch #4 is replaced by a single patch from Jon,
> posted outside of that thread. Honestly, I feel like this should have
> been pushed to 4.12 instead, it clearly wasn't ready before the merge
> window.

Sorry, I'll run interference to help ensure future patch series are
complete and coherent in one thread.


[PATCH 3/3] block/sed-opal: Propagate original error message to userland.

2017-02-22 Thread Scott Bauer
When a command fails, e.g. the user provides the wrong password to unlock
a range, we gracefully terminate the opal session. We want to propagate
the original error to userland instead of the result of the session
termination, which almost always succeeds.

Signed-off-by: Scott Bauer 
---
 block/sed-opal.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 020bf3e..1e18dca 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -396,8 +396,11 @@ static int next(struct opal_dev *dev)
 * session. Therefore we shouldn't attempt to terminate
 * a session, as one has not yet been created.
 */
-   if (state > 1)
-   return end_opal_session_error(dev);
+   if (state > 1) {
+   end_opal_session_error(dev);
+   return error;
+   }
+
}
state++;
} while (!error);
-- 
2.7.4



Final opal patches for rc1

2017-02-22 Thread Scott Bauer
These are the final changes for rc1.

Patch 1 contains some code to clean up an opal structure if something changes
at runtime. (already reviewed by Christoph)

Patch 2 contains changes to the nvme driver to use the above patch. If a new
FW is loaded and we no longer support opal we clean up. This patch is based off
code from Keith and Christoph.

Patch 3 is a one-line change to propagate the original error message to
userland instead of the return status of the session termination code.



[PATCH 1/3] block/sed-opal: Introduce free_opal_dev to free the structure and clean up state

2017-02-22 Thread Scott Bauer
Before we free the opal structure we need to clean up any saved
locking ranges that the user had told us to unlock from a suspend.

Signed-off-by: Scott Bauer 
Reviewed-by: Christoph Hellwig 
---
 block/sed-opal.c | 30 ++
 include/linux/sed-opal.h |  5 +
 2 files changed, 35 insertions(+)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 8935575..020bf3e 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -1987,6 +1987,28 @@ static int check_opal_support(struct opal_dev *dev)
return ret;
 }
 
+static void clean_opal_dev(struct opal_dev *dev)
+{
+
+   struct opal_suspend_data *suspend, *next;
+
+   mutex_lock(&dev->dev_lock);
+   list_for_each_entry_safe(suspend, next, &dev->unlk_lst, node) {
+   list_del(&suspend->node);
+   kfree(suspend);
+   }
+   mutex_unlock(&dev->dev_lock);
+}
+
+void free_opal_dev(struct opal_dev *dev)
+{
+   if (!dev)
+   return;
+   clean_opal_dev(dev);
+   kfree(dev);
+}
+EXPORT_SYMBOL(free_opal_dev);
+
 struct opal_dev *init_opal_dev(void *data, sec_send_recv *send_recv)
 {
struct opal_dev *dev;
@@ -2141,6 +2163,14 @@ static int opal_reverttper(struct opal_dev *dev, struct opal_key *opal)
setup_opal_dev(dev, revert_steps);
ret = next(dev);
mutex_unlock(&dev->dev_lock);
+
+   /*
+* If we successfully reverted, let's clean up
+* any saved locking ranges.
+*/
+   if (!ret)
+   clean_opal_dev(dev);
+
return ret;
 }
 
diff --git a/include/linux/sed-opal.h b/include/linux/sed-opal.h
index deee23d..04b124f 100644
--- a/include/linux/sed-opal.h
+++ b/include/linux/sed-opal.h
@@ -27,6 +27,7 @@ typedef int (sec_send_recv)(void *data, u16 spsp, u8 secp, void *buffer,
size_t len, bool send);
 
 #ifdef CONFIG_BLK_SED_OPAL
+void free_opal_dev(struct opal_dev *dev);
 bool opal_unlock_from_suspend(struct opal_dev *dev);
 struct opal_dev *init_opal_dev(void *data, sec_send_recv *send_recv);
 int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *ioctl_ptr);
@@ -51,6 +52,10 @@ static inline bool is_sed_ioctl(unsigned int cmd)
return false;
 }
 #else
+static inline void free_opal_dev(struct opal_dev *dev)
+{
+}
+
 static inline bool is_sed_ioctl(unsigned int cmd)
 {
return false;
-- 
2.7.4



[PATCH 2/3] nvme/pci: re-check security protocol support after reset

2017-02-22 Thread Scott Bauer
A device may change capabilities after each reset, e.g. due to a firmware
upgrade.  We should thus check for Security Send/Receive and OPAL support
after each reset.

Based on patches from Christoph and Keith.

Signed-off-by: Scott Bauer 
---
 drivers/nvme/host/pci.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ddc51ad..f660fc2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1739,7 +1739,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
if (dev->ctrl.admin_q)
blk_put_queue(dev->ctrl.admin_q);
kfree(dev->queues);
-   kfree(dev->ctrl.opal_dev);
+   free_opal_dev(dev->ctrl.opal_dev);
kfree(dev);
 }
 
@@ -1789,14 +1789,17 @@ static void nvme_reset_work(struct work_struct *work)
if (result)
goto out;
 
-   if ((dev->ctrl.oacs & NVME_CTRL_OACS_SEC_SUPP) && !dev->ctrl.opal_dev) {
-   dev->ctrl.opal_dev =
-   init_opal_dev(&dev->ctrl, &nvme_sec_submit);
+   if (dev->ctrl.oacs & NVME_CTRL_OACS_SEC_SUPP) {
+   if (!dev->ctrl.opal_dev)
+   dev->ctrl.opal_dev =
+   init_opal_dev(&dev->ctrl, &nvme_sec_submit);
+   else if (was_suspend)
+   opal_unlock_from_suspend(dev->ctrl.opal_dev);
+   } else {
+   free_opal_dev(dev->ctrl.opal_dev);
+   dev->ctrl.opal_dev = NULL;
}
 
-   if (was_suspend)
-   opal_unlock_from_suspend(dev->ctrl.opal_dev);
-
result = nvme_setup_io_queues(dev);
if (result)
goto out;
-- 
2.7.4



Re: [PATCH 04/13] block: Move bdi_unregister() to del_gendisk()

2017-02-22 Thread Bart Van Assche
On 02/22/2017 01:20 AM, Jan Kara wrote:
> On Tue 21-02-17 19:53:29, Bart Van Assche wrote:
>> This change looks suspicious to me. There are drivers that create a
>> block layer queue but neither call device_add_disk() nor del_gendisk(),
>> e.g. drivers/scsi/st.c. Although bdi_init() will be called for the
>> queues created by these drivers, this patch will cause the
>> bdi_unregister() call to be skipped for these drivers.
> 
> Well, the thing is that bdi_unregister() is the counterpart to
> bdi_register(). Unless you call bdi_register(), which happens only in
> device_add_disk() (and some filesystems which create their private bdis),
> there's no point in calling bdi_unregister(). Counterpart to bdi_init() is
> bdi_exit() and that gets called always once bdi reference count drops to 0.

That makes sense to me. Hence:

Reviewed-by: Bart Van Assche 



Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Jens Axboe
On 02/21/2017 04:23 PM, Linus Torvalds wrote:
> On Tue, Feb 21, 2017 at 3:15 PM, Jens Axboe  wrote:
>>
>> But under a device managed by blk-mq, that device exposes a number of
>> hardware queues. For older style devices, that number is typically 1
>> (single queue).
> 
> ... but why would this ever be different from the normal IO scheduler?

Because we have a different set of schedulers for blk-mq, different
than the legacy path. mq-deadline is a basic port that will work
fine with rotational storage, but it's not going to be a good choice
for NVMe because of scalability issues.

We'll have BFQ on the blk-mq side, catering to the needs of those
folks that currently rely on the richer feature set that CFQ supports.

We've continually been working towards getting rid of the legacy
IO path, and its set of schedulers. So if it's any consolation,
those options will go away in the future.

> IOW, what makes single-queue mq scheduling so special that
> 
>  (a) it needs its own config option
> 
>  (b) it is different from just the regular IO scheduler in the first place?
> 
> So the whole thing stinks. The fact that it then has an
> incomprehensible config option seems to be just gravy on top of the
> crap.

What do you mean by "the regular IO scheduler"? These are different
schedulers.

As explained above, single-queue mq devices generally DO want mq-deadline.
multi-queue mq devices, we don't have a good choice for them right now,
so we retain the current behavior (that we've had since blk-mq was
introduced in 3.13) of NOT doing any IO scheduling for them. If you
do want scheduling for them, set the option, or configure udev to
make the right choice for you.

I agree the wording isn't great, and we can improve that. But I do
think that the current choices make sense.

>> "none" just means that we don't have a scheduler attached.
> 
> .. which makes no sense to me in the first place.
> 
> People used to try to convince us that doing IO schedulers was a
> mistake, because modern disk hardware did a better job than we could
> do in software.
> 
> Those people were full of crap. The regular IO scheduler used to have
> a "NONE" option too. Maybe it even still has one, but only insane
> people actually use it.
> 
> Why is the MQ stuff magically so different that NONE would make sense at all?

I was never one of those people, and I've always been a strong advocate
for imposing scheduling to keep devices in check. The regular IO scheduler
pool includes "noop", which is probably the one you are thinking of. That
one is a bit different than the new "none" option for blk-mq, in that it
does do insertion sorts and it does do merges. "none" does some merging,
but only where it happens to make sense. There's no insertion sorting.

> And equally importantly: why do we _ask_ people these issues? Is this
> some kind of sick "cover your ass" thing, where you can say "well, I
> asked about it", when inevitably the choice ends up being the wrong
> one?
> 
> We have too damn many Kconfig options as-is, I'm trying to push back
> on them. These two options seem fundamentally broken and stupid.
> 
> The "we have no good idea, so let's add a Kconfig option" seems like a
> broken excuse for these things existing.
> 
> So why ask this question in the first place?
> 
> Is there any possible reason why "NONE" is a good option at all? And
> if it is the _only_ option (because no other better choice exists), it
> damn well shouldn't be a kconfig option!

I'm all for NOT asking questions, and not providing tunables. That's
generally how I do write code. See the blk-wbt stuff, for instance, that
basically just has one tunable that's set sanely by default, and we
figure out the rest.

I don't want to regress performance of blk-mq devices by attaching
mq-deadline to them. When we do have a sane scheduler choice, we'll
make that the default. And yes, maybe we can remove the Kconfig option
at that point.

For single queue devices, we could kill the option. But we're expecting
bfq-mq for 4.12, and we'll want to have the option at that point unless
you want to rely solely on runtime setting of the scheduler through
udev or by the sysadmin.

-- 
Jens Axboe



Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 10:14 AM, Jens Axboe  wrote:
>
> What do you mean by "the regular IO scheduler"? These are different
> schedulers.

Not to the user they aren't.

If the user already answered once about the IO schedulers, we damn
well shouldn't ask again about another small implementation detail.

How hard is this to understand? You're asking users stupid things.

It's not just about the wording. It's a fundamental issue.  These
questions are about internal implementation details. They make no
sense to a user. They don't even make sense to a kernel developer, for
chrissake!

Don't make the kconfig mess worse. This "we can't make good defaults
in the kernel, so ask users about random things that they cannot
possibly answer" model is not an acceptable model.

If the new schedulers aren't better than NOOP, they shouldn't exist.
And if you want people to be able to test, they should be dynamic.

And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER?

It's really that simple.

 Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Jens Axboe
On 02/22/2017 11:26 AM, Linus Torvalds wrote:
> On Wed, Feb 22, 2017 at 10:14 AM, Jens Axboe  wrote:
>>
>> What do you mean by "the regular IO scheduler"? These are different
>> schedulers.
> 
> Not to the user they aren't.
> 
> If the user already answered once about the IO schedulers, we damn
> well shouldn't ask again about another small implementation detail.
> 
> How hard is this to understand? You're asking users stupid things.

The fact is that we have two different sets, until we can yank
the old ones. So I can't just ask one question, since the sets
aren't identical.

This IS confusing to the user, and it's an artifact of the situation
that we have where we are phasing out the old IO path and switching
to blk-mq. I don't want the user to know about blk-mq, I just want
it to be what everything runs on. But until that happens, and it is
happening, we are going to be stuck with that situation.

We have this exposed in other places, too. Like for dm, and for
SCSI. Not a perfect situation, but something that WILL go away
eventually.

> It's not just about the wording. It's a fundamental issue.  These
> questions are about internal implementation details. They make no
> sense to a user. They don't even make sense to a kernel developer, for
> chrissake!
> 
> Don't make the kconfig mess worse. This "we can't make good defaults
> in the kernel, so ask users about random things that they cannot
> possibly answer" model is not an acceptable model.

There are good defaults! mq single-queue should default to mq-deadline,
and mq multi-queue should default to "none" for now. If you feel that
strongly about it (and I'm guessing you do, judging by the speed
typing and generally annoyed demeanor), then by all means, let's kill
the config entries and I'll just hardwire the defaults. The config
entries were implemented similarly to the old schedulers, and each
scheduler is selectable individually. I'd greatly prefer just
improving the wording so it makes more sense.

> If the new schedulers aren't better than NOOP, they shouldn't exist.
> And if you want people to be able to test, they should be dynamic.

They are dynamic! You can build them as modules, you can switch at
runtime. Just like we have always been able to. I can't make it more
dynamic than that. We're reusing the same internal infrastructure for
that, AND the user visible ABI for checking what is available, and
setting a new one.

> And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER?

BECAUSE IT'S POLICY! Fact of the matter is, if I just defaulted to what
we had before, it'd all be running with none. In a few years' time, if
I'm lucky, someone will have shipped udev rules setting this appropriately.
If I ask the question, we'll get testing NOW. People will run with
the default set.

-- 
Jens Axboe



Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 10:26 AM, Linus Torvalds
 wrote:
>
> And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR USER?

Basically, I'm pushing back on config options that I can't personally
even sanely answer.

If it's a config option about "do I have a particular piece of
hardware", it makes sense. But these new ones were just complete
garbage.

The whole "default IO scheduler" thing is a disease. We should stop
making up these shit schedulers and then say "we don't know which one
works best for you".

All it does is encourage developers to make shortcuts and create crap
that isn't generically useful, and then blame the user and say "well,
you should have picked a different scheduler" when they say "this does
not work well for me".

We have had too many of those kinds of broken choices.  And when the
new Kconfig options get so confusing and so esoteric that I go "Hmm, I
have no idea if my hardware does a single queue or not", I put my foot
down.

When the IO scheduler questions were about a generic IO scheduler for
everything, I can kind of understand them. I think it was still a
mistake (for the reasons outlined above), but at least it was a
comprehensible question to ask.

But when it gets to "what should I do about a single-queue version of
a MQ scheduler", the question is no longer even remotely sensible. The
question should simply NOT EXIST. There is no possible valid reason to
ask that kind of crap.

   Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Jens Axboe
On 02/22/2017 11:42 AM, Linus Torvalds wrote:
> On Wed, Feb 22, 2017 at 10:26 AM, Linus Torvalds
>  wrote:
>>
>> And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR 
>> USER?
> 
> Basically, I'm pushing back on config options that I can't personally
> even sanely answer.

I got that much, and I don't disagree on that part.

> If it's a config option about "do I have a particular piece of
> hardware", it makes sense. But these new ones were just complete
> garbage.
> 
> The whole "default IO scheduler" thing is a disease. We should stop
> making up these shit schedulers and then say "we don't know which one
> works best for you".
> 
> All it does is encourage developers to make shortcuts and create crap
> that isn't generically useful, and then blame the user and say "well,
> you should have picked a different scheduler" when they say "this does
> not work well for me".
> 
> We have had too many of those kinds of broken choices.  And when the
> new Kconfig options get so confusing and so esoteric that I go "Hmm, I
> have no idea if my hardware does a single queue or not", I put my foot
> down.
> 
> When the IO scheduler questions were about a generic IO scheduler for
> everything, I can kind of understand them. I think it was still a
> mistake (for the reasons outlined above), but at least it was a
> comprehensible question to ask.
> 
> But when it gets to "what should I do about a single-queue version of
> a MQ scheduler", the question is no longer even remotely sensible. The
> question should simply NOT EXIST. There is no possible valid reason to
> ask that kind of crap.

OK, so here's what I'll do:

1) We'll kill the default scheduler choices. sq blk-mq will default to
   mq-deadline, mq blk-mq will default to "none" (at least for now, until
   the new scheduler is done).
2) The individual schedulers will be y/m/n selectable, just like any
   other driver.

I hope that works for everyone.

-- 
Jens Axboe



Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Jens Axboe
On 02/22/2017 11:45 AM, Linus Torvalds wrote:
> On Wed, Feb 22, 2017 at 10:41 AM, Jens Axboe  wrote:
>>
>> The fact is that we have two different sets, until we can yank
>> the old ones. So I can't just ask one question, since the sets
>> aren't identical.
> 
> Bullshit.
> 
> I'm saying: rip out the question ENTIRELY. For *both* cases.
> 
> If you cannot yourself give a good answer, then there's no f*cking way
> any user can give a good answer. So asking the question is totally and
> utterly pointless.
> 
> All it means is that different people will try different (in random
> ways) configurations, and the end result is random crap.
> 
> So get rid of those questions. Pick a default, and live with it. And
> if people complain about performance, fix the performance issue.
> 
> It's that simple.

No, it's not that simple at all. Fact is, some optimizations make sense
for some workloads, and some do not. CFQ works great for some cases, and
it works poorly for others, even if we try to make heuristics that enable
it to work well for all cases. Some optimizations are costly, that's
fine on certain types of hardware or maybe that's a trade off you want
to make. Or we end up with tons of settings for a single driver, which
does not reduce the configuration matrix at all.

By that logic, why do we have ANY config options outside of what drivers
to build? What should I set HZ at? RCU options? Let's default to ext4,
and kill off xfs? Or btrfs? slab/slob/slub/whatever?

Yes, that's taking the argument a bit more to the extreme, but it's the
same damn thing.

I'm fine with getting rid of the default selections, but we're NOT
going to be able to have just one scheduler for everything. We can
make sane defaults based on the hardware type.

-- 
Jens Axboe



Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 10:41 AM, Jens Axboe  wrote:
>
> The fact is that we have two different sets, until we can yank
> the old ones. So I can't just ask one question, since the sets
> aren't identical.

Bullshit.

I'm saying: rip out the question ENTIRELY. For *both* cases.

If you cannot yourself give a good answer, then there's no f*cking way
any user can give a good answer. So asking the question is totally and
utterly pointless.

All it means is that different people will try different (in random
ways) configurations, and the end result is random crap.

So get rid of those questions. Pick a default, and live with it. And
if people complain about performance, fix the performance issue.

It's that simple.

Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 10:52 AM, Jens Axboe  wrote:
>>
>> It's that simple.
>
> No, it's not that simple at all. Fact is, some optimizations make sense
> for some workloads, and some do not.

Are you even listening?

I'm saying no user can ever give a sane answer to your question. The
question is insane and wrong.

I already said you can have a dynamic configuration (and maybe even an
automatic heuristic - like saying that a ramdisk gets NOOP by default,
real hardware does not).

But asking a user at kernel config time for a default is insane. If
*you* cannot answer it, then the user sure as hell cannot.

Other configuration questions have problems too, but at least the
question about "should I support ext4" is something a user (or distro)
can sanely answer. So your comparisons are pure bullshit.

 Linus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Jens Axboe
On 02/22/2017 11:56 AM, Linus Torvalds wrote:
> On Wed, Feb 22, 2017 at 10:52 AM, Jens Axboe  wrote:
>>>
>>> It's that simple.
>>
>> No, it's not that simple at all. Fact is, some optimizations make sense
>> for some workloads, and some do not.
> 
> Are you even listening?
> 
> I'm saying no user can ever give a sane answer to your question. The
> question is insane and wrong.
> 
> I already said you can have a dynamic configuration (and maybe even an
> automatic heuristic - like saying that a ramdisk gets NOOP by default,
> real hardware does not).
> 
> But asking a user at kernel config time for a default is insane. If
> *you* cannot answer it, then the user sure as hell cannot.
> 
> Other configuration questions have problems too, but at least the
> question about "should I support ext4" is something a user (or distro)
> can sanely answer. So your comparisons are pure bullshit.

As per the previous email, this was my proposed solution:

OK, so here's what I'll do:

1) We'll kill the default scheduler choices. sq blk-mq will default to
   mq-deadline, mq blk-mq will default to "none" (at least for now, until
   the new scheduler is done).
2) The individual schedulers will be y/m/n selectable, just like any
   other driver.

Any further settings on that can be done at runtime, through sysfs.

-- 
Jens Axboe



[PATCH v3 1/2] blk-mq: use sbq wait queues instead of restart for driver tags

2017-02-22 Thread Omar Sandoval
From: Omar Sandoval 

Commit 50e1dab86aa2 ("blk-mq-sched: fix starvation for multiple hardware
queues and shared tags") fixed one starvation issue for shared tags.
However, we can still get into a situation where we fail to allocate a
tag because all tags are allocated but we don't have any pending
requests on any hardware queue.

One solution for this would be to restart all queues that share a tag
map, but that really sucks. Ideally, we could just block and wait for a
tag, but that isn't always possible from blk_mq_dispatch_rq_list().

However, we can still use the struct sbitmap_queue wait queues with a
custom callback instead of blocking. This has a few benefits:

1. It avoids iterating over all hardware queues when completing an I/O,
   which the current restart code has to do.
2. It benefits from the existing rolling wakeup code.
3. It avoids punting to another thread just to have it block.

Signed-off-by: Omar Sandoval 
Signed-off-by: Jens Axboe 
---
Changed from v2:

- Allow the hardware queue to be run while we're waiting for a tag. This
  fixes a hang observed when running xfs/297. We still avoid busy
  looping by moving the same check into blk_mq_dispatch_rq_list().

 block/blk-mq.c | 64 +++---
 include/linux/blk-mq.h |  2 ++
 2 files changed, 57 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b29e7dc7b309..9e6b064e5339 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -904,6 +904,44 @@ static bool reorder_tags_to_front(struct list_head *list)
return first != NULL;
 }
 
+static int blk_mq_dispatch_wake(wait_queue_t *wait, unsigned mode, int flags,
+   void *key)
+{
+   struct blk_mq_hw_ctx *hctx;
+
+   hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
+
+   list_del(&wait->task_list);
+   clear_bit_unlock(BLK_MQ_S_TAG_WAITING, &hctx->state);
+   blk_mq_run_hw_queue(hctx, true);
+   return 1;
+}
+
+static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
+{
+   struct sbq_wait_state *ws;
+
+   /*
+* The TAG_WAITING bit serves as a lock protecting hctx->dispatch_wait.
+* The thread which wins the race to grab this bit adds the hardware
+* queue to the wait queue.
+*/
+   if (test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) ||
+   test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, &hctx->state))
+   return false;
+
+   init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
+   ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx);
+
+   /*
+* As soon as this returns, it's no longer safe to fiddle with
+* hctx->dispatch_wait, since a completion can wake up the wait queue
+* and unlock the bit.
+*/
+   add_wait_queue(&ws->wait, &hctx->dispatch_wait);
+   return true;
+}
+
bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 {
struct request_queue *q = hctx->queue;
@@ -931,15 +969,22 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
continue;
 
/*
-* We failed getting a driver tag. Mark the queue(s)
-* as needing a restart. Retry getting a tag again,
-* in case the needed IO completed right before we
-* marked the queue as needing a restart.
+* The initial allocation attempt failed, so we need to
+* rerun the hardware queue when a tag is freed.
 */
-   blk_mq_sched_mark_restart(hctx);
-   if (!blk_mq_get_driver_tag(rq, &hctx, false))
+   if (blk_mq_dispatch_wait_add(hctx)) {
+   /*
+* It's possible that a tag was freed in the
+* window between the allocation failure and
+* adding the hardware queue to the wait queue.
+*/
+   if (!blk_mq_get_driver_tag(rq, &hctx, false))
+   break;
+   } else {
break;
+   }
}
+
list_del_init(&rq->queuelist);
 
bd.rq = rq;
@@ -995,10 +1040,11 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 *
 * blk_mq_run_hw_queue() already checks the STOPPED bit
 *
-* If RESTART is set, then let completion restart the queue
-* instead of potentially looping here.
+* If RESTART or TAG_WAITING is set, then let completion restart
+* the queue instead of potentially looping here.

[PATCH v3 2/2] blk-mq-sched: separate mark hctx and queue restart operations

2017-02-22 Thread Omar Sandoval
From: Omar Sandoval 

In blk_mq_sched_dispatch_requests(), we call blk_mq_sched_mark_restart()
after we dispatch requests left over on our hardware queue dispatch
list. This is so we'll go back and dispatch requests from the scheduler.
In this case, it's only necessary to restart the hardware queue that we
are running; there's no reason to run other hardware queues just because
we are using shared tags.

So, split out blk_mq_sched_mark_restart() into two operations, one for
just the hardware queue and one for the whole request queue. The core
code only needs the hctx variant, but I/O schedulers will want to use
both.

This also requires adjusting blk_mq_sched_restart_queues() to always
check the queue restart flag, not just when using shared tags.

Signed-off-by: Omar Sandoval 
Signed-off-by: Jens Axboe 
---
 block/blk-mq-sched.c | 20 
 block/blk-mq-sched.h | 26 ++
 2 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 9e8d6795a8c1..16df0a5e7046 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -205,7 +205,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 * needing a restart in that case.
 */
if (!list_empty(&rq_list)) {
-   blk_mq_sched_mark_restart(hctx);
+   blk_mq_sched_mark_restart_hctx(hctx);
did_work = blk_mq_dispatch_rq_list(hctx, &rq_list);
} else if (!has_sched_dispatch) {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
@@ -331,20 +331,16 @@ static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 
 void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
 {
+   struct request_queue *q = hctx->queue;
unsigned int i;
 
-   if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+   if (test_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
+   if (test_and_clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
+   queue_for_each_hw_ctx(q, hctx, i)
+   blk_mq_sched_restart_hctx(hctx);
+   }
+   } else {
blk_mq_sched_restart_hctx(hctx);
-   else {
-   struct request_queue *q = hctx->queue;
-
-   if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
-   return;
-
-   clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
-
-   queue_for_each_hw_ctx(q, hctx, i)
-   blk_mq_sched_restart_hctx(hctx);
}
 }
 
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 7b5f3b95c78e..a75b16b123f7 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -122,17 +122,27 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
return false;
 }
 
-static inline void blk_mq_sched_mark_restart(struct blk_mq_hw_ctx *hctx)
+/*
+ * Mark a hardware queue as needing a restart.
+ */
+static inline void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 {
-   if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+   if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-   if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
-   struct request_queue *q = hctx->queue;
+}
+
+/*
+ * Mark a hardware queue and the request queue it belongs to as needing a
+ * restart.
+ */
+static inline void blk_mq_sched_mark_restart_queue(struct blk_mq_hw_ctx *hctx)
+{
+   struct request_queue *q = hctx->queue;
 
-   if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
-   set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
-   }
-   }
+   if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
+   set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+   if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
+   set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
 }
 
 static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
-- 
2.11.1



Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 10:58 AM, Jens Axboe  wrote:
> On 02/22/2017 11:56 AM, Linus Torvalds wrote:
>
> OK, so here's what I'll do:
>
> 1) We'll kill the default scheduler choices. sq blk-mq will default to
>mq-deadline, mq blk-mq will default to "none" (at least for now, until
>the new scheduler is done).
> 2) The individual schedulers will be y/m/n selectable, just like any
>other driver.

Yes. That makes sense as options. I can (or, perhaps even more
importantly, a distro can) answer those kinds of questions.

   Linus


Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-02-22 Thread Paolo Valente

> On 17 Feb 2017, at 11:30, Paolo Valente wrote:
> 
> 
>> On 16 Feb 2017, at 11:31, Paolo Valente wrote:
>> 
>>> 
>>> On 15 Feb 2017, at 19:04, Jens Axboe wrote:
>>> 
>>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
 On 02/15/2017 10:24 AM, Paolo Valente wrote:
> 
>> On 10 Feb 2017, at 19:32, Omar Sandoval wrote:
>> 
>> From: Omar Sandoval 
>> 
>> None of the other blk-mq elevator hooks are called with this lock held.
>> Additionally, it can lead to circular locking dependencies between
>> queue_lock and the private scheduler lock.
>> 
> 
> Hi Omar,
> I'm sorry but it seems that a new potential deadlock has shown up.
> See lockdep splat below.
> 
> I've tried to think about different solutions than turning back to
> deferring the body of exit_icq, but to no avail.
 
 Looks like an interaction between bfqd->lock and q->queue_lock. Since the
 core has no notion of your bfqd->lock, the natural dependency here
 would be to nest bfqd->lock inside q->queue_lock. Is that possible for
 you?
 
 Looking at the code a bit, maybe it'd just be simpler to get rid of
 holding the queue lock for that spot. For the mq scheduler, we really
 don't want places where we invoke with that held already. Does the below
 work for you?
>>> 
>>> Would need to remove one more lockdep assert. And only test this for
>>> the mq parts, we'd need to spread a bit of love on the classic
>>> scheduling icq exit path for this to work on that side.
>>> 
>> 
>> Sorry Jens, same splat.  What confuses me is the second column
>> in the possible scenario:
>> 
>> [  139.368477]        CPU0                    CPU1
>> [  139.369129]        ----                    ----
>> [  139.369774]   lock(&(&ioc->lock)->rlock);
>> [  139.370339]                           lock(&(&q->__queue_lock)->rlock);
>> [  139.390579]                           lock(&(&ioc->lock)->rlock);
>> [  139.391522]   lock(&(&bfqd->lock)->rlock);
>> 
>> I could not find any code path, related to the reported call traces,
>> that takes q->queue_lock first and then ioc->lock.
>> 
>> Any suggestion on how to go on, and hopefully help with this problem is
>> welcome.
>> 
> 
> Jens,
> this is just to tell you that I have found the link that still causes
> the circular dependency: an ioc->lock nested inside a queue_lock in
> ioc_create_icq.  I'll try to come back with a solution proposal.
> 

Solution implemented and apparently working.  If anyone wants to have
a look at or test it, it is in the usual branch [1].  I have found
other similar circular dependencies in the meanwhile, and this
solution covers them too.  I'm pasting the message of the last commit
below.  Actually, the branch now also contains a fix for another minor
bug (for the record: to avoid wasting further time preparing several
micro fixes, I have simply merged some minor fixes into existing
commits). I'm starting to work on cgroups support.

Unnest request-queue and ioc locks from scheduler locks

In some bio-merging functions, the request-queue lock must be taken to
look up the bic associated with the process that issued the bio being
merged. In addition, some other functions must invoke put_io_context,
which may in turn take the lock of the involved ioc. In both cases,
these extra request-queue or ioc locks are taken, or may be taken,
while the scheduler lock is held. However, other code paths, in part
external to bfq-mq, take the same locks (nested) in the opposite
order, i.e., the scheduler lock is taken while the request-queue or
the ioc lock is held. This leads to circular deadlocks.

This commit addresses this issue by modifying the logic of the above
functions, so as to let the lookup and put_io_context be performed,
and thus the extra locks be taken, outside the critical sections
protected by the scheduler lock.


Thanks,
Paolo

[1] https://github.com/Algodev-github/bfq-mq

> Thanks,
> Paolo
> 
>> Thanks,
>> Paolo
>> 
>>> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
>>> index b12f9c87b4c3..546ff8f81ede 100644
>>> --- a/block/blk-ioc.c
>>> +++ b/block/blk-ioc.c
>>> @@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
>>> icq->flags |= ICQ_EXITED;
>>> }
>>> 
>>> -/* Release an icq.  Called with both ioc and q locked. */
>>> +/* Release an icq.  Called with ioc locked. */
>>> static void ioc_destroy_icq(struct io_cq *icq)
>>> {
>>> struct io_context *ioc = icq->ioc;
>>> @@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
>>> struct elevator_type *et = q->elevator->type;
>>> 
>>> lockdep_assert_held(&ioc->lock);
>>> -   lockdep_assert_held(q->queue_lock);
>>>

Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-22 Thread Paolo Valente

> On 13 Feb 2017, at 23:38, Bart Van Assche wrote:
> 
> On Mon, 2017-02-13 at 22:07 +, Bart Van Assche wrote:
>> On Mon, 2017-02-13 at 22:07 +0100, Paolo Valente wrote:
>>> but what do you think about trying this fix?
>> 
>> Sorry but with ... the same server I used for the previous test still
>> didn't boot up properly. A screenshot is available at
>> https://goo.gl/photos/Za9QVGCNe2BJBwxVA.
>> 
>>> Otherwise, if you have no news or suggestions, would you be willing to
>>> try my micro-logging proposal https://github.com/Algodev-github/bfq-mq?
>> 
>> Sorry but it's not clear to me what logging mechanism you are referring
>> to and how to enable it? Are you perhaps referring to
>> CONFIG_BFQ_REDIRECT_TO_CONSOLE?
> 
> Anyway, a second screenshot has been added to the same album after I had
> applied the following patch:
> 

Hi Bart,
thanks for this second attempt of yours.  Although it unfortunately
does not give a clear indication of the exact cause of your hang (apart
from a possible deadlock), your log helped me notice another bug.

At any rate, as I have just written to Jens, I have pushed a new
version of the branch [1] (not just new commits: I have also folded
some new changes into existing commits, to make the series quicker to
review).
The branch now contains both a fix for the above bug, and, more
importantly, a fix for the circular dependencies that were still
lurking around.  Could you please test it?

Crossing my fingers,
Paolo

[1] https://github.com/Algodev-github/bfq-mq

> diff --git a/block/Makefile b/block/Makefile
> index 1c04fe19e825..bf472ac82c08 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -2,6 +2,8 @@
> # Makefile for the kernel block layer
> #
>  
> +KBUILD_CFLAGS += -DCONFIG_BFQ_REDIRECT_TO_CONSOLE
> +
> 
> obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
>blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
>blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
> 
> Bart.



Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Jens Axboe
On 02/22/2017 12:04 PM, Linus Torvalds wrote:
> On Wed, Feb 22, 2017 at 10:58 AM, Jens Axboe  wrote:
>> On 02/22/2017 11:56 AM, Linus Torvalds wrote:
>>
>> OK, so here's what I'll do:
>>
>> 1) We'll kill the default scheduler choices. sq blk-mq will default to
>>mq-deadline, mq blk-mq will default to "none" (at least for now, until
>>the new scheduler is done).
>> 2) The individual schedulers will be y/m/n selectable, just like any
>>other driver.
> 
> Yes. That makes sense as options. I can (or, perhaps even more
> importantly, a distro can) answer those kinds of questions.

Someone misspelled pacman:

parman (PARMAN) [N/m/y] (NEW) ?

There is no help available for this option.

Or I think it's pacman, because I have no idea what else it could be. I'm
going to say N.

-- 
Jens Axboe



Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Jens Axboe
On 02/22/2017 02:50 PM, Markus Trippelsdorf wrote:
> On 2017.02.22 at 11:44 -0700, Jens Axboe wrote:
>> On 02/22/2017 11:42 AM, Linus Torvalds wrote:
>>> On Wed, Feb 22, 2017 at 10:26 AM, Linus Torvalds
>>>  wrote:

 And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR 
 USER?
>>>
>>> Basically, I'm pushing back on config options that I can't personally
>>> even sanely answer.
>>
>> I got that much, and I don't disagree on that part.
>>
>>> If it's a config option about "do I have a particular piece of
>>> hardware", it makes sense. But these new ones were just complete
>>> garbage.
>>>
>>> The whole "default IO scheduler" thing is a disease. We should stop
>>> making up these shit schedulers and then say "we don't know which one
>>> works best for you".
>>>
>>> All it does is encourage developers to make shortcuts and create crap
>>> that isn't generically useful, and then blame the user and say "well,
>>> you should have picked a different scheduler" when they say "this does
>>> not work well for me".
>>>
>>> We have had too many of those kinds of broken choices.  And when the
>>> new Kconfig options get so confusing and so esoteric that I go "Hmm, I
>>> have no idea if my hardware does a single queue or not", I put my foot
>>> down.
>>>
>>> When the IO scheduler questions were about a generic IO scheduler for
>>> everything, I can kind of understand them. I think it was still a
>>> mistake (for the reasons outlined above), but at least it was a
>>> comprehensible question to ask.
>>>
>>> But when it gets to "what should I do about a single-queue version of
>>> a MQ scheduler", the question is no longer even remotely sensible. The
>>> question should simply NOT EXIST. There is no possible valid reason to
>>> ask that kind of crap.
>>
>> OK, so here's what I'll do:
>>
>> 1) We'll kill the default scheduler choices. sq blk-mq will default to
>>mq-deadline, mq blk-mq will default to "none" (at least for now, until
>>the new scheduler is done).
> 
> But what about e.g. SATA SSDs? Wouldn't they be better off without any
> scheduler? 

Marginal. If they are single queue, using a basic scheduler like
deadline isn't going to be a significant amount of overhead. In some
cases they are going to be better off, due to better merging. In the
worst case, overhead is slightly higher. Net result is positive, I'd
say.

> So perhaps setting "none" for queue/rotational==0 and mq-deadline for
> spinning drives automatically in the sq blk-mq case?

You can do that through a udev rule. The kernel doesn't know if the
device is rotational or not when we set up the scheduler. So we'd either
have to add code to do that, or simply just do it with a udev rule. I'd
prefer the latter.
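For reference, such a rule might look like the following (a sketch only; the sysfs attribute names are the standard ones, but the device match strings and scheduler choices should be adapted to the system):

```
# /etc/udev/rules.d/60-io-scheduler.rules (hypothetical example)
# Non-rotational devices (SSDs): run without an I/O scheduler.
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", \
        ATTR{queue/scheduler}="none"
# Rotational devices: use mq-deadline.
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", \
        ATTR{queue/scheduler}="mq-deadline"
```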

-- 
Jens Axboe



Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Markus Trippelsdorf
On 2017.02.22 at 11:44 -0700, Jens Axboe wrote:
> On 02/22/2017 11:42 AM, Linus Torvalds wrote:
> > On Wed, Feb 22, 2017 at 10:26 AM, Linus Torvalds
> >  wrote:
> >>
> >> And dammit, IF YOU DON'T EVEN KNOW, WHY THE HELL ARE YOU ASKING THE POOR 
> >> USER?
> > 
> > Basically, I'm pushing back on config options that I can't personally
> > even sanely answer.
> 
> I got that much, and I don't disagree on that part.
> 
> > If it's a config option about "do I have a particular piece of
> > hardware", it makes sense. But these new ones were just complete
> > garbage.
> > 
> > The whole "default IO scheduler" thing is a disease. We should stop
> > making up these shit schedulers and then say "we don't know which one
> > works best for you".
> > 
> > All it does is encourage developers to make shortcuts and create crap
> > that isn't generically useful, and then blame the user and say "well,
> > you should have picked a different scheduler" when they say "this does
> > not work well for me".
> > 
> > We have had too many of those kinds of broken choices.  And when the
> > new Kconfig options get so confusing and so esoteric that I go "Hmm, I
> > have no idea if my hardware does a single queue or not", I put my foot
> > down.
> > 
> > When the IO scheduler questions were about a generic IO scheduler for
> > everything, I can kind of understand them. I think it was still a
> > mistake (for the reasons outlined above), but at least it was a
> > comprehensible question to ask.
> > 
> > But when it gets to "what should I do about a single-queue version of
> > a MQ scheduler", the question is no longer even remotely sensible. The
> > question should simply NOT EXIST. There is no possible valid reason to
> > ask that kind of crap.
> 
> OK, so here's what I'll do:
> 
> 1) We'll kill the default scheduler choices. sq blk-mq will default to
>mq-deadline, mq blk-mq will default to "none" (at least for now, until
>the new scheduler is done).

But what about e.g. SATA SSDs? Wouldn't they be better off without any
scheduler? 
So perhaps setting "none" for queue/rotational==0 and mq-deadline for
spinning drives automatically in the sq blk-mq case?

-- 
Markus


Re: [GIT PULL] Block pull request for- 4.11-rc1

2017-02-22 Thread Linus Torvalds
On Wed, Feb 22, 2017 at 1:50 PM, Markus Trippelsdorf
 wrote:
>
> But what about e.g. SATA SSDs? Wouldn't they be better off without any
> scheduler?
> So perhaps setting "none" for queue/rotational==0 and mq-deadline for
> spinning drives automatically in the sq blk-mq case?

Jens already said that the merging advantage can outweigh the costs,
but he didn't actually talk much about it.

The scheduler advantage can outweigh the costs of running a scheduler
by an absolutely _huge_ amount.

An SSD isn't zero-cost, and each command tends to have some fixed
overhead on the controller, and pretty much all SSD's heavily prefer
fewer large request over lots of tiny ones.

There are also fairness/latency issues that tend to very heavily favor
having an actual scheduler, ie reads want to be scheduled before
writes on an SSD (within reason) in order to make latency better.

Ten years ago, there were lots of people who argued that you don't
want to do scheduling for SSD's, because SSD's were so fast that
you only added overhead. Nobody really believes that fairytale any
more.

So you might have particular loads that look better with noop, but
they will be rare and far between. Try it, by all means, and if it
works for you, set it in your udev rules.

The main place where a noop scheduler currently might make sense is
likely for a ramdisk, but quite frankly, since the main real usecase
for a ram-disk tends to be to make it easy to profile and find the
bottlenecks for performance analysis (for emulating future "infinitely
fast" media), even that isn't true - using noop there defeats the
whole purpose.

  Linus


Re: sense handling improvements

2017-02-22 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph,

>> I can bring them in after Linus' initial block pull.

Christoph> Both the block and SCSI trees are now merged by Linus, and
Christoph> Jens didn't pick up patch one from this series yet - maybe
Christoph> it's best to send the whole series through the SCSI tree in
Christoph> this case.

I applied 1-4 to 4.11/scsi-fixes. Both 5 and 6 had problems so please
fix those up.

-- 
Martin K. Petersen  Oracle Linux Engineering