Re: [PATCH] elevator: allow specifying elevator by config boot param for blk-mq devices

2017-07-13 Thread Johannes Thumshirn
On Thu, Jul 13, 2017 at 08:02:41AM -0600, Jens Axboe wrote:
> Because it sucks as an interface - there's no way to apply different
> settings to different devices, and using the kernel command line to
> control something like this is ugly. It never should have been done.
> 
> The sysfs interface, either manually, scripted, or through udev,
> makes a lot more sense.

Not that I disagree with your reasons, but not being able to select an
mq scheduler confuses quite a few users out there.

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] elevator: allow specifying elevator by config boot param for blk-mq devices

2017-07-13 Thread Paolo Valente

> On 13 Jul 2017, at 16:02, Jens Axboe  wrote:
> 
> On 07/13/2017 04:11 AM, Johannes Thumshirn wrote:
>> On Wed, Jul 12, 2017 at 04:51:33PM -0600, Jens Axboe wrote:
>>> No, that boot option was a mistake, let's not propagate that to mq
>>> scheduling as well.
>> 
>> Can you please explain why? I as well have requests from our users to
>> select the mq schedulers on the command line.
> 
> Because it sucks as an interface - there's no way to apply different
> settings to different devices, and using the kernel command line to
> control something like this is ugly. It never should have been done.
> 
> The sysfs interface, either manually, scripted, or through udev,
> makes a lot more sense.
> 

One doubt: with the new interface, and using udev I guess, is it
still possible to control which I/O scheduler is actually used during
the whole boot process, or at least almost all of it?

Thanks,
Paolo

> -- 
> Jens Axboe
> 



[PATCH] efi: add gpt_overprovisioned kernel command line parameter

2017-07-13 Thread Mark Salyzyn
Allow a GPT partition table to be used if it is overprovisioned, i.e.
the last usable LBA in the GPT header is beyond the device boundaries.
This feature is enabled if gpt_overprovisioned is added to the kernel
command line.  It will not expose any individual partitions that go
beyond the device boundaries, only the ones that are under that limit.

The option defaults to off.  With it enabled, one can perform an
exploratory boot of the kernel and triage or back up the properly
provisioned partitions.  This would allow the system to at least
partially boot, for example if the boot, root or system filesystems
reside on properly provisioned partitions.

This is helpful, for example, should a RAID array be resized downwards,
or, during embedded development, should an overprovisioned partition
table update be flashed to a device by accident.

Signed-off-by: Mark Salyzyn 
---
 block/partitions/efi.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/partitions/efi.c b/block/partitions/efi.c
index 39f70d968754..e18ee5cf138c 100644
--- a/block/partitions/efi.c
+++ b/block/partitions/efi.c
@@ -117,6 +117,15 @@ force_gpt_fn(char *str)
 }
 __setup("gpt", force_gpt_fn);
 
+static int overprovisioned_gpt;
+static int __init
+overprovisioned_gpt_fn(char *str)
+{
+   overprovisioned_gpt = 1;
+   return 1;
+}
+__setup("gpt_overprovisioned", overprovisioned_gpt_fn);
+
 
 /**
  * efi_crc32() - EFI version of crc32 function
@@ -420,7 +429,8 @@ static int is_gpt_valid(struct parsed_partitions *state, 
u64 lba,
pr_debug("GPT: last_usable_lba incorrect: %lld > %lld\n",
 (unsigned long 
long)le64_to_cpu((*gpt)->last_usable_lba),
 (unsigned long long)lastlba);
-   goto fail;
+   if (!overprovisioned_gpt)
+   goto fail;
}
if (le64_to_cpu((*gpt)->last_usable_lba) < 
le64_to_cpu((*gpt)->first_usable_lba)) {
pr_debug("GPT: last_usable_lba incorrect: %lld > %lld\n",
-- 
2.13.2.932.g7449e964c-goog



Re: [PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()

2017-07-13 Thread NeilBrown
On Thu, Jul 13 2017, Shaohua Li wrote:

> On Thu, Jul 13, 2017 at 05:20:52PM +0800, Ming Lei wrote:
>> On Thu, Jul 13, 2017 at 01:09:28PM +1000, NeilBrown wrote:
>> > On Thu, Jul 13 2017, Ming Lei wrote:
>> > 
>> > > On Thu, Jul 13, 2017 at 10:01:33AM +1000, NeilBrown wrote:
>> > >> On Wed, Jul 12 2017, Ming Lei wrote:
>> > >> 
>> > >> > We will support multipage bvec soon, so initialize bvec
>> > >> > table using the standard way instead of writing the
>> > >> > table directly. Otherwise it won't work any more once
>> > >> > multipage bvec is enabled.
>> > >> >
>> > >> > Acked-by: Guoqing Jiang 
>> > >> > Signed-off-by: Ming Lei 
>> > >> > ---
>> > >> >  drivers/md/md.c | 21 +
>> > >> >  drivers/md/md.h |  3 +++
>> > >> >  drivers/md/raid1.c  | 16 ++--
>> > >> >  drivers/md/raid10.c |  4 ++--
>> > >> >  4 files changed, 28 insertions(+), 16 deletions(-)
>> > >> >
>> > >> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > >> > index 8cdca0296749..cc8dcd928dde 100644
>> > >> > --- a/drivers/md/md.c
>> > >> > +++ b/drivers/md/md.c
>> > >> > @@ -9130,6 +9130,27 @@ void md_reload_sb(struct mddev *mddev, int nr)
>> > >> >  }
>> > >> >  EXPORT_SYMBOL(md_reload_sb);
>> > >> >  
>> > >> > +/* generally called after bio_reset() for reseting bvec */
>> > >> > +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages 
>> > >> > *rp,
>> > >> > +int size)
>> > >> > +{
>> > >> > + int idx = 0;
>> > >> > +
>> > >> > + /* initialize bvec table again */
>> > >> > + do {
>> > >> > + struct page *page = resync_fetch_page(rp, idx);
>> > >> > + int len = min_t(int, size, PAGE_SIZE);
>> > >> > +
>> > >> > + /*
>> > >> > +  * won't fail because the vec table is big
>> > >> > +  * enough to hold all these pages
>> > >> > +  */
>> > >> > + bio_add_page(bio, page, len, 0);
>> > >> > + size -= len;
>> > >> > + } while (idx++ < RESYNC_PAGES && size > 0);
>> > >> > +}
>> > >> > +EXPORT_SYMBOL(md_bio_reset_resync_pages);
>> > >> 
>> > >> I really don't think this is a good idea.
>> > >> This code is specific to raid1/raid10.  It is not generic md code.  So
>> > >> it doesn't belong here.
>> > >
>> > > We already added 'struct resync_pages' in drivers/md/md.h, so I think
>> > > it is reasonable to add this function into drivers/md/md.c
>> > 
>> > Alternative perspective: it was unreasonable to add "resync_pages" to
>> > md.h.
>> > 
>> > >
>> > >> 
>> > >> If you want to remove code duplication, then work on moving all raid1
>> > >> functionality into raid10.c, then discard raid1.c
>> > >
>> > > This patch is for avoiding new code duplication, not for removing current
>> > > duplication.
>> > >
>> > >> 
>> > >> Or at the very least, have a separate "raid1-10.c" file for the common
>> > >> code.
>> > >
>> > > You suggested it last time, but looks too overkill to be taken. But if
>> > > someone wants to refactor raid1 and raid10, I think it can be a good 
>> > > start,
>> > > but still not belong to this patch.
>> > 
>> > You are trying to create common code for raid1 and raid10.  This does
>> > not belong in md.c.
>> > If you really want to have a single copy of common code, then it exactly
>> > is the role of this patch to create a place to put it.
>> > I'm not saying you should put all common code in raid1-10.c.   Just the
>> > function that you have identified.
>> 
>> I really don't want to waste time on this kind of thing, I can do
>> either one frankly.
>> 
>> Shaohua, could you tell me which way you would like it merged? I can do it
>> either way.
>
> I don't have a strong preference, but Neil's suggestion does make the code a
> little better. Of course, only put the function into raid1-10.c right now.

To make it as easy as possible, I would suggest creating raid1-10.c
containing just this function (and maybe the definitions from md.h),
declaring the function "static", and #including raid1-10.c into raid1.c
and raid10.c, i.e. no worrying about modules and exporting symbols.

Anyone who cares (and that might even be me) could gradually move
functionality out of raid1.c and raid10.c into raid1-10.c.  Maybe there
would come a tipping point where it is easy to just discard raid1.c and
raid10.c and finish the job.
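
For illustration, a minimal sketch of that layout (hypothetical, not an
actual patch; the helper body is just the one from the patch above, made
static):

/* drivers/md/raid1-10.c: code shared by raid1.c and raid10.c */
static void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
                                      int size)
{
        int idx = 0;

        /* initialize bvec table again */
        do {
                struct page *page = resync_fetch_page(rp, idx);
                int len = min_t(int, size, PAGE_SIZE);

                /* won't fail: the vec table is big enough for these pages */
                bio_add_page(bio, page, len, 0);
                size -= len;
        } while (idx++ < RESYNC_PAGES && size > 0);
}

/* then, near the top of drivers/md/raid1.c and drivers/md/raid10.c: */
#include "raid1-10.c"

With the function static and the file #included, there is no EXPORT_SYMBOL
and nothing extra to worry about at module load time.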

Thanks,
NeilBrown


signature.asc
Description: PGP signature


[for-4.14 RFC PATCH 0/2] dm rq: eliminate historic blk-mq and .request_fn queue stacking restrictions

2017-07-13 Thread Mike Snitzer
Hi,

Please review the 2 patches in this series.  I'm left on the fence
about whether there is any point re-enabling "dm-mq stacked on old
.request_fn device(s)" -- especially given the awkward details
documented in the 1/2 patch header.

I welcome any feedback on this, thanks!

BTW, I tested these changes using mptest:
 git://github.com/snitm/mptest.git

with the following permutations ('runtest' script tweaked before each
permutation):

echo N > /sys/module/scsi_mod/parameters/use_blk_mq
echo N > /sys/module/dm_mod/parameters/use_blk_mq

echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
echo Y > /sys/module/dm_mod/parameters/use_blk_mq

echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
echo N > /sys/module/dm_mod/parameters/use_blk_mq

echo N > /sys/module/scsi_mod/parameters/use_blk_mq
echo Y > /sys/module/dm_mod/parameters/use_blk_mq

Mike Snitzer (2):
  dm rq: avoid deadlock if dm-mq is stacked on old .request_fn device(s)
  dm rq: eliminate historic blk-mq and .request_fn queue stacking restrictions

 drivers/md/dm-mpath.c | 16 ++--
 drivers/md/dm-rq.c| 13 +
 drivers/md/dm-table.c | 31 +++
 drivers/md/dm-target.c|  4 ++--
 drivers/md/dm.h   |  1 -
 include/linux/device-mapper.h |  3 ++-
 6 files changed, 26 insertions(+), 42 deletions(-)

-- 
2.10.1



[for-4.14 RFC PATCH 2/2] dm rq: eliminate historic blk-mq and .request_fn queue stacking restrictions

2017-07-13 Thread Mike Snitzer
Currently if dm_mod.use_blk_mq=Y (or a DM-multipath table is loaded with
queue_mode=mq) and all underlying devices are not blk-mq, DM core will
fail with the error:
  "table load rejected: all devices are not blk-mq request-stackable"

This all-blk-mq-or-nothing approach is too cut-throat because it
prevents access to data stored on what could have been a previously
working multipath setup (e.g. if a user decides to try dm_mod.use_blk_mq=Y
or queue_mode=mq only to find their underlying devices aren't blk-mq).

This restriction, and others like not being able to stack a top-level
blk-mq request-queue on top of old .request_fn device(s), can be removed
thanks to commit eb8db831be ("dm: always defer request allocation to the
owner of the request_queue").

Now that request-based DM will always rely on the target (multipath) to
call blk_get_request() to create a clone request it is possible to
support all 4 permutations of stacking old .request_fn and blk-mq
request_queues.
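
For context, the allocation pattern that commit eb8db831be established in
the target's clone_and_map path looks roughly like this (a paraphrased
sketch, not the verbatim dm-mpath code; "bdev" here stands for the block
device of the selected path, and error handling is trimmed):

	struct request *clone;

	/*
	 * The target, not DM core, allocates the clone from the chosen
	 * underlying device's request_queue, so whether that queue is
	 * blk-mq or .request_fn no longer matters to DM core.
	 */
	clone = blk_get_request(bdev_get_queue(bdev),
				rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
	if (IS_ERR(clone))
		return DM_MAPIO_REQUEUE;

	/* set up the clone, then hand it back to dm-rq */
	return DM_MAPIO_REMAPPED;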

Depends-on: eb8db831be ("dm: always defer request allocation to the owner of 
the request_queue")
Reported-by: Ewan Milne 
Signed-off-by: Mike Snitzer 
---
 drivers/md/dm-rq.c|  5 -
 drivers/md/dm-table.c | 31 +++
 drivers/md/dm.h   |  1 -
 3 files changed, 3 insertions(+), 34 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 95bb44c..d64677b 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -782,11 +782,6 @@ int dm_mq_init_request_queue(struct mapped_device *md, 
struct dm_table *t)
struct dm_target *immutable_tgt;
int err;
 
-   if (!dm_table_all_blk_mq_devices(t)) {
-   DMERR("request-based dm-mq may only be stacked on blk-mq 
device(s)");
-   return -EINVAL;
-   }
-
md->tag_set = kzalloc_node(sizeof(struct blk_mq_tag_set), GFP_KERNEL, 
md->numa_node_id);
if (!md->tag_set)
return -ENOMEM;
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index a39bcd9..e630768 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -46,7 +46,6 @@ struct dm_table {
 
bool integrity_supported:1;
bool singleton:1;
-   bool all_blk_mq:1;
unsigned integrity_added:1;
 
/*
@@ -910,7 +909,6 @@ static int dm_table_determine_type(struct dm_table *t)
 {
unsigned i;
unsigned bio_based = 0, request_based = 0, hybrid = 0;
-   unsigned sq_count = 0, mq_count = 0;
struct dm_target *tgt;
struct dm_dev_internal *dd;
struct list_head *devices = dm_table_get_devices(t);
@@ -985,11 +983,9 @@ static int dm_table_determine_type(struct dm_table *t)
int srcu_idx;
struct dm_table *live_table = dm_get_live_table(t->md, 
&srcu_idx);
 
-   /* inherit live table's type and all_blk_mq */
-   if (live_table) {
+   /* inherit live table's type */
+   if (live_table)
t->type = live_table->type;
-   t->all_blk_mq = live_table->all_blk_mq;
-   }
dm_put_live_table(t->md, srcu_idx);
return 0;
}
@@ -999,25 +995,9 @@ static int dm_table_determine_type(struct dm_table *t)
struct request_queue *q = bdev_get_queue(dd->dm_dev->bdev);
 
if (!blk_queue_stackable(q)) {
-   DMERR("table load rejected: including"
- " non-request-stackable devices");
+   DMERR("table load rejected: includes 
non-request-stackable devices");
return -EINVAL;
}
-
-   if (q->mq_ops)
-   mq_count++;
-   else
-   sq_count++;
-   }
-   if (sq_count && mq_count) {
-   DMERR("table load rejected: not all devices are blk-mq 
request-stackable");
-   return -EINVAL;
-   }
-   t->all_blk_mq = mq_count > 0;
-
-   if (t->type == DM_TYPE_MQ_REQUEST_BASED && !t->all_blk_mq) {
-   DMERR("table load rejected: all devices are not blk-mq 
request-stackable");
-   return -EINVAL;
}
 
return 0;
@@ -1067,11 +1047,6 @@ bool dm_table_request_based(struct dm_table *t)
return __table_type_request_based(dm_table_get_type(t));
 }
 
-bool dm_table_all_blk_mq_devices(struct dm_table *t)
-{
-   return t->all_blk_mq;
-}
-
 static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device 
*md)
 {
enum dm_queue_mode type = dm_table_get_type(t);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 38c84c0..c484c4d 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -70,7 +70,6 @@ struct dm_target *dm_table_get_immutable_target(struct 
dm_table *t);
 struct dm_target *dm_table_get_wildcard_target(struct dm_table *t);
 bool dm_table_bio_based(struct dm_table *t);
 bool dm_table_request_based(struct dm_table *t);
-bool dm_t

[for-4.14 RFC PATCH 1/2] dm rq: avoid deadlock if dm-mq is stacked on old .request_fn device(s)

2017-07-13 Thread Mike Snitzer
Conditionally use __blk_put_request() or blk_put_request() instead of
just blk_put_request() in multipath_release_clone().

Otherwise a deadlock will occur because scsi_end_request() will take the
clone request's queue_lock, around its call to blk_finish_request(), and
then the later blk_put_request() also tries to take it:

[12749.916332]  queued_spin_lock_slowpath+0xb/0xf
[12749.916335]  _raw_spin_lock_irqsave+0x37/0x40
[12749.916339]  blk_put_request+0x39/0x60
[12749.916342]  multipath_release_clone+0xe/0x10 [dm_multipath]
[12749.916350]  dm_softirq_done+0x156/0x240 [dm_mod]
[12749.916353]  __blk_mq_complete_request+0x90/0x140
[12749.916355]  blk_mq_complete_request+0x16/0x20
[12749.916360]  dm_complete_request+0x23/0x40 [dm_mod]
[12749.916365]  end_clone_request+0x1d/0x20 [dm_mod]
[12749.916367]  blk_finish_request+0x83/0x120
[12749.916370]  scsi_end_request+0x12d/0x1d0
[12749.916371]  scsi_io_completion+0x13c/0x630
[12749.916374]  ? set_next_entity+0x7c/0x780
[12749.916376]  scsi_finish_command+0xd9/0x120
[12749.916378]  scsi_softirq_done+0x12a/0x150
[12749.916380]  blk_done_softirq+0x9e/0xd0
[12749.916382]  __do_softirq+0xc9/0x269
[12749.916384]  run_ksoftirqd+0x29/0x50
[12749.916385]  smpboot_thread_fn+0x110/0x160
[12749.916387]  kthread+0x109/0x140
[12749.916389]  ? sort_range+0x30/0x30
[12749.916390]  ? kthread_park+0x60/0x60
[12749.916391]  ret_from_fork+0x25/0x30

This "fix" is gross in that the long-term fitness of stacking blk-mq DM
multipath (dm-mq) on top of old .request_fn devices is questionable.  The
above stack trace shows just how ugly it is to have old .request_fn SCSI
code cascade into blk-mq code during DM multipath request completion.

Signed-off-by: Mike Snitzer 
---
 drivers/md/dm-mpath.c | 16 ++--
 drivers/md/dm-rq.c|  8 +---
 drivers/md/dm-target.c|  4 ++--
 include/linux/device-mapper.h |  3 ++-
 4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 0e8ab5b..34cf7b6 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -520,9 +520,21 @@ static int multipath_clone_and_map(struct dm_target *ti, 
struct request *rq,
return DM_MAPIO_REMAPPED;
 }
 
-static void multipath_release_clone(struct request *clone)
+static void multipath_release_clone(struct dm_target *ti, struct request 
*clone)
 {
-   blk_put_request(clone);
+   struct multipath *m = ti->private;
+   struct request_queue *q = clone->q;
+
+   if (!q->mq_ops && m->queue_mode == DM_TYPE_MQ_REQUEST_BASED) {
+   /*
+* dm-mq on .request_fn already holds clone->q->queue_lock
+* via blk_finish_request()...
+* - true for .request_fn SCSI, but is it _always_ true?
+*/
+   lockdep_assert_held(q->queue_lock);
+   __blk_put_request(q, clone);
+   } else
+   blk_put_request(clone);
 }
 
 /*
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index c6ebc5b..95bb44c 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -220,11 +220,12 @@ static void dm_end_request(struct request *clone, 
blk_status_t error)
 {
int rw = rq_data_dir(clone);
struct dm_rq_target_io *tio = clone->end_io_data;
+   struct dm_target *ti = tio->ti;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;
 
blk_rq_unprep_clone(clone);
-   tio->ti->type->release_clone_rq(clone);
+   ti->type->release_clone_rq(ti, clone);
 
rq_end_stats(md, rq);
if (!rq->q->mq_ops)
@@ -267,6 +268,7 @@ static void dm_mq_delay_requeue_request(struct request *rq, 
unsigned long msecs)
 
 static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool 
delay_requeue)
 {
+   struct dm_target *ti = tio->ti;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;
int rw = rq_data_dir(rq);
@@ -274,7 +276,7 @@ static void dm_requeue_original_request(struct 
dm_rq_target_io *tio, bool delay_
rq_end_stats(md, rq);
if (tio->clone) {
blk_rq_unprep_clone(tio->clone);
-   tio->ti->type->release_clone_rq(tio->clone);
+   ti->type->release_clone_rq(ti, tio->clone);
}
 
if (!rq->q->mq_ops)
@@ -488,7 +490,7 @@ static int map_request(struct dm_rq_target_io *tio)
case DM_MAPIO_REMAPPED:
if (setup_clone(clone, rq, tio, GFP_ATOMIC)) {
/* -ENOMEM */
-   ti->type->release_clone_rq(clone);
+   ti->type->release_clone_rq(ti, clone);
return DM_MAPIO_REQUEUE;
}
 
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index c0d7e60..adbd17b 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -138,12 +138,12 @@ static int io_err_clone_and_map_rq(struct dm_target *ti, 
struct request *rq,
return DM_MAPI

Re: 4.12 NULL pointer dereference in kmem_cache_free on USB storage removal

2017-07-13 Thread Bart Van Assche
On Thu, 2017-07-13 at 23:24 +0300, Meelis Roos wrote:
> [258062.320700] RIP: 0010:kmem_cache_free+0x12/0x160
> [258062.320886] Call Trace:
> [258062.320897]  scsi_exit_rq+0x4d/0x60
> [258062.320909]  free_request_size+0x1c/0x30
> [258062.320923]  mempool_destroy+0x1d/0x60
> [258062.320935]  blk_exit_rl+0x1b/0x40
> [258062.320946]  __blk_release_queue+0x7d/0x120
> [258062.320959]  process_one_work+0x1af/0x340
> [258062.320972]  worker_thread+0x43/0x3e0
> [258062.320984]  kthread+0xfe/0x130
> [258062.320995]  ? create_worker+0x170/0x170
> [258062.321007]  ? kthread_create_on_node+0x40/0x40
> [258062.321022]  ret_from_fork+0x22/0x30

Hello Meelis,

Thank you for your report. Can you apply commit 8e6882545d8c ("scsi: Avoid
that scsi_exit_rq() triggers a use-after-free") on top of kernel v4.12 and
retest? That commit has been tagged "Cc: stable" so I hope that this patch
will be included in kernel v4.12.1. However, that kernel is not yet available
unfortunately ...

Thanks,

Bart.

Re: [v4.12-rc6 regression] commit dc9edc44de6c introduced use-after-free

2017-07-13 Thread Bart Van Assche
On Thu, 2017-06-29 at 19:34 +0800, Eryu Guan wrote:
> Hi all,
> 
> I got a use-after-free report from kasan-enabled kernel, when running
> fstests xfs/279 (generic/108 could trigger too). I appended the console
> log at the end of email.
> 
> git bisect pointed first bad commit to dc9edc44de6c ("block: Fix a
> blk_exit_rl() regression"), and reverting that commit on top of
> v4.12-rc7 kernel does resolve the use-after-free.
> 
> I can reproduce it by simply inserting & removing scsi_debug module.
> 
> modprobe scsi_debug
> modprobe -r scsi_debug
> 
> If you need more info please let me know.
> 
> Thanks,
> Eryu
> 
> [  101.977744] run fstests xfs/279 at 2017-06-29 19:08:59
> [  102.458699] scsi host5: scsi_debug: version 1.86 [20160430]
> [  102.458699]   dev_size_mb=128, opts=0x0, submit_queues=1, statistics=0
> [  102.472103] scsi 5:0:0:0: Direct-Access Linuxscsi_debug   0186 
> PQ: 0 ANSI: 7
> [  102.503428] sd 5:0:0:0: Attached scsi generic sg5 type 0
> [  102.505414] sd 5:0:0:0: [sde] 262144 512-byte logical blocks: (134 MB/128 
> MiB)
> [  102.505418] sd 5:0:0:0: [sde] 4096-byte physical blocks
> [  102.506568] sd 5:0:0:0: [sde] Write Protect is off
> [  102.508874] sd 5:0:0:0: [sde] Write cache: enabled, read cache: enabled, 
> supports DPO and FUA
> [  102.535845] sd 5:0:0:0: [sde] Attached SCSI disk
> [  104.876076] sd 5:0:0:0: [sde] Synchronizing SCSI cache
> [  104.92] 
> ==
> [  104.932796] BUG: KASAN: use-after-free in scsi_exit_rq+0xf3/0x120
> [  104.938886] Read of size 1 at addr 88022d574580 by task kworker/3:1/78
> [  104.945755]
> [  104.947254] CPU: 3 PID: 78 Comm: kworker/3:1 Not tainted 4.12.0-rc6.kasan 
> #98
> [  104.954382] Hardware name: IBM System x3550 M3 -[7944OEJ]-/90Y4784 , 
> BIOS -[D6E150CUS-1.11]- 02/08/2011
> [  104.964117] Workqueue: events __blk_release_queue
> [  104.968819] Call Trace:
> [  104.971271]  dump_stack+0x63/0x89
> [  104.974588]  print_address_description+0x78/0x290
> [  104.979291]  ? scsi_exit_rq+0xf3/0x120
> [  104.983042]  kasan_report+0x230/0x340
> [  104.986706]  __asan_report_load1_noabort+0x19/0x20
> [  104.991496]  scsi_exit_rq+0xf3/0x120
> [  104.995074]  free_request_size+0x44/0x60
> [  104.998999]  mempool_destroy.part.6+0x9b/0x150
> [  105.003444]  mempool_destroy+0x13/0x20
> [  105.007195]  blk_exit_rl+0x3b/0x60
> [  105.010599]  __blk_release_queue+0x14c/0x410
> [  105.014874]  process_one_work+0x5be/0xe90
> [  105.018883]  worker_thread+0xe4/0xe70
> [  105.022547]  ? pci_mmcfg_check_reserved+0x110/0x110
> [  105.027423]  kthread+0x2d3/0x3d0
> [  105.030653]  ? process_one_work+0xe90/0xe90
> [  105.034836]  ? kthread_create_on_node+0xb0/0xb0
> [  105.039366]  ret_from_fork+0x25/0x30
> [  105.042940]
> [  105.044436] Allocated by task 2763:
> [  105.047927]  save_stack_trace+0x1b/0x20
> [  105.051761]  save_stack+0x46/0xd0
> [  105.055074]  kasan_kmalloc+0xad/0xe0
> [  105.058653]  __kmalloc+0x105/0x1f0
> [  105.062057]  scsi_host_alloc+0x6d/0x11b0
> [  105.065980]  0xa0ad5ba6
> [  105.069123]  driver_probe_device+0x5d2/0xc70
> [  105.073393]  __device_attach_driver+0x1d3/0x2a0
> [  105.077920]  bus_for_each_drv+0x114/0x1c0
> [  105.081928]  __device_attach+0x1bf/0x290
> [  105.085850]  device_initial_probe+0x13/0x20
> [  105.090031]  bus_probe_device+0x19b/0x240
> [  105.094038]  device_add+0x842/0x1420
> [  105.097616]  device_register+0x1a/0x20
> [  105.101365]  0xa0adf185
> [  105.104507]  0xa0920a55
> [  105.107650]  do_one_initcall+0x91/0x210
> [  105.111487]  do_init_module+0x1bb/0x549
> [  105.115323]  load_module+0x4ea8/0x5f50
> [  105.119073]  SYSC_finit_module+0x169/0x1a0
> [  105.123169]  SyS_finit_module+0xe/0x10
> [  105.126919]  do_syscall_64+0x18a/0x410
> [  105.130669]  return_from_SYSCALL_64+0x0/0x6a
> [  105.134937]
> [  105.136432] Freed by task 2823:
> [  105.139573]  save_stack_trace+0x1b/0x20
> [  105.143407]  save_stack+0x46/0xd0
> [  105.146721]  kasan_slab_free+0x72/0xc0
> [  105.150471]  kfree+0x96/0x1a0
> [  105.153440]  scsi_host_dev_release+0x2cb/0x430
> [  105.157883]  device_release+0x76/0x1d0
> [  105.161634]  kobject_put+0x192/0x3f0
> [  105.165209]  put_device+0x17/0x20
> [  105.168524]  scsi_host_put+0x15/0x20
> [  105.172100]  0xa0ad8e0b
> [  105.175242]  device_release_driver_internal+0x26a/0x4e0
> [  105.180463]  device_release_driver+0x12/0x20
> [  105.184733]  bus_remove_device+0x2d0/0x590
> [  105.188830]  device_del+0x526/0x8d0
> [  105.192317]  device_unregister+0x1a/0xa0
> [  105.196239]  0xa0ad6381
> [  105.199379]  0xa0ae8924
> [  105.202520]  SyS_delete_module+0x38e/0x440
> [  105.206617]  do_syscall_64+0x18a/0x410
> [  105.210366]  return_from_SYSCALL_64+0x0/0x6a
> [  105.214634]
> [  105.216130] The buggy address belongs to the object at 88022d574400
> [  105.216130]  which belongs to the cache kmalloc-2048 of size 2048
> [  105.228808] The buggy address is

4.12 NULL pointer dereference in kmem_cache_free on USB storage removal

2017-07-13 Thread Meelis Roos
Just found this in dmesg, on disconnecting unmounted camera USB storage.

[219369.449387] usb 3-2: new full-speed USB device number 3 using xhci_hcd
[219369.598434] usb 3-2: New USB device found, idVendor=054c, idProduct=05be
[219369.598437] usb 3-2: New USB device strings: Mfr=1, Product=2, 
SerialNumber=3
[219369.598439] usb 3-2: Product: USB Bus-powered Device
[219369.598440] usb 3-2: Manufacturer: Sony
[219369.598442] usb 3-2: SerialNumber: C67320732569
[219369.605665] input: Sony USB Bus-powered Device as 
/devices/pci:00/:00:14.0/usb3/3-2/3-2:1.0/0003:054C:05BE.0002/input/input14
[219369.605843] hid-generic 0003:054C:05BE.0002: input,hidraw1: USB HID v1.11 
Device [Sony USB Bus-powered Device] on usb-:00:14.0-2/input0
[219369.626538] usb 3-2: USB disconnect, device number 3
[219373.529247] usb 3-2: new high-speed USB device number 4 using xhci_hcd
[219373.671251] usb 3-2: New USB device found, idVendor=054c, idProduct=0529
[219373.671254] usb 3-2: New USB device strings: Mfr=1, Product=2, 
SerialNumber=3
[219373.671255] usb 3-2: Product: DSC-RX100
[219373.671257] usb 3-2: Manufacturer: Sony
[219373.671258] usb 3-2: SerialNumber: C67320732569
[219373.672166] usb-storage 3-2:1.0: USB Mass Storage device detected
[219373.672280] scsi host7: usb-storage 3-2:1.0
[219374.687670] scsi 7:0:0:0: Direct-Access Sony DSC  1.00 
PQ: 0 ANSI: 0
[219374.688100] sd 7:0:0:0: Attached scsi generic sg7 type 0
[219374.796049] sd 7:0:0:0: [sdg] 62676992 512-byte logical blocks: (32.1 
GB/29.9 GiB)
[219374.796934] sd 7:0:0:0: [sdg] Write Protect is off
[219374.796937] sd 7:0:0:0: [sdg] Mode Sense: 03 00 00 00
[219374.798338] sd 7:0:0:0: [sdg] No Caching mode page found
[219374.798343] sd 7:0:0:0: [sdg] Assuming drive cache: write through
[219374.811127]  sdg: sdg1
[219374.817442] sd 7:0:0:0: [sdg] Attached SCSI removable disk
[219378.245104] FAT-fs (sdg1): utf8 is not a recommended IO charset for FAT 
filesystems, filesystem will be case sensitive!
[219378.286284] FAT-fs (sdg1): Volume was not properly unmounted. Some data may 
be corrupt. Please run fsck.
[255861.349304] usb 3-3: USB disconnect, device number 2
[255861.706636] usb 3-3: new low-speed USB device number 5 using xhci_hcd
[255861.851692] usb 3-3: New USB device found, idVendor=046d, idProduct=c00e
[255861.851697] usb 3-3: New USB device strings: Mfr=1, Product=2, 
SerialNumber=0
[255861.851699] usb 3-3: Product: USB-PS/2 Optical Mouse
[255861.851701] usb 3-3: Manufacturer: Logitech
[255861.855516] input: Logitech USB-PS/2 Optical Mouse as 
/devices/pci:00/:00:14.0/usb3/3-3/3-3:1.0/0003:046D:C00E.0003/input/input15
[255861.855767] hid-generic 0003:046D:C00E.0003: input,hidraw0: USB HID v1.10 
Mouse [Logitech USB-PS/2 Optical Mouse] on usb-:00:14.0-3/input0
[258062.212706] usb 3-2: USB disconnect, device number 4
[258062.320255] BUG: unable to handle kernel NULL pointer dereference at 
0009
[258062.320290] IP: kmem_cache_free+0x12/0x160
[258062.320303] PGD 0 
[258062.320303] P4D 0 

[258062.320324] Oops:  [#1] SMP
[258062.320334] Modules linked in: nls_utf8 nls_cp775 vfat fat joydev uinput 
snd_hrtimer uas usb_storage xt_CT iptable_raw xt_length xt_mark iptable_mangle 
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat ipt_REJECT 
nf_reject_ipv4 xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack 
nf_conntrack libcrc32c iptable_filter cpufreq_powersave cpufreq_userspace 
usbhid snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic 
x86_pkg_temp_thermal intel_powerclamp kvm_intel iTCO_wdt iTCO_vendor_support 
kvm irqbypass crct10dif_pclmul i915 crc32c_intel iosf_mbi i2c_algo_bit 
drm_kms_helper drm sr_mod cdrom intel_gtt agpgart xhci_pci e1000e syscopyarea 
snd_hda_intel xhci_hcd ehci_pci snd_hda_codec sysfillrect ehci_hcd pcspkr 
serio_raw sysimgblt snd_hda_core e100 ptp usbcore lpc_ich fb_sys_fops
[258062.320536]  mii sg pps_core usb_common snd_hwdep mfd_core i2c_i801 
nuvoton_cir rc_core video coretemp hwmon 8021q garp mrp stp llc snd_seq_dummy 
snd_seq_oss snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_seq_midi_event 
snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore parport_pc lp 
parport ip_tables x_tables autofs4
[258062.320623] CPU: 2 PID: 21200 Comm: kworker/2:2 Not tainted 4.12.0 #181
[258062.320641] Hardware name:  /DH77KC, BIOS 
KCH7710H.86A.0110.2013.0513.1018 05/13/2013
[258062.320668] Workqueue: events __blk_release_queue
[258062.320682] task: 88010da0ea00 task.stack: c900029f8000
[258062.320700] RIP: 0010:kmem_cache_free+0x12/0x160
[258062.320713] RSP: 0018:c900029fbdb8 EFLAGS: 00010202
[258062.320729] RAX: 813ab050 RBX: 8802151e1880 RCX: 
0003
[258062.320749] RDX: 880213736840 RSI: 8802151e1880 RDI: 

[258062.320768] RBP: c900029fbdc8 R08: 880213736920 R09: 
000180800040
[258062.320788] R10: c900029fbe20 R11: 0c00 R12: 

[2580

Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()

2017-07-13 Thread Bart Van Assche
On Thu, 2017-07-13 at 18:23 +0800, Ming Lei wrote:
> Please discuss the congestion control in patches 4 and 5.

Hello Ming,

Since there is a significant risk that this patch series will introduce
performance and/or functional regressions, I will wait with reviewing this
patch series further until it has been shown that there is no less risky
alternative to realize the same performance improvement. I think Jens had
already suggested a possible alternative, namely by only starting a queue
if it really has to be started.

Bart.

Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold

2017-07-13 Thread Bart Van Assche
On Thu, 2017-07-13 at 23:32 +0800, Ming Lei wrote:
> On Thu, Jul 13, 2017 at 02:56:38PM +, Bart Van Assche wrote:
> > hctx_may_queue() severely limits the queue depth if many LUNs are associated
> > with the same SCSI host. I think that this is a performance regression
> > compared to scsi-sq and that this performance regression should be fixed.
> 
> IMO, it is hard to evaluate/compare perf between scsi-mq vs scsi-sq:
> 
>   - how many LUNs do you run IO on concurrently?
>   - evaluate the perf on single LUN or multi LUN?
> 
> BTW, active_queues is a runtime variable which accounts the actual active
> queues in use.

Hello Ming,

What I described can be verified easily by running fio against a single LUN
of a SCSI host with which a large number of LUNs are associated.

BTW, something I overlooked in my analysis of the active_queues variable is
that it is not only decremented if a queue is destroyed but also if a queue
is idle for (block layer request timeout) seconds. So the problem I
described will only occur if some software submits I/O requests periodically
to all SCSI LUNs with a shorter interval than the I/O timeout. I'm not sure
any Linux software does this today. As an example, the time between path
checks by the multipathd TUR checker typically exceeds the timeout of a
single TUR check.

Bart.

Re: [PATCH 2/2] block: delete bio_uninit

2017-07-13 Thread Christoph Hellwig
On Thu, Jul 13, 2017 at 09:50:23AM -0700, Shaohua Li wrote:
> > As said in the last mail I think there is no point in having this call..
> 
> I'm hesitant to do this. bio_associate_blkcg/bio_associate_current can be
> called at any time for a bio, so we don't just attach cgroup info at bio
> submit (maybe the bio_associate_blkcg/bio_associate_current callers always
> do it at submit, but I haven't audited that yet).

bio_associate_current is only called from blk_throtl_assoc_bio, which
is only called from blk_throtl_bio, which is only called from
blkcg_bio_issue_check, which is only called from
generic_make_request_checks, which is only called from
generic_make_request, which at that point consumes the bio from the
callers perspective.

bio_associate_blkcg might be a different story, but then we should
treat both very differently.

> The other reason is that I'd like
> blk_throtl_bio_endio to be called only once for the whole bio, not for
> each split bio, so this depends on bio_free to free the cgroup info for
> chained bios, which always do bio_put.

Then we'll need to fix that.  We really should not require every
caller of bio_init to pair it with a new uninit call, which would be
a whole lot more work.


Re: [PATCH 2/2] block: delete bio_uninit

2017-07-13 Thread Shaohua Li
On Thu, Jul 13, 2017 at 09:45:19AM +0200, Christoph Hellwig wrote:
> >  static void bio_free(struct bio *bio)
> >  {
> > struct bio_set *bs = bio->bi_pool;
> > void *p;
> >  
> > -   bio_uninit(bio);
> > +   bio_disassociate_task(bio);
> 
> As said in the last mail I think there is no point in having this call..

I'm hesitant to do this. bio_associate_blkcg/bio_associate_current can be
called at any time for a bio, so we don't just attach cgroup info at bio
submit (maybe the bio_associate_blkcg/bio_associate_current callers always
do it at submit, but I haven't audited that yet). The other reason is that
I'd like blk_throtl_bio_endio to be called only once for the whole bio, not
for each split bio, so this depends on bio_free to free the cgroup info for
chained bios, which always do bio_put.

> > @@ -294,7 +289,7 @@ void bio_reset(struct bio *bio)
> >  {
> > unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);
> >  
> > -   bio_uninit(bio);
> > +   bio_disassociate_task(bio);
> 
> .. and this one.  And I suspect it would make sense to have all these
> changes in a single patch.

This one is likely ok, but again I haven't audited it yet. I'm fine with
these changes being in a single patch.

Thanks,
Shaohua


Re: [PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()

2017-07-13 Thread Shaohua Li
On Thu, Jul 13, 2017 at 05:20:52PM +0800, Ming Lei wrote:
> On Thu, Jul 13, 2017 at 01:09:28PM +1000, NeilBrown wrote:
> > On Thu, Jul 13 2017, Ming Lei wrote:
> > 
> > > On Thu, Jul 13, 2017 at 10:01:33AM +1000, NeilBrown wrote:
> > >> On Wed, Jul 12 2017, Ming Lei wrote:
> > >> 
> > >> > We will support multipage bvec soon, so initialize bvec
> > >> > table using the standard way instead of writing the
> > >> > table directly. Otherwise it won't work any more once
> > >> > multipage bvec is enabled.
> > >> >
> > >> > Acked-by: Guoqing Jiang 
> > >> > Signed-off-by: Ming Lei 
> > >> > ---
> > >> >  drivers/md/md.c | 21 +
> > >> >  drivers/md/md.h |  3 +++
> > >> >  drivers/md/raid1.c  | 16 ++--
> > >> >  drivers/md/raid10.c |  4 ++--
> > >> >  4 files changed, 28 insertions(+), 16 deletions(-)
> > >> >
> > >> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > >> > index 8cdca0296749..cc8dcd928dde 100644
> > >> > --- a/drivers/md/md.c
> > >> > +++ b/drivers/md/md.c
> > >> > @@ -9130,6 +9130,27 @@ void md_reload_sb(struct mddev *mddev, int nr)
> > >> >  }
> > >> >  EXPORT_SYMBOL(md_reload_sb);
> > >> >  
> > >> > +/* generally called after bio_reset() for reseting bvec */
> > >> > +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages 
> > >> > *rp,
> > >> > + int size)
> > >> > +{
> > >> > +  int idx = 0;
> > >> > +
> > >> > +  /* initialize bvec table again */
> > >> > +  do {
> > >> > +  struct page *page = resync_fetch_page(rp, idx);
> > >> > +  int len = min_t(int, size, PAGE_SIZE);
> > >> > +
> > >> > +  /*
> > >> > +   * won't fail because the vec table is big
> > >> > +   * enough to hold all these pages
> > >> > +   */
> > >> > +  bio_add_page(bio, page, len, 0);
> > >> > +  size -= len;
> > >> > +  } while (idx++ < RESYNC_PAGES && size > 0);
> > >> > +}
> > >> > +EXPORT_SYMBOL(md_bio_reset_resync_pages);
> > >> 
> > >> I really don't think this is a good idea.
> > >> This code is specific to raid1/raid10.  It is not generic md code.  So
> > >> it doesn't belong here.
> > >
> > > We already added 'struct resync_pages' in drivers/md/md.h, so I think
> > > it is reasonable to add this function into drivers/md/md.c
> > 
> > Alternative perspective: it was unreasonable to add "resync_pages" to
> > md.h.
> > 
> > >
> > >> 
> > >> If you want to remove code duplication, then work on moving all raid1
> > >> functionality into raid10.c, then discard raid1.c
> > >
> > > This patch is for avoiding new code duplication, not for removing current
> > > duplication.
> > >
> > >> 
> > >> Or at the very least, have a separate "raid1-10.c" file for the common
> > >> code.
> > >
> > > You suggested it last time, but looks too overkill to be taken. But if
> > > someone wants to refactor raid1 and raid10, I think it can be a good 
> > > start,
> > > but still not belong to this patch.
> > 
> > You are trying to create common code for raid1 and raid10.  This does
> > not belong in md.c.
> > If you really want to have a single copy of common code, then it exactly
> > is the role of this patch to create a place to put it.
> > I'm not saying you should put all common code in raid1-10.c.   Just the
> > function that you have identified.
> 
> I really don't want to waste time on this kind of thing, I can do
> either one frankly.
> 
> Shaohua, could you tell me which way you would like it merged? I can do it
> either way.

I don't have a strong preference, but Neil's suggestion does make the code a
little better. Of course, only put the function into raid1-10.c right now.

Thanks,
Shaohua


Re: [PATCH] blk-mq: only alloc hctx point as many as nr_hw_queues

2017-07-13 Thread Jens Axboe
On 07/13/2017 10:30 AM, weiping zhang wrote:
> On Thu, Jul 13, 2017 at 10:14:31AM -0600, Jens Axboe wrote:
>> On 07/13/2017 10:10 AM, weiping zhang wrote:
> >>> one hw queue only has one hctx, so reduce its number to match nr_hw_queues.
>>
>> What happens if a device registers with eg 4 hw queues, then later
>> calls blk_mq_update_nr_hw_queues()?
>>
> 
> You mean: with 10 CPUs, a device has 10 hw queues but only registers 4,
> and then updates it to 10, so there is no place to store the new hctxs.
> Am I understanding that right? If yes, could you tell me why a driver
> would do that? Thanks.

It doesn't matter why the driver would do that, the fact is that this is
a feature we support and it's a part of the API. NBD uses it, for
instance. Your patch breaks it, for saving a bit of memory. So I will
not include it.
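
To spell out that scenario (an illustrative sketch of the API usage, not
actual driver code; "set" and "q" are assumed to be the driver's
blk_mq_tag_set and the resulting request_queue, error handling omitted):

	/* registration time: the driver asks for 4 hw queues */
	set->nr_hw_queues = 4;
	q = blk_mq_init_queue(set);	/* queue_hw_ctx[] currently gets nr_cpu_ids slots */

	/* later, e.g. an NBD reconfigure, grows to 10 hw queues */
	blk_mq_update_nr_hw_queues(set, 10);
	/*
	 * The new hctx pointers go into the same queue_hw_ctx[] array; if
	 * that array had been sized for the initial 4 queues only, entries
	 * 4..9 would land past the end of the allocation.
	 */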

-- 
Jens Axboe



Re: [PATCH] blk-mq: only alloc hctx point as many as nr_hw_queues

2017-07-13 Thread weiping zhang
On Thu, Jul 13, 2017 at 10:14:31AM -0600, Jens Axboe wrote:
> On 07/13/2017 10:10 AM, weiping zhang wrote:
> > one hw queue only has one hctx, so reduce its number to match nr_hw_queues.
> 
> What happens if a device registers with eg 4 hw queues, then later
> calls blk_mq_update_nr_hw_queues()?
> 

You mean: with 10 CPUs, a device has 10 hw queues but only registers 4,
and then updates it to 10, so there is no place to store the new hctxs.
Am I understanding that right? If yes, could you tell me why a driver
would do that? Thanks.

--
weiping


Re: [PATCH] blk-mq: only alloc hctx point as many as nr_hw_queues

2017-07-13 Thread Jens Axboe
On 07/13/2017 10:10 AM, weiping zhang wrote:
> one hw queue only has one hctx, so reduce its number to match nr_hw_queues.

What happens if a device registers with eg 4 hw queues, then later
calls blk_mq_update_nr_hw_queues()?

-- 
Jens Axboe



[PATCH] blk-mq: only alloc hctx point as many as nr_hw_queues

2017-07-13 Thread weiping zhang
one hw queue only has one hctx, so reduce its number to match nr_hw_queues.

Signed-off-by: weiping zhang 
---
 block/blk-mq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 05dfa3f..8c98f59 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2320,8 +2320,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
/* init q->mq_kobj and sw queues' kobjects */
blk_mq_sysfs_init(q);
 
-   q->queue_hw_ctx = kzalloc_node(nr_cpu_ids * sizeof(*(q->queue_hw_ctx)),
-   GFP_KERNEL, set->numa_node);
+   q->queue_hw_ctx = kzalloc_node(set->nr_hw_queues *
+   sizeof(*(q->queue_hw_ctx)),GFP_KERNEL, set->numa_node);
if (!q->queue_hw_ctx)
goto err_percpu;
 
-- 
2.9.4



Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold

2017-07-13 Thread Ming Lei
On Thu, Jul 13, 2017 at 02:56:38PM +, Bart Van Assche wrote:
> On Thu, 2017-07-13 at 18:43 +0800, Ming Lei wrote:
> > On Wed, Jul 12, 2017 at 03:39:14PM +, Bart Van Assche wrote:
> > > On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > > > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > > > What happens with fluid congestion boundaries, with shared tags?
> > > > 
> > > > The approach in this patch should work, but the threshold may not
> > > > be accurate in this way, one simple method is to use the average
> > > > tag weight in EWMA, like this:
> > > > 
> > > > sbitmap_weight() / hctx->tags->active_queues
> > > 
> > > Hello Ming,
> > > 
> > > That approach would result in a severe performance degradation. 
> > > "active_queues"
> > > namely represents the number of queues against which I/O ever has been 
> > > queued.
> > > If e.g. 64 LUNs would be associated with a single SCSI host and all 64 
> > > LUNs are
> > > responding and if the queue depth would also be 64 then the approach you
> > > proposed will reduce the effective queue depth per LUN from 64 to 1.
> > 
> > No, this approach does _not_ reduce the effective queue depth, it only
> > stops the queue for a while when the queue is busy enough.
> >
> > In this case, there may not be any congestion, because blk-mq allows at
> > most queue_depth/active_queues tags to be assigned to each LUN, please
> > see hctx_may_queue().
> 
> Hello Ming,
> 
> hctx_may_queue() severely limits the queue depth if many LUNs are associated
> with the same SCSI host. I think that this is a performance regression
> compared to scsi-sq and that this performance regression should be fixed.

IMO, it is hard to evaluate/compare perf between scsi-mq vs scsi-sq:

- how many LUNs do you run IO on concurrently?
- evaluate the perf on single LUN or multi LUN?

BTW, active_queues is a runtime variable which accounts the actual active
queues in use.

-- 
Ming


Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold

2017-07-13 Thread Bart Van Assche
On Thu, 2017-07-13 at 18:43 +0800, Ming Lei wrote:
> On Wed, Jul 12, 2017 at 03:39:14PM +, Bart Van Assche wrote:
> > On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > > What happens with fluid congestion boundaries, with shared tags?
> > > 
> > > The approach in this patch should work, but the threshold may not
> > > be accurate in this way, one simple method is to use the average
> > > tag weight in EWMA, like this:
> > > 
> > >   sbitmap_weight() / hctx->tags->active_queues
> > 
> > Hello Ming,
> > 
> > That approach would result in a severe performance degradation. 
> > "active_queues"
> > namely represents the number of queues against which I/O ever has been 
> > queued.
> > If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs 
> > are
> > responding and if the queue depth would also be 64 then the approach you
> > proposed will reduce the effective queue depth per LUN from 64 to 1.
> 
> No, this approach does _not_ reduce the effective queue depth, it only
> stops the queue for a while when the queue is busy enough.
>
> In this case, there may not be any congestion, because blk-mq allows at most
> queue_depth/active_queues tags to be assigned to each LUN, please see
> hctx_may_queue().

Hello Ming,

hctx_may_queue() severely limits the queue depth if many LUNs are associated
with the same SCSI host. I think that this is a performance regression
compared to scsi-sq and that this performance regression should be fixed.
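
To make the arithmetic behind this concrete (a rough illustration of the
fair-share rule being discussed, not the exact in-kernel computation):

	unsigned int host_queue_depth = 64;	/* shared tag space of the host */
	unsigned int active_queues    = 64;	/* LUNs that have issued I/O    */

	/* the per-LUN allowance scales roughly as depth / active_queues */
	unsigned int per_lun_depth = host_queue_depth / active_queues;	/* = 1 */

which is the 64-to-1 reduction in effective per-LUN queue depth mentioned
earlier in the thread.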

Bart.

Re: [PATCH] blk-mq-debugfs: add mapping show for hw queue to cpu

2017-07-13 Thread weiping zhang
On Wed, Jul 12, 2017 at 11:25:12AM -0600, Jens Axboe wrote:
> On 07/12/2017 11:13 AM, weiping zhang wrote:
> > On Wed, Jul 12, 2017 at 10:57:37AM -0600, Jens Axboe wrote:
> >> On 07/12/2017 10:54 AM, weiping zhang wrote:
> >>> A mapping show as following:
> >>>
> >>> hctx  cpus
> >>> hctx0   0   1
> >>> hctx1   2
> >>> hctx2   3
> >>> hctx3   4   5
> >>
> >> We already have that information in the /sys/block//mq/X/cpu_list
> >>
> >> where X is the hardware queue number. Why do we need it in debugfs as
> >> well, presented differently?
> > 
> > this attribute gives a more obvious view, with only one "cat" command.
> > /sys/block//mq/X/cpu_list does not easily give the mapping for all
> > hctxs with one command.
> > also /sys/kernel/debug/block/xxx/hctxN/cpuX can export that info, but
> > neither of these two methods is as obvious.
> 
> The point is the information is there, and I'm sure most people can
> work out the shell logic to collect this info. I see zero benefit to
> adding this to debugfs.
> 

Hi Jens,

It seems nobody else wants this patch, so please skip it. Thanks for
your reply ^_^.

--
weiping


Re: [PATCH] nbd: kill unused ret in recv_work

2017-07-13 Thread Jens Axboe
On 07/13/2017 05:20 AM, Kefeng Wang wrote:
> No need to return a value from queued work, so kill the ret variable.

Applied, thanks.

-- 
Jens Axboe



Re: [PATCH] elevator: allow specifying elevator by config boot param for blk-mq devices

2017-07-13 Thread Jens Axboe
On 07/13/2017 04:11 AM, Johannes Thumshirn wrote:
> On Wed, Jul 12, 2017 at 04:51:33PM -0600, Jens Axboe wrote:
>> No, that boot option was a mistake, let's not propagate that to mq
>> scheduling as well.
> 
> Can you please explain why? I as well have requests from our users to
> select the mq schedulers on the command line.

Because it sucks as an interface - there's no way to apply different
settings to different devices, and using the kernel command line to
control something like this is ugly. It never should have been done.

The sysfs interface, either manually, scripted, or through udev,
makes a lot more sense.

-- 
Jens Axboe



Re: [PATCH] elevator: allow specifying elevator by config boot param for blk-mq devices

2017-07-13 Thread Johannes Thumshirn
On Wed, Jul 12, 2017 at 04:51:33PM -0600, Jens Axboe wrote:
> No, that boot option was a mistake, let's not propagate that to mq
> scheduling as well.

Can you please explain why? I as well have requests from our users to
select the mq schedulers on the command line.

Thanks,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH] nbd: kill unused ret in recv_work

2017-07-13 Thread Kefeng Wang
No need to return a value from queued work, so kill the ret variable.

Signed-off-by: Kefeng Wang 
---
 drivers/block/nbd.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index dea7d85..87a0a29 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -626,7 +626,6 @@ static void recv_work(struct work_struct *work)
struct nbd_device *nbd = args->nbd;
struct nbd_config *config = nbd->config;
struct nbd_cmd *cmd;
-   int ret = 0;
 
while (1) {
cmd = nbd_read_stat(nbd, args->index);
@@ -636,7 +635,6 @@ static void recv_work(struct work_struct *work)
mutex_lock(&nsock->tx_lock);
nbd_mark_nsock_dead(nbd, nsock, 1);
mutex_unlock(&nsock->tx_lock);
-   ret = PTR_ERR(cmd);
break;
}
 
-- 
1.8.3.1



Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold

2017-07-13 Thread Ming Lei
On Wed, Jul 12, 2017 at 03:39:14PM +, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > What happens with fluid congestion boundaries, with shared tags?
> > 
> > The approach in this patch should work, but the threshold may not
> > be accurate in this way, one simple method is to use the average
> > tag weight in EWMA, like this:
> > 
> > sbitmap_weight() / hctx->tags->active_queues
> 
> Hello Ming,
> 
> That approach would result in a severe performance degradation. 
> "active_queues"
> namely represents the number of queues against which I/O ever has been queued.
> If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs 
> are
> responding and if the queue depth would also be 64 then the approach you
> proposed will reduce the effective queue depth per LUN from 64 to 1.

No, this approach does _not_ reduce the effective queue depth; it only
stops the queue for a while when the queue is busy enough.

In this case, there may not be any congestion, because blk-mq allows at most
queue_depth/active_queues tags to be assigned to each LUN, please see
hctx_may_queue().
Then get_driver_tag() can return at most one pending tag to the
request_queue (LUN).

The algorithm in this patch only starts to work when congestion happens,
that is, it is only run when BLK_STS_RESOURCE is returned from .queue_rq().
This approach avoids dispatching requests to a busy queue unnecessarily,
so that we don't heat up the CPU unnecessarily, and merging improves in the
meantime.
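
For reference, the kind of integer EWMA update the patch title refers to
looks roughly like this (an illustrative sketch only; the weight and the
names below are made up, not taken from the actual patch):

static unsigned int ewma_update(unsigned int avg, unsigned int sample)
{
	const unsigned int weight = 8;	/* keep 7/8 history, take 1/8 new sample */

	if (!avg)
		return sample;
	return (avg * (weight - 1) + sample) / weight;
}

Fed, for example, with the number of in-flight requests observed each time
.queue_rq() returns BLK_STS_RESOURCE, the running average can then serve as
the congestion threshold that decides when to stop dispatching for a while.
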

-- 
Ming


Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()

2017-07-13 Thread Ming Lei
On Wed, Jul 12, 2017 at 03:12:37PM +, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 11:15 +0800, Ming Lei wrote:
> > On Tue, Jul 11, 2017 at 07:57:53PM +, Bart Van Assche wrote:
> > > On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> > > > Now SCSI won't stop queue, and not necessary to use
> > > > blk_mq_start_hw_queues(), so switch to blk_mq_run_hw_queues()
> > > > instead.
> > > > 
> > > > Cc: "James E.J. Bottomley" 
> > > > Cc: "Martin K. Petersen" 
> > > > Cc: linux-s...@vger.kernel.org
> > > > Signed-off-by: Ming Lei 
> > > > ---
> > > >  drivers/scsi/scsi_lib.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index f6097b89d5d3..91d890356b78 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -333,7 +333,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> > > >  static void scsi_kick_queue(struct request_queue *q)
> > > >  {
> > > > if (q->mq_ops)
> > > > -   blk_mq_start_hw_queues(q);
> > > > +   blk_mq_run_hw_queues(q, false);
> > > > else
> > > > blk_run_queue(q);
> > > >  }
> > > 
> > > Hello Ming,
> > > 
> > > Now that we have separate flags to represent the "stopped" and "quiesced"
> > > states, wouldn't it be better not to modify scsi_kick_queue() but instead 
> > > to
> > > stop a SCSI hardware queue again if scsi_queue_rq() returns 
> > > BLK_STS_RESOURCE?
> > > See also commits 36e3cf273977 ("scsi: Avoid that SCSI queues get stuck") 
> > > and
> > > commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped 
> > > queues").
> > 
> > As you can see in the following patches, all stop/start queue APIs will
> > be removed, and the 'stopped' state will become an internal state.
> > 
> > It doesn't make sense to clear 'stopped' for scsi anymore, since the
> > queue won't actually be stopped. So we need this change.
> 
> Hello Ming,
> 
> As Jens already noticed, this approach won't work properly for concurrent I/O
> to multiple LUNs associated with the same SCSI host. This approach won't work
> properly for dm-mpath either. Sorry but I'm not convinced that it's possible

We can deal with special cases by passing a flag, so that we don't need
to expose the 'stopped' flag to drivers. Again, it has caused lots of
trouble.

If you check Linus' tree, most of the start/stop queue API has already
been replaced with quiesce/unquiesce. We can remove the others too.

> to replace the stop/start queue API for all block drivers by an algorithm
> that is based on estimating the queue depth.

Anyway this patch is correct, and it doesn't make sense to clear 'stopped'
in SCSI since SCSI doesn't stop the queue at all, even without introducing
congestion control, right?

Please discuss the congestion control in patches 4 and 5.

-- 
Ming


Re: [PATCH 16/19] bcache: increase the number of open buckets

2017-07-13 Thread Coly Li
On 2017/7/1 at 4:43 AM, bca...@lists.ewheeler.net wrote:
> From: Tang Junhui 
> 
> Currently we only allocate 6 open buckets for each cache set, but
> usually we attach about 10 or so backing devices to each cache set,
> and each bcache device is accessed by about 10 or so threads in the
> top application layer. So 6 open buckets are too few; it leads to the
> same thread writing data to different buckets, which causes
> inefficient write-back, makes bucket usage inefficient, and makes the
> buckets very easy to run out of.
> 
> I added a debug message in bch_open_buckets_alloc() to print bucket
> allocation info, and tested with ten bcache devices on one cache set,
> each bcache device accessed by ten threads.
> 
> From the debug messages we can see that, after the modification, one
> bucket is more likely to be assigned to the same thread, and data from
> the same thread is more likely to be written to the same bucket. Usually
> the same thread always writes/reads the same backing device, so this is
> good for write-back and also improves the usage efficiency of buckets.
> 
> Signed-off-by: Tang Junhui 

Nice catch for performance !

Reviewed-by: Coly Li 

Thanks.


> ---
>  drivers/md/bcache/alloc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> index ca4abe1..cacbe2d 100644
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -68,6 +68,8 @@
>  #include 
>  #include 
>  
> +#define MAX_OPEN_BUCKETS 128
> +
>  /* Bucket heap / gen */
>  
>  uint8_t bch_inc_gen(struct cache *ca, struct bucket *b)
> @@ -671,7 +673,7 @@ int bch_open_buckets_alloc(struct cache_set *c)
>  
>   spin_lock_init(&c->data_bucket_lock);
>  
> - for (i = 0; i < 6; i++) {
> + for (i = 0; i < MAX_OPEN_BUCKETS; i++) {
>   struct open_bucket *b = kzalloc(sizeof(*b), GFP_KERNEL);
>   if (!b)
>   return -ENOMEM;
> 


-- 
Coly Li


Re: [PATCH 18/19] bcache: silence static checker warning

2017-07-13 Thread Coly Li
On 2017/7/1 at 4:43 AM, bca...@lists.ewheeler.net wrote:
> From: Dan Carpenter 
> 
> In olden times, closure_return() used to have a hidden return built in.
> We removed the hidden return but forgot to add a new return here.  If
> "c" were NULL we would oops on the next line, but fortunately "c" is
> never NULL.  Let's just remove the if statement.
> 
> Signed-off-by: Dan Carpenter 
> ---
>  drivers/md/bcache/super.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 24cb9b7..243391d 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1381,9 +1381,6 @@ static void cache_set_flush(struct closure *cl)
>   struct btree *b;
>   unsigned i;
>  
> - if (!c)
> - closure_return(cl);
> -
>   bch_cache_accounting_destroy(&c->accounting);
>  
>   kobject_put(&c->internal);
> 
Agreed, cache_set_flush() is only called from a continue_at() in
__cache_set_unregister(). In this case, cl is never NULL.

Reviewed-by: Coly Li 

Thanks.

-- 
Coly Li


Re: [PATCH 2/2] md: raid1/raid10: initialize bvec table via bio_add_page()

2017-07-13 Thread Ming Lei
On Thu, Jul 13, 2017 at 01:09:28PM +1000, NeilBrown wrote:
> On Thu, Jul 13 2017, Ming Lei wrote:
> 
> > On Thu, Jul 13, 2017 at 10:01:33AM +1000, NeilBrown wrote:
> >> On Wed, Jul 12 2017, Ming Lei wrote:
> >> 
> >> > We will support multipage bvec soon, so initialize bvec
> >> > table using the standard way instead of writing the
> >> > table directly. Otherwise it won't work any more once
> >> > multipage bvec is enabled.
> >> >
> >> > Acked-by: Guoqing Jiang 
> >> > Signed-off-by: Ming Lei 
> >> > ---
> >> >  drivers/md/md.c | 21 +
> >> >  drivers/md/md.h |  3 +++
> >> >  drivers/md/raid1.c  | 16 ++--
> >> >  drivers/md/raid10.c |  4 ++--
> >> >  4 files changed, 28 insertions(+), 16 deletions(-)
> >> >
> >> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> > index 8cdca0296749..cc8dcd928dde 100644
> >> > --- a/drivers/md/md.c
> >> > +++ b/drivers/md/md.c
> >> > @@ -9130,6 +9130,27 @@ void md_reload_sb(struct mddev *mddev, int nr)
> >> >  }
> >> >  EXPORT_SYMBOL(md_reload_sb);
> >> >  
> >> > +/* generally called after bio_reset() for reseting bvec */
> >> > +void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
> >> > +   int size)
> >> > +{
> >> > +int idx = 0;
> >> > +
> >> > +/* initialize bvec table again */
> >> > +do {
> >> > +struct page *page = resync_fetch_page(rp, idx);
> >> > +int len = min_t(int, size, PAGE_SIZE);
> >> > +
> >> > +/*
> >> > + * won't fail because the vec table is big
> >> > + * enough to hold all these pages
> >> > + */
> >> > +bio_add_page(bio, page, len, 0);
> >> > +size -= len;
> >> > +} while (idx++ < RESYNC_PAGES && size > 0);
> >> > +}
> >> > +EXPORT_SYMBOL(md_bio_reset_resync_pages);
> >> 
> >> I really don't think this is a good idea.
> >> This code is specific to raid1/raid10.  It is not generic md code.  So
> >> it doesn't belong here.
> >
> > We already added 'struct resync_pages' in drivers/md/md.h, so I think
> > it is reasonable to add this function into drivers/md/md.c
> 
> Alternative perspective: it was unreasonable to add "resync_pages" to
> md.h.
> 
> >
> >> 
> >> If you want to remove code duplication, then work on moving all raid1
> >> functionality into raid10.c, then discard raid1.c
> >
> > This patch is for avoiding new code duplication, not for removing current
> > duplication.
> >
> >> 
> >> Or at the very least, have a separate "raid1-10.c" file for the common
> >> code.
> >
> > You suggested it last time, but looks too overkill to be taken. But if
> > someone wants to refactor raid1 and raid10, I think it can be a good start,
> > but still not belong to this patch.
> 
> You are trying to create common code for raid1 and raid10.  This does
> not belong in md.c.
> If you really want to have a single copy of common code, then it exactly
> is the role of this patch to create a place to put it.
> I'm not saying you should put all common code in raid1-10.c.   Just the
> function that you have identified.

I really don't want to waste time on this kind of thing, I can do
either one frankly.

Shaohua, could you tell me which way you would like it merged? I can do it
either way.

-- 
Ming


Re: [PATCH 2/2] block: delete bio_uninit

2017-07-13 Thread Christoph Hellwig
>  static void bio_free(struct bio *bio)
>  {
>   struct bio_set *bs = bio->bi_pool;
>   void *p;
>  
> - bio_uninit(bio);
> + bio_disassociate_task(bio);

As said in the last mail I think there is no point in having this call..

> @@ -294,7 +289,7 @@ void bio_reset(struct bio *bio)
>  {
>   unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);
>  
> - bio_uninit(bio);
> + bio_disassociate_task(bio);

.. and this one.  And I suspect it would make sense to have all these
changes in a single patch.



Re: linux-next scsi-mq hang in suspend-resume

2017-07-13 Thread Christoph Hellwig
On Wed, Jul 12, 2017 at 10:50:19AM -0600, Jens Axboe wrote:
> On 07/12/2017 08:51 AM, Tomi Sarvela wrote:
> > Hello there,
> > 
> > I've been running Intel GFX CI testing for linux DRM-Tip i915 driver, 
> > and a couple of weeks ago we took linux-next for a ride to see what kind 
> > of integration problems there might pop up when pulling 4.13-rc1. 
> > Latest results can be seen at
> > 
> > https://intel-gfx-ci.01.org/CI/next-issues.html
> > https://intel-gfx-ci.01.org/CI/next-all.html
> > 
> > The purple blocks are hangs, starting from 20170628 (20170627 was 
> > untestable due to locking changes which were reverted). Traces were 
> > pointing to ext4 but bisecting between good 20170626 and bad 20170628 
> > pointed to:
> > 
> > commit 5c279bd9e40624f4ab6e688671026d6005b066fa
> > Date:   Fri Jun 16 10:27:55 2017 +0200
> > 
> > scsi: default to scsi-mq
> > 
> > Reproduction is 100% or close to it when running two i-g-t tests as a 
> > testlist. I'm assuming that it creates the correct amount or pattern 
> > of actions to the device. The testlist consists of the following 
> > lines:
> > 
> > igt@gem_exec_gttfill@basic
> > igt@gem_exec_suspend@basic-s3
> > 
> > Kernel option scsi_mod.use_blk_mq=0 hides the issue on testhosts. 
> > The configuration option was copied over on the testhosts and 20170712
> > was re-tested, that's why today looks so much greener.
> > 
> > More information including traces and reproduction instructions at
> > https://bugzilla.kernel.org/show_bug.cgi?id=196223
> > 
> > I can run patchsets through the farm, if needed. In addition, daily 
> > linux-next tags are automatically tested and results published.
> 
> Christoph, any ideas? Smells like something in SCSI, my notebook
> with nvme/blk-mq suspend/resumes just fine.

There isn't much mq-specific scsi code, so it's probably an interaction
of both.  I'll see if the bugzilla has enough data to reproduce it
locally.

Although I really wish people wouldn't use #TY^$Y^$ bugzilla and just
post the important data to the list :(

> 
> -- 
> Jens Axboe
> 
---end quoted text---