how to enlarge value of max_sectors_kb

2018-01-12 Thread tang . junhui
From: Tang Junhui 

There is a machine with a very small max_sectors_kb:
[root@ceph151 queue]# pwd
/sys/block/sdd/queue
[root@ceph151 queue]# cat max_hw_sectors_kb 
256
[root@ceph151 queue]# cat max_sectors_kb 
256

Performance is very low when I run big I/Os.
I cannot modify it directly, and I need it a little bigger,
so how can I enlarge it?

the system info:

[root@ceph151 queue]# uname -r
3.10.0-327.22.2.el7.x86_64

[root@ceph151 queue]# lsmod |grep scsi
scsi_transport_fc  64056  1 lpfc
scsi_tgt   20027  1 scsi_transport_fc

[root@ceph151 queue]# lsscsi
[0:0:0:0]    disk    ASR7805  system            V1.0  /dev/sda 
[0:1:8:0]    disk    ATA      Hitachi HUA72201  JP4O  /dev/sdb 
[0:1:9:0]    disk    ATA      Hitachi HUA72201  JP4O  /dev/sdc 
[0:1:10:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdd 
[0:1:11:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sde 
[0:1:12:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdf 
[0:1:13:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdg 
[0:1:14:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdh 
[0:1:15:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdi 
[0:1:16:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdj 
[0:1:17:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdk 
[0:1:18:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdl 
[0:1:19:0]   disk    ATA      Hitachi HUA72201  JP4O  /dev/sdm 
[0:1:20:0]   disk    TOSHIBA  AL13SEB600        0101  -
[0:1:21:0]   disk    TOSHIBA  AL13SEB600        0101  -
[0:3:0:0]    enclosu ZTE CORP 720381 EvBd 255   1     - 

Any comment would be welcome.

Thanks,
Tang Junhui


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  8:00pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 19:52 -0500, Mike Snitzer wrote:
> > It was 50 ms before it was 100 ms.  No real explanation for these
> > values other than they seem to make Bart's IB SRP testbed happy?
> 
> But that constant was not introduced by me in the dm code.

No, actually it was (not that there's anything wrong with that):

commit 06eb061f48594aa369f6e852b352410298b317a8
Author: Bart Van Assche 
Date:   Fri Apr 7 16:50:44 2017 -0700

dm mpath: requeue after a small delay if blk_get_request() fails

If blk_get_request() returns ENODEV then multipath_clone_and_map()
causes a request to be requeued immediately. This can cause a kworker
thread to spend 100% of the CPU time of a single core in
__blk_mq_run_hw_queue() and also can cause device removal to never
finish.

Avoid this by only requeuing after a delay if blk_get_request() fails.
Additionally, reduce the requeue delay.

Cc: sta...@vger.kernel.org # 4.9+
Signed-off-by: Bart Van Assche 
Signed-off-by: Mike Snitzer 

Note that this commit actually details a different case where a
blk_get_request() (in existing code) return of -ENODEV is a very
compelling case to use DM_MAPIO_DELAY_REQUEUE.

So I'll revisit what is appropriate in multipath_clone_and_map() on
Monday.

I need a break... have a good weekend Bart.


Re: [PATCH V3 2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure

2018-01-12 Thread Ming Lei
On Fri, Jan 12, 2018 at 07:04:28PM +, Bart Van Assche wrote:
> On Thu, 2018-01-11 at 14:01 +0800, Ming Lei wrote:
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 86bf502a8e51..fcddf5a62581 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target 
> > *ti, struct request *rq,
> > if (queue_dying) {
> > atomic_inc(&m->pg_init_in_progress);
> > activate_or_offline_path(pgpath);
> > +   return DM_MAPIO_DELAY_REQUEUE;
> > }
> > -   return DM_MAPIO_DELAY_REQUEUE;
> > +
> > +   /*
> > +* blk-mq's SCHED_RESTART can cover this requeue, so
> > +* we needn't to deal with it by DELAY_REQUEUE. More
> > +* importantly, we have to return DM_MAPIO_REQUEUE
> > +* so that blk-mq can get the queue busy feedback,
> > +* otherwise I/O merge can be hurt.
> > +*/
> > +   if (q->mq_ops)
> > +   return DM_MAPIO_REQUEUE;
> > +   else
> > +   return DM_MAPIO_DELAY_REQUEUE;
> > }
> 
> Sorry but the approach of this patch looks wrong to me. I'm afraid that this
> approach will cause 100% CPU consumption if the underlying .queue_rq()
> function returns BLK_STS_RESOURCE for another reason than what can be detected
> by the .get_budget() call. This can happen if e.g. a SCSI LLD .queuecommand()
> implementation returns SCSI_MLQUEUE_HOST_BUSY. Many SCSI LLDs can do this:
> $ git grep 'SCSI_MLQUEUE_[^_]*_BUSY' | wc -l
> 204
> 
> Isn't this a severe regression?

No, it is not.

This DM_MAPIO_REQUEUE will be converted to BLK_STS_RESOURCE in dm-rq, and
then returned to blk-mq. That means that when we run out of resources for
dispatching the underlying request, we tell blk-mq the truth. This
approach is already used by other blk-mq drivers such as NVMe; more
importantly, we have to provide this real feedback to blk-mq to fix
the performance issue.

blk_get_request(GFP_ATOMIC) is just like kmalloc(GFP_ATOMIC),
and memory is shared system-wide too. There are lots of kmalloc(GFP_ATOMIC)
calls in the NVMe I/O path; when one fails, BLK_STS_RESOURCE is returned
to blk-mq.

blk-mq has a built-in SCHED_RESTART mechanism to handle running out of
resources when BLK_STS_RESOURCE is observed.

We implement .get_budget() only in SCSI, and other blk-mq drivers (NVMe,
...) often return BLK_STS_RESOURCE to blk-mq too, so wrt. this usage there
isn't any difference between dm-mpath and the other cases, right?


Thanks,
Ming


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 19:52 -0500, Mike Snitzer wrote:
> It was 50 ms before it was 100 ms.  No real explanation for these
> values other than they seem to make Bart's IB SRP testbed happy?

But that constant was not introduced by me in the dm code. See e.g. the
following commits:

commit d548b34b062b60b4f4df295a0b4823dfda1f1fc4
Author: Mike Snitzer 
Date:   Thu Mar 5 22:21:10 2015 -0500

dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms
 
Commit 7eaceaccab ("block: remove per-queue plugging") didn't justify
DM's use of a 100ms delay; such an extended delay is a liability when
there is reason to re-kick the queue.
 
Signed-off-by: Mike Snitzer 

commit 7eaceaccab5f40bbfda044629a6298616aeaed50
Author: Jens Axboe 
Date:   Thu Mar 10 08:52:07 2011 +0100

block: remove per-queue plugging
 
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
 
Signed-off-by: Jens Axboe 

Bart. 

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  2:53pm -0500,
Elliott, Robert (Persistent Memory)  wrote:

> 
> 
> > -Original Message-
> > From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> > ow...@vger.kernel.org] On Behalf Of Bart Van Assche
> ...
> > The intention of commit 6077c2d706097c0 was to address the last mentioned
> > case. It may be possible to move the delayed queue rerun from the
> > dm_queue_rq() into dm_requeue_original_request(). But I think it would be
> > wrong to rerun the queue immediately in case a SCSI target system returns
> > "BUSY".
> 
> Seeing SCSI BUSY mentioned here...
> 
> On its own, patch 6077c2d706 seems to be adding an arbitrarily selected
> magic value of 100 ms without explanation in the patch description or
> in the added code.

It was 50 ms before it was 100 ms.  No real explanation for these
values other than they seem to make Bart's IB SRP testbed happy?
 
> But, dm_request_original_request() already seems to have chosen that value
> for similar purposes:
> unsigned long delay_ms = delay_requeue ? 100 : 0;
> 
> So, making them share a #define would indicate there's a reason for that
> particular value.  Any change to the value would be picked up everywhere.
> 
> All the other callers of blk_mq_delay_run_hw_queue() use macros:
> drivers/md/dm-rq.c: blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> drivers/nvme/host/fc.c: blk_mq_delay_run_hw_queue(queue->hctx, 
> NVMEFC_QUEUE_DELAY);
> drivers/scsi/scsi_lib.c:blk_mq_delay_run_hw_queue(hctx, 
> SCSI_QUEUE_DELAY);
> drivers/scsi/scsi_lib.c:
> blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);

Sure, I'll add a #define.

> Those both happen to be set to 3 (ms).

As for the value of 100: we're dealing with path faults, so retrying them
extremely fast could be wasted effort.  But there is obviously no one
size fits all.  As storage gets faster, 100 ms could prove to be very
bad.

But without tests to verify a change, there won't be one.

Thanks,
Mike


Re: [Drbd-dev] [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout

2018-01-12 Thread Eric Wheeler
Hello All,

We just noticed that discards to DRBD devices backed by dm-thin devices 
are fully allocating the thin blocks.

This behavior does not exist before 
ee472d83 block: add a flags argument to (__)blkdev_issue_zeroout

The problem exists somewhere between
[working] c20cfc27 block: stop using blkdev_issue_write_same for zeroing
  and
[broken]  45c21793 drbd: implement REQ_OP_WRITE_ZEROES

Note that c20cfc27 works as expected, but with 45c21793 the blocks 
being discarded are zeroed on the dm-thin backing device. All commits between 
those two produce the following error:

blkdiscard: /dev/drbd/by-res/test: BLKDISCARD ioctl failed: Input/output error

Also note that issuing a blkdiscard to the backing device directly 
discards as you would expect. This is just a problem when sending discards 
through DRBD.

Is there an easy way to solve this in the short term, even if the ultimate 
fix is more involved?

Thank you for your help!

-Eric

--
Eric Wheeler

On Wed, 5 Apr 2017, Christoph Hellwig wrote:

> drbd always wants its discard wire operations to zero the blocks, so
> use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
> reinventing it poorly.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Hannes Reinecke 
> ---
>  drivers/block/drbd/drbd_debugfs.c  |   3 --
>  drivers/block/drbd/drbd_int.h  |   6 ---
>  drivers/block/drbd/drbd_receiver.c | 102 ++---
>  drivers/block/drbd/drbd_req.c  |   6 +--
>  4 files changed, 7 insertions(+), 110 deletions(-)
> 
> diff --git a/drivers/block/drbd/drbd_debugfs.c 
> b/drivers/block/drbd/drbd_debugfs.c
> index de5c3ee8a790..494837e59f23 100644
> --- a/drivers/block/drbd/drbd_debugfs.c
> +++ b/drivers/block/drbd/drbd_debugfs.c
> @@ -236,9 +236,6 @@ static void seq_print_peer_request_flags(struct seq_file 
> *m, struct drbd_peer_re
>   seq_print_rq_state_bit(m, f & EE_CALL_AL_COMPLETE_IO, , "in-AL");
>   seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, , "C");
>   seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, , "set-in-sync");
> -
> - if (f & EE_IS_TRIM)
> - __seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, , 
> "zero-out", "trim");
>   seq_print_rq_state_bit(m, f & EE_WRITE_SAME, , "write-same");
>   seq_putc(m, '\n');
>  }
> diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
> index 724d1c50fc52..d5da45bb03a6 100644
> --- a/drivers/block/drbd/drbd_int.h
> +++ b/drivers/block/drbd/drbd_int.h
> @@ -437,9 +437,6 @@ enum {
>  
>   /* is this a TRIM aka REQ_DISCARD? */
>   __EE_IS_TRIM,
> - /* our lower level cannot handle trim,
> -  * and we want to fall back to zeroout instead */
> - __EE_IS_TRIM_USE_ZEROOUT,
>  
>   /* In case a barrier failed,
>* we need to resubmit without the barrier flag. */
> @@ -482,7 +479,6 @@ enum {
>  #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
>  #define EE_MAY_SET_IN_SYNC (1<<__EE_MAY_SET_IN_SYNC)
>  #define EE_IS_TRIM (1<<__EE_IS_TRIM)
> -#define EE_IS_TRIM_USE_ZEROOUT (1<<__EE_IS_TRIM_USE_ZEROOUT)
>  #define EE_RESUBMITTED (1<<__EE_RESUBMITTED)
>  #define EE_WAS_ERROR   (1<<__EE_WAS_ERROR)
>  #define EE_HAS_DIGEST  (1<<__EE_HAS_DIGEST)
> @@ -1561,8 +1557,6 @@ extern void start_resync_timer_fn(unsigned long data);
>  extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req);
>  
>  /* drbd_receiver.c */
> -extern int drbd_issue_discard_or_zero_out(struct drbd_device *device,
> - sector_t start, unsigned int nr_sectors, bool discard);
>  extern int drbd_receiver(struct drbd_thread *thi);
>  extern int drbd_ack_receiver(struct drbd_thread *thi);
>  extern void drbd_send_ping_wf(struct work_struct *ws);
> diff --git a/drivers/block/drbd/drbd_receiver.c 
> b/drivers/block/drbd/drbd_receiver.c
> index dc9a6dcd431c..bc1d296581f9 100644
> --- a/drivers/block/drbd/drbd_receiver.c
> +++ b/drivers/block/drbd/drbd_receiver.c
> @@ -1448,108 +1448,14 @@ void drbd_bump_write_ordering(struct drbd_resource 
> *resource, struct drbd_backin
>   drbd_info(resource, "Method to ensure write ordering: %s\n", 
> write_ordering_str[resource->write_ordering]);
>  }
>  
> -/*
> - * We *may* ignore the discard-zeroes-data setting, if so configured.
> - *
> - * Assumption is that it "discard_zeroes_data=0" is only because the backend
> - * may ignore partial unaligned discards.
> - *
> - * LVM/DM thin as of at least
> - *   LVM version: 2.02.115(2)-RHEL7 (2015-01-28)
> - *   Library version: 1.02.93-RHEL7 (2015-01-28)
> - *   Driver version:  4.29.0
> - * still behaves this way.
> - *
> - * For unaligned (wrt. alignment and granularity) or too small discards,
> - * we zero-out the initial (and/or) trailing unaligned partial chunks,
> - * but discard all the aligned full chunks.
> - *
> - * At least for LVM/DM thin, the result is effectively 
> "discard_zeroes_data=1".
> - */
> -int 

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  6:42pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 18:17 -0500, Mike Snitzer wrote:
> > @@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, 
> > struct request *clone,
> > if (error && blk_path_error(error)) {
> > struct multipath *m = ti->private;
> >  
> > -   r = DM_ENDIO_REQUEUE;
> > +   if (r == BLK_STS_RESOURCE)
> > +   r = DM_ENDIO_DELAY_REQUEUE;
> > +   else
> > +   r = DM_ENDIO_REQUEUE;
> 
> Did you perhaps intend "error == BLK_STS_RESOURCE"?

Yes, it was a quick patch to get your thoughts.

> 
> > diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> > index 9ba8453..da83f64 100644
> > --- a/include/linux/device-mapper.h
> > +++ b/include/linux/device-mapper.h
> > @@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
> >  #define DM_ENDIO_DONE  0
> >  #define DM_ENDIO_INCOMPLETE1
> >  #define DM_ENDIO_REQUEUE   2
> > +#define DM_ENDIO_DELAY_REQUEUE 3
> >  
> >  /*
> >   * Definitions of return values from target map function.
> > @@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
> >  #define DM_MAPIO_SUBMITTED 0
> >  #define DM_MAPIO_REMAPPED  1
> >  #define DM_MAPIO_REQUEUE   DM_ENDIO_REQUEUE
> > -#define DM_MAPIO_DELAY_REQUEUE 3
> > +#define DM_MAPIO_DELAY_REQUEUE DM_ENDIO_DELAY_REQUEUE
> >  #define DM_MAPIO_KILL  4
> >  
> >  #define dm_sector_div64(x, y)( \
> 
> Please consider to introduce enumeration types for the DM_ENDIO_* and the
> DM_MAPIO_* constants such that the compiler can catch what I reported above.

OK, point taken.

> Otherwise this patch looks fine to me.

Cool, thanks.

Mike


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 18:17 -0500, Mike Snitzer wrote:
> @@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, 
> struct request *clone,
>   if (error && blk_path_error(error)) {
>   struct multipath *m = ti->private;
>  
> - r = DM_ENDIO_REQUEUE;
> + if (r == BLK_STS_RESOURCE)
> + r = DM_ENDIO_DELAY_REQUEUE;
> + else
> + r = DM_ENDIO_REQUEUE;

Did you perhaps intend "error == BLK_STS_RESOURCE"?

> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 9ba8453..da83f64 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
>  #define DM_ENDIO_DONE0
>  #define DM_ENDIO_INCOMPLETE  1
>  #define DM_ENDIO_REQUEUE 2
> +#define DM_ENDIO_DELAY_REQUEUE   3
>  
>  /*
>   * Definitions of return values from target map function.
> @@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
>  #define DM_MAPIO_SUBMITTED   0
>  #define DM_MAPIO_REMAPPED1
>  #define DM_MAPIO_REQUEUE DM_ENDIO_REQUEUE
> -#define DM_MAPIO_DELAY_REQUEUE   3
> +#define DM_MAPIO_DELAY_REQUEUE   DM_ENDIO_DELAY_REQUEUE
>  #define DM_MAPIO_KILL4
>  
>  #define dm_sector_div64(x, y)( \

Please consider to introduce enumeration types for the DM_ENDIO_* and the
DM_MAPIO_* constants such that the compiler can catch what I reported above.
Otherwise this patch looks fine to me.

Thanks,

Bart.

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  1:54pm -0500,
Bart Van Assche  wrote:
 
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

This is my 3rd reply to this email.. 3rd time's the charm? ;)

Here is a patch that will kick the hw queues via blk_mq_requeue_work(),
indirectly from dm-rq.c:__dm_mq_kick_requeue_list(), after a delay if
BLK_STS_RESOURCE is returned from the target.

Your thoughts on this patch as an alternative to commit 6077c2d7060 ?

Thanks,
Mike

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index d8c7259..ab2cfdc 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct 
request *clone,
if (error && blk_path_error(error)) {
struct multipath *m = ti->private;
 
-   r = DM_ENDIO_REQUEUE;
+   if (r == BLK_STS_RESOURCE)
+   r = DM_ENDIO_DELAY_REQUEUE;
+   else
+   r = DM_ENDIO_REQUEUE;
 
if (pgpath)
fail_path(pgpath);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6b01298..ab0ed2d 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -315,6 +315,10 @@ static void dm_done(struct request *clone, blk_status_t 
error, bool mapped)
/* The target wants to requeue the I/O */
dm_requeue_original_request(tio, false);
break;
+   case DM_ENDIO_DELAY_REQUEUE:
+   /* The target wants to requeue the I/O after a delay */
+   dm_requeue_original_request(tio, true);
+   break;
default:
DMWARN("unimplemented target endio return value: %d", r);
BUG();
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 9ba8453..da83f64 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
 #define DM_ENDIO_DONE  0
 #define DM_ENDIO_INCOMPLETE1
 #define DM_ENDIO_REQUEUE   2
+#define DM_ENDIO_DELAY_REQUEUE 3
 
 /*
  * Definitions of return values from target map function.
@@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
 #define DM_MAPIO_SUBMITTED 0
 #define DM_MAPIO_REMAPPED  1
 #define DM_MAPIO_REQUEUE   DM_ENDIO_REQUEUE
-#define DM_MAPIO_DELAY_REQUEUE 3
+#define DM_MAPIO_DELAY_REQUEUE DM_ENDIO_DELAY_REQUEUE
 #define DM_MAPIO_KILL  4
 
 #define dm_sector_div64(x, y)( \




Re: [PATCH v2 3/4] scsi: Avoid that .queuecommand() gets called for a quiesced SCSI device

2018-01-12 Thread Bart Van Assche
On Thu, 2018-01-11 at 10:23 +0800, Ming Lei wrote:
> > not sufficient to prevent .queuecommand() calls from scsi_send_eh_cmnd().
> 
> Given it is error handling, do we need to prevent the .queuecommand() call
> in scsi_send_eh_cmnd()? Could you share with us what the actual observed
> issue is from the user's point of view?

Please have a look at the kernel bug report in the description of patch 4/4 of
this series.

> > Hence surround the .queuecommand() call from the SCSI error handler with
> > blk_start_wait_if_quiesced() / blk_finish_wait_if_quiesced().
> > 
> > Note: converting the .queuecommand() call in scsi_send_eh_cmnd() into
> > code that calls blk_get_request(), e.g. scsi_execute_req(), is not an
> > option since scsi_send_eh_cmnd() can be called if all requests are
> > allocated and if no requests will make progress without aborting any
> > of these requests.
> 
> If we need to prevent the .queuecommand() in scsi_send_eh_cmnd(), what
> do you think of the approach by requeuing the EH command via
> scsi_mq_requeue_cmd()/scsi_requeue_cmd()? And the EH request will be
> dispatched finally when the queue becomes unquiesced or the STOPPED
> is cleared.

Calling scsi_mq_requeue_cmd() would trigger scsi_mq_uninit_cmd(),
blk_mq_put_driver_tag(), blk_mq_get_driver_tag() and scsi_mq_prep_fn().
That could cause scsi_eh_scmd_add() to be called multiple times for the
same SCSI command. However, that would break one of the assumptions
scsi_eh_scmd_add() is based on, namely that the function gets called
only once per SCSI command that is in progress.

Bart. 

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  1:54pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > OK, you have the stage: please give me a pointer to your best
> > explaination of the several.
> 
> Since the previous discussion about this topic occurred more than a month
> ago it could take more time to look up an explanation than to explain it
> again. Anyway, here we go. As you know a block layer request queue needs to
> be rerun if one or more requests are waiting and a previous condition that
> prevented the request to be executed has been cleared. For the dm-mpath
> driver, examples of such conditions are no tags available, a path that is
> busy (see also pgpath_busy()), path initialization that is in progress
> (pg_init_in_progress) or a request completes with status, e.g. if the
> SCSI core calls __blk_mq_end_request(req, error) with error != 0. For some
> of these conditions, e.g. path initialization completes, a callback
> function in the dm-mpath driver is called and it is possible to explicitly
> rerun the queue. I agree that for such scenario's a delayed queue run should
> not be triggered. For other scenario's, e.g. if a SCSI initiator submits a
> SCSI request over a fabric and the SCSI target replies with "BUSY" then the
> SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
> maximum number of retries has been reached (see also scsi_io_completion()).
> In that last case, if a SCSI target sends a "BUSY" reply over the wire back
> to the initiator, there is no other approach for the SCSI initiator to
> figure out whether it can queue another request than to resubmit the
> request. The worst possible strategy is to resubmit a request immediately
> because that will cause a significant fraction of the fabric bandwidth to
> be used just for replying "BUSY" to requests that can't be processed
> immediately.

The thing is, the 2 scenarios you are most concerned about have
_nothing_ to do with dm_mq_queue_rq() at all.  They occur as an error in
the IO path _after_ the request is successfully retrieved with
blk_get_request() and then submitted.
 
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case.

So it really makes me question why you think commit 6077c2d706097c0
addresses the issue you think it does.  And gives me more conviction to
remove 6077c2d706097c0.

It may help just by virtue of blindly kicking the queue if
blk_get_request() fails (regardless of whether the target is responding
with BUSY or not).  Very unsatisfying, to say the least.

I think it'd be much more beneficial for dm-mpath.c:multipath_end_io()
to be trained to be respond more intelligently to BLK_STS_RESOURCE.

E.g. set BLK_MQ_S_SCHED_RESTART if requests are known to be outstanding
on the path.  This is one case where Ming said the queue would be
re-run, as detailed in this header:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=5b18cff4baedde77e0d69bd62a13ae78f9488d89

And Jens has reinforced to me that BLK_MQ_S_SCHED_RESTART is a means to
kicking the queue more efficiently.  _BUT_ I'm not seeing any external
blk-mq interface that exposes this capability to a blk-mq driver.  As is
BLK_MQ_S_SCHED_RESTART gets set very much in the bowels of blk-mq
(blk_mq_sched_mark_restart_hctx).

SO I have to do more homework here...

Ming or Jens: might you be able to shed some light on how dm-mpath
would/could set BLK_MQ_S_SCHED_RESTART?  A new function added that can
be called from q->softirq_done_fn in response to BLK_STS_RESOURCE?  Or
is this level of control (and blk-mq detail) something that a blk-mq
driver should need to care about?

Anyway, we may need to get to that level of blk-mq driver controlled
retry optimization.

But Bart: again this is a vastly different feedback loop than the
stabbing in the dark your commit 6077c2d706097c0 is doing.  

Mike


Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 15:05 -0700, Jens Axboe wrote:
> On 1/12/18 3:00 PM, Bart Van Assche wrote:
> > On Fri, 2018-01-12 at 14:55 -0700, Jens Axboe wrote:
> > > On 1/12/18 2:52 PM, Bart Van Assche wrote:
> > > > When debugging e.g. the SCSI timeout handler it is important that
> > > > requests that have not yet been started or that already have
> > > > completed are also reported through debugfs.
> > > > 
> > > > This patch depends on a patch that went upstream recently, namely
> > > > commit 14e3062fb185 ("scsi: core: Fix a scsi_show_rq() NULL pointer
> > > > dereference").
> > > 
> > > Why don't we just kill the check, and dump any request that has a
> > > matching hctx? We already know the bit was set, so just print
> > > all of them.
> > 
> > It is very helpful during debugging that requests owned by a block driver 
> > and
> > requests owned by the block layer core show up in different debugfs files.
> > Removing the check completely would cause all requests to show up in the 
> > same
> > debugfs file and would make interpreting the contents of these debugfs files
> > much harder.
> 
> Yeah, we'd have to make it just one file at that point. I'm not hugely
> against the queuelist check, but probably warrants a comment as it's not
> immediately clear (as opposed to the idle check, or the previous START
> bit check).

How about the below?

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 19db3f583bf1..da859dac442b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -394,6 +394,12 @@ struct show_busy_params {
 };
 
 /*
+ * Show "busy" requests - these are the requests owned by the block driver.
+ * The test list_empty(>queuelist) is used to figure out whether or not
+ * a request is owned by the block driver. That test works because the block
+ * layer core uses list_del_init() consistently to remove a request from one
+ * of the request lists.
+ *
  * Note: the state of a request may change while this function is in progress,
  * e.g. due to a concurrent blk_mq_finish_request() call.
  */
@@ -402,7 +408,7 @@ static void hctx_show_busy_rq(struct request *rq, void 
*data, bool reserved)
const struct show_busy_params *params = data;
 
if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-   blk_mq_rq_state(rq) != MQ_RQ_IDLE)
+   list_empty(>queuelist))
__blk_mq_debugfs_rq_show(params->m,
 list_entry_rq(>queuelist));
 }

Thanks,

Bart.

Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started

2018-01-12 Thread Jens Axboe
On 1/12/18 3:00 PM, Bart Van Assche wrote:
> On Fri, 2018-01-12 at 14:55 -0700, Jens Axboe wrote:
>> On 1/12/18 2:52 PM, Bart Van Assche wrote:
>>> When debugging e.g. the SCSI timeout handler it is important that
>>> requests that have not yet been started or that already have
>>> completed are also reported through debugfs.
>>>
>>> This patch depends on a patch that went upstream recently, namely
>>> commit 14e3062fb185 ("scsi: core: Fix a scsi_show_rq() NULL pointer
>>> dereference").
>>
>> Why don't we just kill the check, and dump any request that has a
>> matching hctx? We already know the bit was set, so just print
>> all of them.
> 
> It is very helpful during debugging that requests owned by a block driver and
> requests owned by the block layer core show up in different debugfs files.
> Removing the check completely would cause all requests to show up in the same
> debugfs file and would make interpreting the contents of these debugfs files
> much harder.

Yeah, we'd have to make it just one file at that point. I'm not hugely
against the queuelist check, but probably warrants a comment as it's not
immediately clear (as opposed to the idle check, or the previous START
bit check).

-- 
Jens Axboe



Re: [PATCHSET v5] blk-mq: reimplement timeout handling

2018-01-12 Thread Jens Axboe
On 1/12/18 2:19 PM, Bart Van Assche wrote:
> On Fri, 2018-01-12 at 14:07 -0700, Jens Axboe wrote:
>> You're really not making it easy for folks to run this :-)
> 
> My hope is that the ib_srp and ib_srpt patches will be accepted upstream soon.
> As long as these are not upstream, anyone who wants to retrieve these patches
> is welcome to clone 
> https://github.com/bvanassche/linux/tree/block-scsi-for-next,
> a kernel tree with all my pending patches.
> 
>> Do you have the matching blk-mq debugfs output for the device?
> 
> Sorry but I only retrieved the blk-mq debugfs output several minutes after
> the hang started so I'm not sure the state information is relevant. Anyway,
> I have attached it to this e-mail. The most remarkable part is the following:
> 
> ./9ddfa913/requeue_list:9646711c {.op=READ, .state=idle, gen=0x118, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete=0, .tag=-1, .internal_tag=217}

Two things come to mind here:

1) We forgot to add RQF_STARTED to the debugfs bits, I just rectified
   that.

2) You are using a scheduler (which one?). The request was inserted, and
   retrieved by the driver, then requeued. After this requeue,
   apparently nothing happened. The queue should have been re-run, but
   that didn't happen. What are the queue/hctx states?

-- 
Jens Axboe



Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started

2018-01-12 Thread Jens Axboe
On 1/12/18 2:52 PM, Bart Van Assche wrote:
> When debugging e.g. the SCSI timeout handler it is important that
> requests that have not yet been started or that already have
> completed are also reported through debugfs.
> 
> This patch depends on a patch that went upstream recently, namely
> commit 14e3062fb185 ("scsi: core: Fix a scsi_show_rq() NULL pointer
> dereference").

Why don't we just kill the check, and dump any request that has a
matching hctx? We already know the bit was set, so just print
all of them.

-- 
Jens Axboe



[PATCH] blk-mq-debugfs: Also show requests that have not yet been started

2018-01-12 Thread Bart Van Assche
When debugging e.g. the SCSI timeout handler it is important that
requests that have not yet been started or that already have
completed are also reported through debugfs.

This patch depends on a patch that went upstream recently, namely
commit 14e3062fb185 ("scsi: core: Fix a scsi_show_rq() NULL pointer
dereference").

Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Martin K. Petersen 
---
 block/blk-mq-debugfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 19db3f583bf1..ccd5ef08c3f0 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -402,7 +402,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
const struct show_busy_params *params = data;
 
if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-   blk_mq_rq_state(rq) != MQ_RQ_IDLE)
+   list_empty(&rq->queuelist))
__blk_mq_debugfs_rq_show(params->m,
 list_entry_rq(&rq->queuelist));
 }
-- 
2.15.1



Re: [PATCHSET v5] blk-mq: reimplement timeout handling

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 14:07 -0700, Jens Axboe wrote:
> You're really not making it easy for folks to run this :-)

My hope is that the ib_srp and ib_srpt patches will be accepted upstream soon.
As long as these are not upstream, anyone who wants to retrieve these patches
is welcome to clone 
https://github.com/bvanassche/linux/tree/block-scsi-for-next,
a kernel tree with all my pending patches.

> Do you have the matching blk-mq debugfs output for the device?

Sorry but I only retrieved the blk-mq debugfs output several minutes after the hang
started so I'm not sure the state information is relevant. Anyway, I have attached
it to this e-mail. The most remarkable part is the following:

./9ddfa913/requeue_list:9646711c {.op=READ, .state=idle, gen=0x118, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete=0, .tag=-1, .internal_tag=217}

The hexadecimal number at the start is the request_queue pointer (I modified the
blk-mq-debugfs code such that queues are registered with their address just after
creation and until a name is assigned). This is a dm-mpath queue.

Bart.

./9ddfa913/hctx0/cpu3/completed:29 0
./9ddfa913/hctx0/cpu3/merged:0
./9ddfa913/hctx0/cpu3/dispatched:30 0
./9ddfa913/hctx0/cpu2/completed:0 0
./9ddfa913/hctx0/cpu2/merged:0
./9ddfa913/hctx0/cpu2/dispatched:0 0
./9ddfa913/hctx0/cpu1/completed:0 0
./9ddfa913/hctx0/cpu1/merged:0
./9ddfa913/hctx0/cpu1/dispatched:0 0
./9ddfa913/hctx0/cpu0/completed:0 0
./9ddfa913/hctx0/cpu0/merged:0
./9ddfa913/hctx0/cpu0/dispatched:0 0
./9ddfa913/hctx0/active:0
./9ddfa913/hctx0/run:92
./9ddfa913/hctx0/queued:30
./9ddfa913/hctx0/dispatched:   00
./9ddfa913/hctx0/dispatched:   197
./9ddfa913/hctx0/dispatched:   20
./9ddfa913/hctx0/dispatched:   40
./9ddfa913/hctx0/dispatched:   80
./9ddfa913/hctx0/dispatched:  160
./9ddfa913/hctx0/dispatched:  32+   0
./9ddfa913/hctx0/io_poll:considered=0
./9ddfa913/hctx0/io_poll:invoked=0
./9ddfa913/hctx0/io_poll:success=0
./9ddfa913/hctx0/sched_tags_bitmap::      
  
./9ddfa913/hctx0/sched_tags_bitmap:0010:      
0002  
./9ddfa913/hctx0/sched_tags:nr_tags=256
./9ddfa913/hctx0/sched_tags:nr_reserved_tags=0
./9ddfa913/hctx0/sched_tags:active_queues=0
./9ddfa913/hctx0/sched_tags:
./9ddfa913/hctx0/sched_tags:bitmap_tags:
./9ddfa913/hctx0/sched_tags:depth=256
./9ddfa913/hctx0/sched_tags:busy=1
./9ddfa913/hctx0/sched_tags:bits_per_word=64
./9ddfa913/hctx0/sched_tags:map_nr=4
./9ddfa913/hctx0/sched_tags:alloc_hint={45, 56, 144, 218}
./9ddfa913/hctx0/sched_tags:wake_batch=8
./9ddfa913/hctx0/sched_tags:wake_index=0
./9ddfa913/hctx0/sched_tags:ws={
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:{.wait_cnt=8, .wait=inactive},
./9ddfa913/hctx0/sched_tags:}
./9ddfa913/hctx0/sched_tags:round_robin=0
./9ddfa913/hctx0/tags_bitmap::       
 
./9ddfa913/hctx0/tags_bitmap:0010:       
 
./9ddfa913/hctx0/tags_bitmap:0020:       
 
./9ddfa913/hctx0/tags_bitmap:0030:       
 
./9ddfa913/hctx0/tags_bitmap:0040:       
 
./9ddfa913/hctx0/tags_bitmap:0050:       
 
./9ddfa913/hctx0/tags_bitmap:0060:       
 
./9ddfa913/hctx0/tags_bitmap:0070:       
 
./9ddfa913/hctx0/tags_bitmap:0080:       
 
./9ddfa913/hctx0/tags_bitmap:0090:       
 
./9ddfa913/hctx0/tags_bitmap:00a0:       
 
./9ddfa913/hctx0/tags_bitmap:00b0:       
 
./9ddfa913/hctx0/tags_bitmap:00c0:       
 
./9ddfa913/hctx0/tags_bitmap:00d0:       
 
./9ddfa913/hctx0/tags_bitmap:00e0:      

Re: [PATCHSET v5] blk-mq: reimplement timeout handling

2018-01-12 Thread Jens Axboe
On 1/12/18 1:57 PM, Bart Van Assche wrote:
> On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote:
>> Currently, blk-mq timeout path synchronizes against the usual
>> issue/completion path using a complex scheme involving atomic
>> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunately, it contains quite a few holes.
> 
> Hello Tejun,
> 
> With this patch series applied I see weird hangs in blk_mq_get_tag() when I
> run the srp-test software. If I pull Jens' latest for-next branch and revert
> this patch series then the srp-test software runs successfully. Note: if you
> don't have InfiniBand hardware available then you will need the RDMA/CM
> patches for the SRP initiator and target drivers that have been posted
> recently on the linux-rdma mailing list to run the srp-test software.

You're really not making it easy for folks to run this :-)

> This is how I run the srp-test software in a VM:
> 
> ./run_tests -c -d -r 10
> 
> Here is an example of what SysRq-w reported when the hang occurred:
> 
> sysrq: SysRq : Show Blocked State
>  taskPC stack   pid father
> kworker/u8:0D12864 5  2 0x8000
> Workqueue: events_unbound sd_probe_async [sd_mod]
> Call Trace:
> ? __schedule+0x2b4/0xbb0
> schedule+0x2d/0x90
> io_schedule+0xd/0x30
> blk_mq_get_tag+0x169/0x290
> ? finish_wait+0x80/0x80
> blk_mq_get_request+0x16a/0x4f0
> blk_mq_alloc_request+0x59/0xc0
> blk_get_request_flags+0x3f/0x260
> scsi_execute+0x33/0x1e0 [scsi_mod]
> read_capacity_16.part.35+0x9c/0x460 [sd_mod]
> sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod]
> sd_probe_async+0xf2/0x1a0 [sd_mod]
> process_one_work+0x21c/0x6d0
> worker_thread+0x35/0x380
> ? process_one_work+0x6d0/0x6d0
> kthread+0x117/0x130
> ? kthread_create_worker_on_cpu+0x40/0x40
> ret_from_fork+0x24/0x30
> systemd-udevd   D13672  1048285 0x0100
> Call Trace:
> ? __schedule+0x2b4/0xbb0
> schedule+0x2d/0x90
> io_schedule+0xd/0x30
> generic_file_read_iter+0x32f/0x970
> ? page_cache_tree_insert+0x100/0x100
> __vfs_read+0xcc/0x120
> vfs_read+0x96/0x140
> SyS_read+0x40/0xa0
> do_syscall_64+0x5f/0x1b0
> entry_SYSCALL64_slow_path+0x25/0x25

Do you have the matching blk-mq debugfs output for the device?

-- 
Jens Axboe



Re: [PATCHSET v5] blk-mq: reimplement timeout handling

2018-01-12 Thread Bart Van Assche
On Tue, 2018-01-09 at 08:29 -0800, Tejun Heo wrote:
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunately, it contains quite a few holes.

Hello Tejun,

With this patch series applied I see weird hangs in blk_mq_get_tag() when I
run the srp-test software. If I pull Jens' latest for-next branch and revert
this patch series then the srp-test software runs successfully. Note: if you
don't have InfiniBand hardware available then you will need the RDMA/CM
patches for the SRP initiator and target drivers that have been posted
recently on the linux-rdma mailing list to run the srp-test software.

This is how I run the srp-test software in a VM:

./run_tests -c -d -r 10

Here is an example of what SysRq-w reported when the hang occurred:

sysrq: SysRq : Show Blocked State
 taskPC stack   pid father
kworker/u8:0D12864 5  2 0x8000
Workqueue: events_unbound sd_probe_async [sd_mod]
Call Trace:
? __schedule+0x2b4/0xbb0
schedule+0x2d/0x90
io_schedule+0xd/0x30
blk_mq_get_tag+0x169/0x290
? finish_wait+0x80/0x80
blk_mq_get_request+0x16a/0x4f0
blk_mq_alloc_request+0x59/0xc0
blk_get_request_flags+0x3f/0x260
scsi_execute+0x33/0x1e0 [scsi_mod]
read_capacity_16.part.35+0x9c/0x460 [sd_mod]
sd_revalidate_disk+0x14bb/0x1cb0 [sd_mod]
sd_probe_async+0xf2/0x1a0 [sd_mod]
process_one_work+0x21c/0x6d0
worker_thread+0x35/0x380
? process_one_work+0x6d0/0x6d0
kthread+0x117/0x130
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x24/0x30
systemd-udevd   D13672  1048285 0x0100
Call Trace:
? __schedule+0x2b4/0xbb0
schedule+0x2d/0x90
io_schedule+0xd/0x30
generic_file_read_iter+0x32f/0x970
? page_cache_tree_insert+0x100/0x100
__vfs_read+0xcc/0x120
vfs_read+0x96/0x140
SyS_read+0x40/0xa0
do_syscall_64+0x5f/0x1b0
entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f8ce6d08d11
RSP: 002b:7fff96dec288 EFLAGS: 0246 ORIG_RAX: 
RAX: ffda RBX: 5651de7f6e10 RCX: 7f8ce6d08d11
RDX: 0040 RSI: 5651de7f6e38 RDI: 0007
RBP: 5651de7ea500 R08: 7f8ce6cf1c20 R09: 5651de7f6e10
R10: 006f R11: 0246 R12: 01ff
R13: 01ff0040 R14: 5651de7ea550 R15: 0040
systemd-udevd   D13496  1049285 0x0100
Call Trace:
? __schedule+0x2b4/0xbb0
schedule+0x2d/0x90
io_schedule+0xd/0x30
blk_mq_get_tag+0x169/0x290
? finish_wait+0x80/0x80
blk_mq_get_request+0x16a/0x4f0
blk_mq_make_request+0x105/0x8e0
? generic_make_request+0xd6/0x3d0
generic_make_request+0x103/0x3d0
? submit_bio+0x57/0x110
submit_bio+0x57/0x110
mpage_readpages+0x13b/0x160
? I_BDEV+0x10/0x10
? rcu_read_lock_sched_held+0x66/0x70
? __alloc_pages_nodemask+0x2e8/0x360
__do_page_cache_readahead+0x2a4/0x370
? force_page_cache_readahead+0xaf/0x110
force_page_cache_readahead+0xaf/0x110
generic_file_read_iter+0x743/0x970
? find_held_lock+0x2d/0x90
? _raw_spin_unlock+0x29/0x40
__vfs_read+0xcc/0x120
vfs_read+0x96/0x140
SyS_read+0x40/0xa0
do_syscall_64+0x5f/0x1b0
entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f8ce6d08d11
RSP: 002b:7fff96dec8b8 EFLAGS: 0246 ORIG_RAX: 
RAX: ffda RBX: 7f8ce7085010 RCX: 7f8ce6d08d11
RDX: 0004 RSI: 7f8ce7085038 RDI: 000f
RBP: 5651de7ec840 R08:  R09: 7f8ce7085010
R10: 7f8ce7085028 R11: 0246 R12: 
R13: 0004 R14: 5651de7ec890 R15: 0004
systemd-udevd   D13672  1055285 0x0100
Call Trace:
? __schedule+0x2b4/0xbb0
schedule+0x2d/0x90
io_schedule+0xd/0x30
blk_mq_get_tag+0x169/0x290
? finish_wait+0x80/0x80
blk_mq_get_request+0x16a/0x4f0
blk_mq_make_request+0x105/0x8e0
? generic_make_request+0xd6/0x3d0
generic_make_request+0x103/0x3d0
? submit_bio+0x57/0x110
submit_bio+0x57/0x110
mpage_readpages+0x13b/0x160
? I_BDEV+0x10/0x10
? rcu_read_lock_sched_held+0x66/0x70
? __alloc_pages_nodemask+0x2e8/0x360
__do_page_cache_readahead+0x2a4/0x370
? force_page_cache_readahead+0xaf/0x110
force_page_cache_readahead+0xaf/0x110
generic_file_read_iter+0x743/0x970
__vfs_read+0xcc/0x120
vfs_read+0x96/0x140
SyS_read+0x40/0xa0
do_syscall_64+0x5f/0x1b0
entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f8ce6d08d11
RSP: 002b:7fff96dec848 EFLAGS: 0246 ORIG_RAX: 
RAX: ffda RBX: 5651de7ec300 RCX: 7f8ce6d08d11
RDX: 0100 RSI: 5651de7ec328 RDI: 000f
RBP: 5651de7ea500 R08:  R09: 5651de7ec300
R10: 5651de7ec318 R11: 0246 R12: 01ffe000
R13: 01ffe100 R14: 5651de7ea550 R15: 0100

Please let me know if you need more information.

Bart.

RE: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> ow...@vger.kernel.org] On Behalf Of Bart Van Assche
...
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

Seeing SCSI BUSY mentioned here...

On its own, patch 6077c2d706 seems to be adding an arbitrarily selected
magic value of 100 ms without explanation in the patch description or
in the added code.

But dm_requeue_original_request() already seems to have chosen that value
for similar purposes:
unsigned long delay_ms = delay_requeue ? 100 : 0;

So, making them share a #define would indicate there's a reason for that
particular value.  Any change to the value would be picked up everywhere.

All the other callers of blk_mq_delay_run_hw_queue() use macros:
drivers/md/dm-rq.c: blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
drivers/nvme/host/fc.c: blk_mq_delay_run_hw_queue(queue->hctx, NVMEFC_QUEUE_DELAY);
drivers/scsi/scsi_lib.c:blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
drivers/scsi/scsi_lib.c:blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);

Those both happen to be set to 3 (ms).


---
Robert Elliott, HPE Persistent Memory





Re: [PATCH 1/2] genirq/affinity: assign vectors to all possible CPUs

2018-01-12 Thread Thomas Gleixner
On Fri, 12 Jan 2018, Ming Lei wrote:

> From: Christoph Hellwig 
> 
> Currently we assign managed interrupt vectors to all present CPUs.  This
> works fine for systems where we only online/offline CPUs.  But in case of
> systems that support physical CPU hotplug (or the virtualized version of
> it) this means the additional CPUs covered for in the ACPI tables or on
> the command line are not catered for.  To fix this we'd either need to
> introduce new hotplug CPU states just for this case, or we can start
> assigning vectors to possible but not present CPUs.
> 
> Reported-by: Christian Borntraeger 
> Tested-by: Christian Borntraeger 
> Tested-by: Stefan Haberland 
> Cc: linux-ker...@vger.kernel.org
> Cc: Thomas Gleixner 

FWIW, Acked-by: Thomas Gleixner 



Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  1:54pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > OK, you have the stage: please give me a pointer to your best
> > explanation of the several.
> 
> Since the previous discussion about this topic occurred more than a month
> ago it could take more time to look up an explanation than to explain it
> again. Anyway, here we go. As you know a block layer request queue needs to
> be rerun if one or more requests are waiting and a previous condition that
> prevented the request from being executed has been cleared. For the dm-mpath
> driver, examples of such conditions are no tags available, a path that is
> busy (see also pgpath_busy()), path initialization that is in progress
> (pg_init_in_progress) or a request completes with status, e.g. if the
> SCSI core calls __blk_mq_end_request(req, error) with error != 0. For some
> of these conditions, e.g. path initialization completes, a callback
> function in the dm-mpath driver is called and it is possible to explicitly
> rerun the queue. I agree that for such scenarios a delayed queue run should
> not be triggered. For other scenarios, e.g. if a SCSI initiator submits a
> SCSI request over a fabric and the SCSI target replies with "BUSY" then the
> SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
> maximum number of retries has been reached (see also scsi_io_completion()).
> In that last case, if a SCSI target sends a "BUSY" reply over the wire back
> to the initiator, there is no other approach for the SCSI initiator to
> figure out whether it can queue another request than to resubmit the
> request. The worst possible strategy is to resubmit a request immediately
> because that will cause a significant fraction of the fabric bandwidth to
> be used just for replying "BUSY" to requests that can't be processed
> immediately.
> 
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

OK, thank you very much for this.  Really helps.

For starters multipath_clone_and_map() could do a fair amount more with
the insight that a SCSI "BUSY" was transmitted back.  If both blk-mq
being out of tags and SCSI "BUSY" simply return BLK_STS_RESOURCE then
dm-mpath doesn't have the ability to behave more intelligently.

Anyway, armed with this info I'll have a think about what we might do to
tackle this problem head on.

Thanks,
Mike


Re: [PATCH V3 2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure

2018-01-12 Thread Bart Van Assche
On Thu, 2018-01-11 at 14:01 +0800, Ming Lei wrote:
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 86bf502a8e51..fcddf5a62581 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>   if (queue_dying) {
> atomic_inc(&m->pg_init_in_progress);
>   activate_or_offline_path(pgpath);
> + return DM_MAPIO_DELAY_REQUEUE;
>   }
> - return DM_MAPIO_DELAY_REQUEUE;
> +
> + /*
> +  * blk-mq's SCHED_RESTART can cover this requeue, so
> +  * we needn't deal with it by DELAY_REQUEUE. More
> +  * importantly, we have to return DM_MAPIO_REQUEUE
> +  * so that blk-mq can get the queue busy feedback,
> +  * otherwise I/O merge can be hurt.
> +  */
> + if (q->mq_ops)
> + return DM_MAPIO_REQUEUE;
> + else
> + return DM_MAPIO_DELAY_REQUEUE;
>   }

Sorry but the approach of this patch looks wrong to me. I'm afraid that this
approach will cause 100% CPU consumption if the underlying .queue_rq()
function returns BLK_STS_RESOURCE for another reason than what can be detected
by the .get_budget() call. This can happen if e.g. a SCSI LLD .queuecommand()
implementation returns SCSI_MLQUEUE_HOST_BUSY. Many SCSI LLDs can do this:
$ git grep 'SCSI_MLQUEUE_[^_]*_BUSY' | wc -l
204

Isn't this a severe regression?

Bart.

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> OK, you have the stage: please give me a pointer to your best
> explanation of the several.

Since the previous discussion about this topic occurred more than a month
ago it could take more time to look up an explanation than to explain it
again. Anyway, here we go. As you know a block layer request queue needs to
be rerun if one or more requests are waiting and a previous condition that
prevented the request from being executed has been cleared. For the dm-mpath
driver, examples of such conditions are no tags available, a path that is
busy (see also pgpath_busy()), path initialization that is in progress
(pg_init_in_progress) or a request completes with status, e.g. if the
SCSI core calls __blk_mq_end_request(req, error) with error != 0. For some
of these conditions, e.g. path initialization completes, a callback
function in the dm-mpath driver is called and it is possible to explicitly
rerun the queue. I agree that for such scenarios a delayed queue run should
not be triggered. For other scenarios, e.g. if a SCSI initiator submits a
SCSI request over a fabric and the SCSI target replies with "BUSY" then the
SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
maximum number of retries has been reached (see also scsi_io_completion()).
In that last case, if a SCSI target sends a "BUSY" reply over the wire back
to the initiator, there is no other approach for the SCSI initiator to
figure out whether it can queue another request than to resubmit the
request. The worst possible strategy is to resubmit a request immediately
because that will cause a significant fraction of the fabric bandwidth to
be used just for replying "BUSY" to requests that can't be processed
immediately.

The intention of commit 6077c2d706097c0 was to address the last mentioned
case. It may be possible to move the delayed queue rerun from the
dm_queue_rq() into dm_requeue_original_request(). But I think it would be
wrong to rerun the queue immediately in case a SCSI target system returns
"BUSY".

Bart.

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at 12:46pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 12:40 -0500, Mike Snitzer wrote:
> > You've not explained it many times.
> 
> That's not correct. I have already several times posted a detailed and easy
> to understand explanation

OK, you have the stage: please give me a pointer to your best
explanation of the several.

> > We cannot get a straight answer from you.
>
> That is a gross and incorrect statement. Please calm down.

I'm perfectly calm.  I'm just tired of how you sow controversy.


Re: [PATCH 0/2] blk-mq: support physical CPU hotplug

2018-01-12 Thread Jens Axboe
On 1/11/18 7:53 PM, Ming Lei wrote:
> Hi,
> 
> This two patches support physical CPU hotplug, so that we can make blk-mq
> scale well when new physical CPU is added or removed, and this use case
> is normal for VM world.
> 
> Also this patchset fixes the following warning reported by Christian
> Borntraeger:
> 
>   https://marc.info/?l=linux-block&m=151092973417143&w=2
> 
> Christoph Hellwig (2):
>   genirq/affinity: assign vectors to all possible CPUs
>   blk-mq: simplify queue mapping & schedule with each possible CPU

Applied, thanks Ming.

-- 
Jens Axboe



Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 12:40 -0500, Mike Snitzer wrote:
> You've not explained it many times.

That's not correct. I have already several times posted a detailed and easy
to understand explanation

> We cannot get a straight answer from you.

That is a gross and incorrect statement. Please calm down.

Bart.

Re: [GIT PULL] one nvme fix for Linux 4.15

2018-01-12 Thread Jens Axboe
On 1/12/18 8:09 AM, Christoph Hellwig wrote:
> The following changes since commit 254beb84faccbe2f4eda0b51924857bdfb679969:
> 
>   nvme-fcloop: avoid possible uninitialized variable warning (2017-12-29 
> 10:37:21 +0100)
> 
> are available in the git repository at:
> 
>   git://git.infradead.org/nvme.git nvme-4.15
> 
> for you to fetch changes up to 6b018235b4daabae96d855219fae59c3fb8be417:
> 
>   nvme-fabrics: initialize default host->id in nvmf_host_default() 
> (2018-01-08 10:52:03 +0100)

Pulled, thanks.

-- 
Jens Axboe



Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at 12:26pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 12:18 -0500, Mike Snitzer wrote:
> > This is going upstream for 4.16:
> > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89
> 
> That is really gross. I have explained many times in detail on the dm-devel
> list why that blk_mq_delay_run_hw_queue() call is necessary. So I think it is
> wrong if you remove it. Isn't it your responsibility as the device mapper
> kernel maintainer to avoid regressions instead of introducing regressions?

Please stop, seriously.

You've not explained it many times.  We cannot get a straight answer
from you.  No analysis that establishes that if an underlying dm-mq
multipath path is out of tags (shared or otherwise) that dm-rq core
_must_ run the hw queue after a delay.  

Commit 6077c2d7060 just papers over a real blk-mq problem (that may now
be fixed).  Your assertions that blk-mq would need to otherwise poll
(and waste resources so it can immediately retry) ignores that blk-mq
_should_ make progress as requests complete or if/when a path is
recovered, etc.  So I'm not going to accept your dysfunctional reasoning
on this, sorry.  _THAT_ is my job as a maintainer. 


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 12:18 -0500, Mike Snitzer wrote:
> This is going upstream for 4.16:
> https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89

That is really gross. I have explained many times in detail on the dm-devel
list why that blk_mq_delay_run_hw_queue() call is necessary. So I think it is
wrong if you remove it. Isn't it your responsibility as the device mapper
kernel maintainer to avoid regressions instead of introducing regressions?

Bart.

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Thu, Jan 11 2018 at 10:33pm -0500,
Ming Lei  wrote:

> On Thu, Jan 11, 2018 at 08:57:21PM -0500, Mike Snitzer wrote:
> > On Thu, Jan 11 2018 at  8:42pm -0500,
> > Ming Lei  wrote:
> > 
> > > On Thu, Jan 11, 2018 at 10:37:37PM +, Bart Van Assche wrote:
> > > > On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > > > > Bart, if for some reason we regress for some workload you're able to
> > > > > more readily test we can deal with it.  But I'm too encouraged by 
> > > > > Ming's
> > > > > performance improvements to hold these changes back any further.
> > > > 
> > > > Sorry Mike but I think Ming's measurement results are not sufficient to
> > > > motivate upstreaming of these patches. What I remember from previous 
> > > > versions
> > > > of this patch series is that Ming measured the performance impact of 
> > > > this
> > > > patch series only for the Emulex FC driver (lpfc). That driver limits 
> > > > queue
> > > > depth to 3. Instead of modifying the dm code, that driver needs to be 
> > > > fixed
> > > > such that it allows larger queue depths.
> > > > 
> > > > Additionally, some time ago I had explained in detail why I think that 
> > > > patch
> > > > 2/5 in this series is wrong and why it will introduce fairness 
> > > > regressions
> > > > in multi-LUN workloads. I think we need performance measurements for 
> > > > this
> > > > patch series for at least the following two configurations before this 
> > > > patch
> > > > series can be considered for upstream inclusion:
> > > > * The performance impact of this patch series for SCSI devices with a
> > > >   realistic queue depth (e.g. 64 or 128).
> > > 
> > > Please look at the cover letter or patch 5's commit log, it mentions that
> > > dm-mpath over virtio-scsi is tested, and the default queue depth of 
> > > virito-scsi
> > > is 128.
> > > 
> > > > * The performance impact for this patch series for a SCSI host with 
> > > > which
> > > >   multiple LUNs are associated and for which the target system often 
> > > > replies
> > > >   with the SCSI "BUSY" status.
> > > 
> > > I don't think this patch is related to this case; this patch just provides the
> > > BUSY feedback from the underlying queue to blk-mq via dm-rq.
> > 
> > Hi Ming,
> > 
> > Please see https://www.redhat.com/archives/dm-devel/2017-April/msg00190.html
> > 
> > Specifically:
> > "That patch [commit 6077c2d706] can be considered as a first step that
> > can be refined further, namely by modifying the dm-rq code further such
> > that dm-rq queues are only rerun after the busy condition has been
> > cleared. The patch at the start of this thread is easier to review and
> > easier to test than any patch that would only rerun dm-rq queues after
> > the busy condition has been cleared."
> > 
> > Do you have time to reason through Bart's previous proposal for how to
> > better address the specific issue that was documented to be fixed in the
> > header for commit 6077c2d706 ?
> 
> Hi Mike,
> 
> Recently we fixed one such issue in blk-mq (eb619fdb2d4cb8b3d34), and I
> strongly suspect that it may fix Bart's case.

Wow, kind of staggering that there was no mention of this fix prior to
now.  Seems _very_ relevant (like the real fix).

> Let's look at commit 6077c2d706 again: I don't know the idea behind the
> fix, which just adds an arbitrary 100 ms delay, and this delay degrades
> performance a lot, since no hctx can be scheduled any more during the delay.

I'd love to try to reproduce the issue documented in commit
6077c2d706 but sadly I cannot get that srp-test to work on my system.
Just fails with:

# while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
Unloaded the ib_srpt kernel module
Unloaded the rdma_rxe kernel module
Zero-initializing /dev/ram0 ... done
Zero-initializing /dev/ram1 ... done
Configured SRP target driver
Running test srp-test/tests/02-mq ...
Test file I/O on top of multipath concurrently with logout and login (0 min; mq)
SRP login failed
Test srp-test/tests/02-mq failed

Think the login is failing due to the target not being set up properly?
Yeap, now from reading this thread, it is clear that srp-test only works
if you have an IB HCA:
https://patchwork.kernel.org/patch/10041381/

> So again it is definitely a workaround: no reasoning, no theory, it is just
> called a fix. Is there anyone who can explain the story?
> 
> IMO, the blk_get_request() in dm-mpath is just like a kmalloc(ATOMIC), which
> is used in other blk-mq drivers' I/O paths too, so I don't know why dm-mq is
> so special that it requires such an ugly 100 ms delay.

The header for commit 6077c2d706 notes that same action that Jens took
to unwedge the request stalls (in the patchwork thread I referenced
above), with:

echo run > /sys/kernel/debug/block/nullb1/state
vs what Bart noted in commit 6077c2d706:
echo run >/sys/kernel/debug/block/dm-0/state

Really does feel like the issue Jens' shared tags fix addressed _is_ the
root cause that commit 6077c2d706 worked around.


[for-4.16 PATCH v6 2/4] block: properly protect the 'queue' kobj in blk_unregister_queue

2018-01-12 Thread Mike Snitzer
The original commit e9a823fb34a8b (block: fix warning when I/O elevator
is changed as request_queue is being removed) is pretty conflated.
"conflated" because the resource being protected by q->sysfs_lock isn't
the queue_flags (it is the 'queue' kobj).

q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
from racing with blk_unregister_queue():
1) By holding q->sysfs_lock first, __elevator_change() can complete
before a racing blk_unregister_queue().
2) Conversely, __elevator_change() tests for QUEUE_FLAG_REGISTERED because,
in case elv_iosched_store() loses the race with blk_unregister_queue(),
it needs a way to know the 'queue' kobj isn't there.

Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
held until after the 'queue' kobj is removed.

To do so blk_mq_unregister_dev() must not also take q->sysfs_lock.  So
rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

Also, blk_unregister_queue() should use q->queue_lock to protect against
any concurrent writes to q->queue_flags -- even though chances are the
queue is being cleaned up so no concurrent writes are likely.

Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
Signed-off-by: Mike Snitzer 
---
 block/blk-mq-sysfs.c |  9 +
 block/blk-sysfs.c| 13 ++---
 2 files changed, 11 insertions(+), 11 deletions(-)

v6: blk_mq_unregister_dev now requires that q->sysfs_lock be held. Ming: I am
not seeing any lockdep complaints with this.  I've tested bio-based,
blk-mq and old .request_fn request-based.

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 79969c3c234f..a54b4b070f1c 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -248,7 +248,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
return ret;
 }
 
-static void __blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
+void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
 {
struct blk_mq_hw_ctx *hctx;
int i;
@@ -265,13 +265,6 @@ static void __blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
q->mq_sysfs_init_done = false;
 }
 
-void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
-{
-   mutex_lock(&q->sysfs_lock);
-   __blk_mq_unregister_dev(dev, q);
-   mutex_unlock(&q->sysfs_lock);
-}
-
 void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx)
 {
 kobject_init(&hctx->kobj, &blk_mq_hw_ktype);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 870484eaed1f..9272452ff456 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -929,12 +929,17 @@ void blk_unregister_queue(struct gendisk *disk)
if (WARN_ON(!q))
return;
 
+   /*
+* Protect against the 'queue' kobj being accessed
+* while/after it is removed.
+*/
 mutex_lock(&q->sysfs_lock);
-   queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
-   mutex_unlock(&q->sysfs_lock);
 
-   wbt_exit(q);
+   spin_lock_irq(q->queue_lock);
+   queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
+   spin_unlock_irq(q->queue_lock);
 
+   wbt_exit(q);
 
if (q->mq_ops)
blk_mq_unregister_dev(disk_to_dev(disk), q);
@@ -946,4 +951,6 @@ void blk_unregister_queue(struct gendisk *disk)
kobject_del(&q->kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));
kobject_put(&disk_to_dev(disk)->kobj);
+
+   mutex_unlock(&q->sysfs_lock);
 }
-- 
2.15.0



Re: [for-4.16 PATCH v5 2/4] block: properly protect the 'queue' kobj in blk_unregister_queue

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at 10:17am -0500,
Ming Lei  wrote:

> On Fri, Jan 12, 2018 at 10:06:04AM -0500, Mike Snitzer wrote:
> > The original commit e9a823fb34a8b (block: fix warning when I/O elevator
> > is changed as request_queue is being removed) is pretty conflated.
> > "conflated" because the resource being protected by q->sysfs_lock isn't
> > the queue_flags (it is the 'queue' kobj).
> > 
> > q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
> > from racing with blk_unregister_queue():
> > 1) By holding q->sysfs_lock first, __elevator_change() can complete
> > before a racing blk_unregister_queue().
> > 2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
> > in case elv_iosched_store() loses the race with blk_unregister_queue(),
> > it needs a way to know the 'queue' kobj isn't there.
> > 
> > Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
> > held until after the 'queue' kobj is removed.
> 
> This way will cause deadlock, see below.

Ngh... I thought I tested blk-mq with this patch applied, apparently not.

> > 
> > Also, blk_unregister_queue() should use q->queue_lock to protect against
> > any concurrent writes to q->queue_flags -- even though chances are the
> > queue is being cleaned up so no concurrent writes are likely.
> > 
> > Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as 
> > request_queue is being removed")
> > Signed-off-by: Mike Snitzer 
> > ---
> >  block/blk-sysfs.c | 13 ++---
> >  1 file changed, 10 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index 870484eaed1f..9272452ff456 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -929,12 +929,17 @@ void blk_unregister_queue(struct gendisk *disk)
> > if (WARN_ON(!q))
> > return;
> >  
> > +   /*
> > +* Protect against the 'queue' kobj being accessed
> > +* while/after it is removed.
> > +*/
> > mutex_lock(&q->sysfs_lock);
> > -   queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
> > -   mutex_unlock(&q->sysfs_lock);
> >  
> > -   wbt_exit(q);
> > +   spin_lock_irq(q->queue_lock);
> > +   queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
> > +   spin_unlock_irq(q->queue_lock);
> >  
> > +   wbt_exit(q);
> >  
> > if (q->mq_ops)
> > blk_mq_unregister_dev(disk_to_dev(disk), q);
> 
> void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
> {
> mutex_lock(&q->sysfs_lock);
> __blk_mq_unregister_dev(dev, q);
> mutex_unlock(&q->sysfs_lock);
> }
> 
> > @@ -946,4 +951,6 @@ void blk_unregister_queue(struct gendisk *disk)
> > kobject_del(&q->kobj);
> > blk_trace_remove_sysfs(disk_to_dev(disk));
> > kobject_put(&disk_to_dev(disk)->kobj);
> > +
> > +   mutex_unlock(&q->sysfs_lock);
> >  }
> 
> Except for above, I remember there is also lockdep warning between
> sysfs_lock and driver core's lock if the sysfs_lock is extended in this
> way (I tried it before, but forgot the details now), so please just hold
> queue_lock inside the sysfs lock.

No good deed goes unpunished.

I'll fix this up.

Mike


Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000436

2018-01-12 Thread Paolo Valente


> On 12 Jan 2018, at 09:29, Paolo Valente 
>  wrote:
> 
> 
> 
>> On 12 Jan 2018, at 05:18, Ming Lei  wrote:
>> 
>> On Thu, Jan 11, 2018 at 08:40:54AM -0700, Jens Axboe wrote:
>>> On 1/11/18 2:41 AM, Paolo Valente wrote:
 
 
> On 11 Jan 2018, at 10:30, Paolo Valente 
>  wrote:
> 
> 
> 
>> On 10 Jan 2018, at 05:58, Ming Lei  wrote:
>> 
>> Hi Paolo,
>> 
>> Looks this one is introduced in recent merge, and it is triggered
>> in test of IO vs. removing device on the latest for-next of block
>> tree:
>> 
>> [  296.151615] BUG: unable to handle kernel NULL pointer dereference at 
>> 0436
>> [  296.152302] IP: percpu_counter_add_batch+0x25/0x9d
>> [  296.152698] PGD 0 P4D 0
>> [  296.152916] Oops:  [#1] PREEMPT SMP
>> [  296.153233] Dumping ftrace buffer:
>> [  296.153517](ftrace buffer empty)
>> [  296.153817] Modules linked in: scsi_debug(E) null_blk(E) isofs(E) 
>> iTCO_wdt(E) iTCO_vendor_support(E) i2c_i801(E) i2c_core(E) lpc_ich(E) 
>> mfd_core(E) ip_tables(E) sr_mod(E) cdrom(E) sd_mod(E) usb_storage(E) 
>> ahci(E) libahci(E) libata(E) crc32c_intel(E) virtio_scsi(E) 
>> qemu_fw_cfg(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last 
>> unloaded: scsi_debug]
>> [  296.156199] CPU: 2 PID: 2001 Comm: scsi_id Tainted: GE
>> 4.15.0-rc7790529290ab3_my_v4.15-rc-block-for-next #4
>> [  296.156961] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
>> 1.9.3-1.fc25 04/01/2014
>> [  296.157546] RIP: 0010:percpu_counter_add_batch+0x25/0x9d
>> [  296.157920] RSP: 0018:c90001133b28 EFLAGS: 00010002
>> [  296.158285] RAX: 0002 RBX: ff6e RCX: 
>> 0001
>> [  296.158779] RDX: 3fff RSI: 81e95429 RDI: 
>> 81e5e64c
>> [  296.159276] RBP: 0416 R08: 0001 R09: 
>> 0001
>> [  296.159770] R10: c90001133aa0 R11: 88007e96caf0 R12: 
>> 3fff
>> [  296.160273] R13: 0001 R14: 0001 R15: 
>> e8d0b180
>> [  296.160780] FS:  7f635988af80() GS:88007cf0() 
>> knlGS:
>> [  296.161325] CS:  0010 DS:  ES:  CR0: 80050033
>> [  296.161744] CR2: 0436 CR3: 717f1005 CR4: 
>> 003606e0
>> [  296.162258] DR0:  DR1:  DR2: 
>> 
>> [  296.162764] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [  296.163246] Call Trace:
>> [  296.163445]  bfqg_stats_update_io_add+0x35/0xdc
>> [  296.163754]  bfq_insert_requests+0xbb9/0xbf2
>> [  296.164049]  ? blk_rq_bio_prep+0x51/0x5d
>> [  296.164353]  ? blk_rq_append_bio+0x32/0x78
>> [  296.164633]  ? blk_rq_map_user_iov+0x11b/0x1a0
>> [  296.164940]  blk_mq_sched_insert_request+0xaf/0x12f
>> [  296.165273]  blk_execute_rq+0x4b/0x93
>> [  296.165586]  sg_io+0x236/0x38a
>> [  296.165800]  scsi_cmd_ioctl+0x1d4/0x381
>> [  296.166065]  ? seccomp_run_filters+0xee/0x12d
>> [  296.166362]  ? preempt_count_add+0x6d/0x8c
>> [  296.166643]  sd_ioctl+0xbb/0xde [sd_mod]
>> [  296.166915]  blkdev_ioctl+0x7f2/0x850
>> [  296.167167]  block_ioctl+0x39/0x3c
>> [  296.167401]  vfs_ioctl+0x1b/0x28
>> [  296.167622]  do_vfs_ioctl+0x4bc/0x542
>> [  296.167877]  ? syscall_trace_enter+0x164/0x261
>> [  296.168180]  SyS_ioctl+0x3e/0x5a
>> [  296.168402]  do_syscall_64+0x71/0x168
>> [  296.168654]  entry_SYSCALL64_slow_path+0x25/0x25
>> [  296.168970] RIP: 0033:0x7f63593a1dc7
>> [  296.169214] RSP: 002b:7fff87210878 EFLAGS: 0246 ORIG_RAX: 
>> 0010
>> [  296.169823] RAX: ffda RBX: 7fff872108b0 RCX: 
>> 7f63593a1dc7
>> [  296.170303] RDX: 7fff872108b0 RSI: 2285 RDI: 
>> 0004
>> [  296.170783] RBP: 7fff87210f00 R08: 2006 R09: 
>> fe00
>> [  296.171266] R10: 7fff87210910 R11: 0246 R12: 
>> 
>> [  296.171745] R13: 7fff872108b0 R14: 7fff872108ba R15: 
>> 7fff872112b0
>> [  296.172228] Code: 5d 41 5c 41 5d c3 41 55 41 54 49 89 f5 55 53 48 89 
>> fd bf 01 00 00 00 41 89 d4 e8 8c 19 d4 ff 48 c7 c7 29 54 e9 81 e8 fd 3d 
>> ff ff <48> 8b 45 20 49 63 d4 65 8b 18 48 63 db 4c 01 eb 48 39 d3 7d 0b
>> [  296.173594] RIP: percpu_counter_add_batch+0x25/0x9d RSP: 
>> c90001133b28
>> [  296.174054] CR2: 0436
>> [  296.174283] ---[ end trace 87b58f9235daac7e ]---
>> [  296.174673] Kernel panic - not syncing: 

Re: [for-4.16 PATCH v5 2/4] block: properly protect the 'queue' kobj in blk_unregister_queue

2018-01-12 Thread Ming Lei
On Fri, Jan 12, 2018 at 10:06:04AM -0500, Mike Snitzer wrote:
> The original commit e9a823fb34a8b (block: fix warning when I/O elevator
> is changed as request_queue is being removed) is pretty conflated.
> "conflated" because the resource being protected by q->sysfs_lock isn't
> the queue_flags (it is the 'queue' kobj).
> 
> q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
> from racing with blk_unregister_queue():
> 1) By holding q->sysfs_lock first, __elevator_change() can complete
> before a racing blk_unregister_queue().
> 2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
> in case elv_iosched_store() loses the race with blk_unregister_queue(),
> it needs a way to know the 'queue' kobj isn't there.
> 
> Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
> held until after the 'queue' kobj is removed.

This way will cause deadlock, see below.

> 
> Also, blk_unregister_queue() should use q->queue_lock to protect against
> any concurrent writes to q->queue_flags -- even though chances are the
> queue is being cleaned up so no concurrent writes are likely.
> 
> Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as 
> request_queue is being removed")
> Signed-off-by: Mike Snitzer 
> ---
>  block/blk-sysfs.c | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 870484eaed1f..9272452ff456 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -929,12 +929,17 @@ void blk_unregister_queue(struct gendisk *disk)
>   if (WARN_ON(!q))
>   return;
>  
> + /*
> +  * Protect against the 'queue' kobj being accessed
> +  * while/after it is removed.
> +  */
>   mutex_lock(&q->sysfs_lock);
> - queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
> - mutex_unlock(&q->sysfs_lock);
>  
> - wbt_exit(q);
> + spin_lock_irq(q->queue_lock);
> + queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
> + spin_unlock_irq(q->queue_lock);
>  
> + wbt_exit(q);
>  
>   if (q->mq_ops)
>   blk_mq_unregister_dev(disk_to_dev(disk), q);

void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
{
mutex_lock(&q->sysfs_lock);
__blk_mq_unregister_dev(dev, q);
mutex_unlock(&q->sysfs_lock);
}

> @@ -946,4 +951,6 @@ void blk_unregister_queue(struct gendisk *disk)
>   kobject_del(&q->kobj);
>   blk_trace_remove_sysfs(disk_to_dev(disk));
>   kobject_put(&disk_to_dev(disk)->kobj);
> +
> + mutex_unlock(&q->sysfs_lock);
>  }

Except for above, I remember there is also lockdep warning between
sysfs_lock and driver core's lock if the sysfs_lock is extended in this
way (I tried it before, but forgot the details now), so please just hold
queue_lock inside the sysfs lock.

Thanks,
Ming


Re: [PATCH IMPROVEMENT] block, bfq: limit sectors served with interactive weight raising

2018-01-12 Thread Jens Axboe
On 1/12/18 3:20 AM, Paolo Valente wrote:
> 
> 
>> On 12 Jan 2018, at 11:15, Holger Hoffstätte 
>>  wrote:
>>
>> On 01/12/18 06:58, Paolo Valente wrote:
>>>
>>>
 On 28 Dec 2017, at 15:00, Holger Hoffstätte 
  wrote:


 On 12/28/17 12:19, Paolo Valente wrote:
 (snip half a tech report ;)

 So either this or the previous patch ("limit tags for writes and async I/O")
 can lead to a hard, unrecoverable hang with heavy writes. Since I couldn't
 log into the affected system anymore I couldn't get any stack traces, 
 blk-mq
 debug output etc. but there was nothing in dmesg/on the console, so it
 wasn't a BUG/OOPS.

 -h
>>>
>>> Hi Holger,
>>> if, as I guess, this problem hasn't gone away for you, I have two
>>> requests:
>>> 1) could you share your exact test
>>> 2) if nothing happens in my systems with your test, would you be
>>> willing to retry with the dev version of bfq?  It should be able to
>>> tell us what takes to your hang.  If you are willing to do this test,
>>> I'll prepare a branch with everything already configured for you.
>>
>> Hi,
>>
>> thanks for following up but there's no need for any of that; it turned out
>> to be something else since I got the same hang without those patches at
>> least once (during a btrfs balance, even though it didn't look like btrfs'
>> fault directly; more like block/mm/helpers.
>>
>> So on January 7 I posted to linux-block et.al. where I said
>> "So this turned out to be something else, sorry for the false alarm."
>> but apparently that didn't make it through since it's not in the
>> archives either. Sorry.
>>
>> Long story short, the good news is that I've been running with both patches
>> since then without any issue. :)
>>
> 
> Wow, what a relief! :)
> 
> So, Jens, being the only issue reported gone, can you please consider
> queueing this patch and the other pending one [1]?  They are both
> critical for bfq performance.

Please just resend them.

-- 
Jens Axboe



[for-4.16 PATCH v5 1/4] block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN

2018-01-12 Thread Mike Snitzer
device_add_disk() will only call bdi_register_owner() if
!GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
bdi_unregister() if !GENHD_FL_HIDDEN.

Found with code inspection.  bdi_unregister() won't do any harm if
bdi_register_owner() wasn't used but best to avoid the unnecessary
call to bdi_unregister().

Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
Signed-off-by: Mike Snitzer 
Reviewed-by: Ming Lei 
Reviewed-by: Hannes Reinecke 
---
 block/genhd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 96a66f671720..00620e01e043 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -725,7 +725,8 @@ void del_gendisk(struct gendisk *disk)
 * Unregister bdi before releasing device numbers (as they can
 * get reused and we'd get clashes in sysfs).
 */
-   bdi_unregister(disk->queue->backing_dev_info);
+   if (!(disk->flags & GENHD_FL_HIDDEN))
+   bdi_unregister(disk->queue->backing_dev_info);
blk_unregister_queue(disk);
} else {
WARN_ON(1);
-- 
2.15.0



[for-4.16 PATCH v5 2/4] block: properly protect the 'queue' kobj in blk_unregister_queue

2018-01-12 Thread Mike Snitzer
The original commit e9a823fb34a8b (block: fix warning when I/O elevator
is changed as request_queue is being removed) is pretty conflated.
"conflated" because the resource being protected by q->sysfs_lock isn't
the queue_flags (it is the 'queue' kobj).

q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
from racing with blk_unregister_queue():
1) By holding q->sysfs_lock first, __elevator_change() can complete
before a racing blk_unregister_queue().
2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
in case elv_iosched_store() loses the race with blk_unregister_queue(),
it needs a way to know the 'queue' kobj isn't there.

Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
held until after the 'queue' kobj is removed.

Also, blk_unregister_queue() should use q->queue_lock to protect against
any concurrent writes to q->queue_flags -- even though chances are the
queue is being cleaned up so no concurrent writes are likely.

Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as 
request_queue is being removed")
Signed-off-by: Mike Snitzer 
---
 block/blk-sysfs.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 870484eaed1f..9272452ff456 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -929,12 +929,17 @@ void blk_unregister_queue(struct gendisk *disk)
if (WARN_ON(!q))
return;
 
+   /*
+    * Protect against the 'queue' kobj being accessed
+    * while/after it is removed.
+    */
mutex_lock(&q->sysfs_lock);
-   queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
-   mutex_unlock(&q->sysfs_lock);
 
-   wbt_exit(q);
+   spin_lock_irq(q->queue_lock);
+   queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
+   spin_unlock_irq(q->queue_lock);
 
+   wbt_exit(q);
 
if (q->mq_ops)
blk_mq_unregister_dev(disk_to_dev(disk), q);
@@ -946,4 +951,6 @@ void blk_unregister_queue(struct gendisk *disk)
kobject_del(&q->kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));
kobject_put(&disk_to_dev(disk)->kobj);
+
+   mutex_unlock(&q->sysfs_lock);
 }
-- 
2.15.0



[for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-12 Thread Mike Snitzer
Hi Jens,

I'm submitting this v5 with more feeling now ;)

I've distilled the changes down to be quite minimal.  Hopefully this
will help you and others review.

I've also done a dry-run of applying 4.16 block changes and then
merging dm-4.16; no merge conflicts:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=block-4.16
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=dm-4.16
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=block-4.16_dm-4.16

And it all tests very well for me.

Please consider for 4.16, thanks!

Mike

Mike Snitzer (4):
  block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN
  block: properly protect the 'queue' kobj in blk_unregister_queue
  block: allow gendisk's request_queue registration to be deferred
  dm: fix incomplete request_queue initialization

 block/blk-sysfs.c | 18 +++---
 block/genhd.c | 23 +++
 drivers/md/dm-rq.c|  9 -
 drivers/md/dm.c   | 11 ++-
 include/linux/genhd.h |  5 +
 5 files changed, 49 insertions(+), 17 deletions(-)

-- 
2.15.0



[for-4.16 PATCH v5 3/4] block: allow gendisk's request_queue registration to be deferred

2018-01-12 Thread Mike Snitzer
Since I can remember DM has forced the block layer to allow the
allocation and initialization of the request_queue to be distinct
operations.  Reason for this is that block/genhd.c:add_disk() requires
that the request_queue (and associated bdi) be tied to the gendisk
before add_disk() is called -- because add_disk() also deals with
exposing the request_queue via blk_register_queue().

DM's dynamic creation of arbitrary device types (and associated
request_queue types) requires the DM device's gendisk be available so
that DM table loads can establish a master/slave relationship with
subordinate devices that are referenced by loaded DM tables -- using
bd_link_disk_holder().  But until these DM tables, and their associated
subordinate devices, are known DM cannot know what type of request_queue
it needs -- nor what its queue_limits should be.

This chicken and egg scenario has created all manner of problems for DM
and, at times, the block layer.

Summary of changes:

- Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
  that drivers may use to add a disk without also calling
  blk_register_queue().  Driver must call blk_register_queue() once its
  request_queue is fully initialized.

- Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
  is not set.  It won't be set if driver used add_disk_no_queue_reg()
  but driver encounters an error and must del_gendisk() before calling
  blk_register_queue().

- Export blk_register_queue().

These changes allow DM to use add_disk_no_queue_reg() to anchor its
gendisk as the "master" for master/slave relationships DM must establish
with subordinate devices referenced in DM tables that get loaded.  Once
all "slave" devices for a DM device are known its request_queue can be
properly initialized and then advertised via sysfs -- important
improvement being that no request_queue resource initialization
performed by blk_register_queue() is missed for DM devices anymore.

Signed-off-by: Mike Snitzer 
Reviewed-by: Ming Lei 
---
 block/blk-sysfs.c |  5 +
 block/genhd.c | 20 +---
 include/linux/genhd.h |  5 +
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9272452ff456..4a6a40ffd78e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -921,6 +921,7 @@ int blk_register_queue(struct gendisk *disk)
mutex_unlock(&q->sysfs_lock);
return ret;
 }
+EXPORT_SYMBOL_GPL(blk_register_queue);
 
 void blk_unregister_queue(struct gendisk *disk)
 {
@@ -929,6 +930,10 @@ void blk_unregister_queue(struct gendisk *disk)
if (WARN_ON(!q))
return;
 
+   /* Return early if disk->queue was never registered. */
+   if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
+   return;
+
/*
 * Protect against the 'queue' kobj being accessed
 * while/after it is removed.
diff --git a/block/genhd.c b/block/genhd.c
index 00620e01e043..d4aaf0cae9ad 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -629,7 +629,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
 }
 
 /**
- * device_add_disk - add partitioning information to kernel list
+ * device_add_disk_no_queue_reg - add partitioning information to kernel list
  * @parent: parent device for the disk
  * @disk: per-device partitioning information
  *
@@ -638,7 +638,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  *
  * FIXME: error handling
  */
-void device_add_disk(struct device *parent, struct gendisk *disk)
+void device_add_disk_no_queue_reg(struct device *parent, struct gendisk *disk)
 {
dev_t devt;
int retval;
@@ -682,7 +682,6 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
exact_match, exact_lock, disk);
}
register_disk(parent, disk);
-   blk_register_queue(disk);
 
/*
 * Take an extra ref on queue which will be put on disk_release()
@@ -693,6 +692,21 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
disk_add_events(disk);
blk_integrity_add(disk);
 }
+EXPORT_SYMBOL(device_add_disk_no_queue_reg);
+
+/**
+ * device_add_disk - add disk information to kernel list
+ * @parent: parent device for the disk
+ * @disk: per-device disk information
+ *
+ * Adds partitioning information and also registers the
+ * associated request_queue to @disk.
+ */
+void device_add_disk(struct device *parent, struct gendisk *disk)
+{
+   device_add_disk_no_queue_reg(parent, disk);
+   blk_register_queue(disk);
+}
 EXPORT_SYMBOL(device_add_disk);
 
 void del_gendisk(struct gendisk *disk)
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 5144ebe046c9..5e3531027b51 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -395,6 +395,11 @@ static inline void add_disk(struct gendisk *disk)
 {

[for-4.16 PATCH v5 4/4] dm: fix incomplete request_queue initialization

2018-01-12 Thread Mike Snitzer
DM is no longer prone to having its request_queue be improperly
initialized.

Summary of changes:

- defer DM's blk_register_queue() from add_disk()-time until
  dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().

- dm_setup_md_queue() is updated to fully initialize DM's request_queue
  (_after_ all table loads have occurred and the request_queue's type,
  features and limits are known).

A very welcome side-effect of these changes is DM no longer needs to:
1) backfill the "mq" sysfs entry (because historically DM didn't
initialize the request_queue to use blk-mq until _after_
blk_register_queue() was called via add_disk()).
2) call elv_register_queue() to get .request_fn request-based DM
device's "iosched" exposed in sysfs.

In addition, blk-mq debugfs support is now made available because
request-based DM's blk-mq request_queue is now properly initialized
before dm_setup_md_queue() calls blk_register_queue().

These changes also stave off the need to introduce new DM-specific
workarounds in block core, e.g. this proposal:
https://patchwork.kernel.org/patch/10067961/

In the end DM devices should be less unicorn in nature (relative to
initialization and availability of block core infrastructure provided by
the request_queue).

Signed-off-by: Mike Snitzer 
Tested-by: Ming Lei 
---
 drivers/md/dm-rq.c |  9 -
 drivers/md/dm.c| 11 ++-
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 9d32f25489c2..c28357f5cb0e 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -713,8 +713,6 @@ int dm_old_init_request_queue(struct mapped_device *md, struct dm_table *t)
return error;
}
 
-   elv_register_queue(md->queue);
-
return 0;
 }
 
@@ -812,15 +810,8 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
}
dm_init_md_queue(md);
 
-   /* backfill 'mq' sysfs registration normally done in blk_register_queue */
-   err = blk_mq_register_dev(disk_to_dev(md->disk), q);
-   if (err)
-   goto out_cleanup_queue;
-
return 0;
 
-out_cleanup_queue:
-   blk_cleanup_queue(q);
 out_tag_set:
blk_mq_free_tag_set(md->tag_set);
 out_kfree_tag_set:
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7475739fee49..8c26bfc35335 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1761,7 +1761,7 @@ static struct mapped_device *alloc_dev(int minor)
goto bad;
md->dax_dev = dax_dev;
 
-   add_disk(md->disk);
+   add_disk_no_queue_reg(md->disk);
format_dev_t(md->name, MKDEV(_major, minor));
 
md->wq = alloc_workqueue("kdmflush", WQ_MEM_RECLAIM, 0);
@@ -2021,6 +2021,7 @@ EXPORT_SYMBOL_GPL(dm_get_queue_limits);
 int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
 {
int r;
+   struct queue_limits limits;
enum dm_queue_mode type = dm_get_md_type(md);
 
switch (type) {
@@ -2057,6 +2058,14 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
break;
}
 
+   r = dm_calculate_queue_limits(t, &limits);
+   if (r) {
+   DMERR("Cannot calculate initial queue limits");
+   return r;
+   }
+   dm_table_set_restrictions(t, md->queue, &limits);
+   blk_register_queue(md->disk);
+
return 0;
 }
 
-- 
2.15.0



Re: [for-4.16 PATCH v4 2/4] block: use queue_lock when clearing QUEUE_FLAG_REGISTERED in blk_unregister_queue

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  9:14am -0500,
Ming Lei  wrote:

> On Fri, Jan 12, 2018 at 07:53:40AM -0500, Mike Snitzer wrote:
> > On Fri, Jan 12 2018 at  2:09am -0500,
> > Ming Lei  wrote:
> > 
> > > On Thu, Jan 11, 2018 at 03:14:15PM -0500, Mike Snitzer wrote:
> > > > blk_unregister_queue() must protect against any modifications of
> > > > q->queue_flags (not just those performed in blk-sysfs.c).  Therefore
> > > > q->queue_lock needs to be used rather than q->sysfs_lock.
> > > > 
> > > > Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed 
> > > > as request_queue is being removed")
> > > > Cc: sta...@vger.kernel.org # 4.14+
> > > > Reported-by: Bart Van Assche 
> > > > Signed-off-by: Mike Snitzer 
> > > > ---
> > > >  block/blk-sysfs.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > > > index 870484eaed1f..52f57539f1c7 100644
> > > > --- a/block/blk-sysfs.c
> > > > +++ b/block/blk-sysfs.c
> > > > @@ -929,9 +929,9 @@ void blk_unregister_queue(struct gendisk *disk)
> > > > if (WARN_ON(!q))
> > > > return;
> > > >  
> > > > -   mutex_lock(&q->sysfs_lock);
> > > > +   spin_lock_irq(q->queue_lock);
> > > > queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
> > > > -   mutex_unlock(&q->sysfs_lock);
> > > > +   spin_unlock_irq(q->queue_lock);
> > > >  
> > > > wbt_exit(q);
> > > 
> > > Hi Mike,
> > > 
> > > This change may not be correct, since at least e9a823fb34a8b depends
> > > on q->sysfs_lock to sync between testing the flag in __elevator_change()
> > > and clearing it here.
> > 
> > The header for commit e9a823fb34a8b says:
> > To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when
> > changing the elevator and use the request_queue's sysfs_lock to
> > serialize between clearing the flag and the elevator testing the flag.
> > 
> > The reality is sysfs_lock isn't needed to serialize between
> > blk_unregister_queue() clearing the flag and __elevator_change() testing
> > the flag.
> 
> Without holding sysfs_lock, __elevator_change() may just be called after
> q->kobj is deleted from blk_unregister_queue(), then the warning of
> 'WARNING: CPU: 3 PID: 14075 at lib/kobject.c:244 
> kobject_add_internal+0x103/0x2d0'
> can be triggered again.
> 
> BTW, __elevator_change() is always run with holding sysfs_lock.

Yes, I'm well aware.  Please see v5's PATCH 02/04 which is inbound now.

Thanks,
Mike


Re: [for-4.16 PATCH v4 2/4] block: use queue_lock when clearing QUEUE_FLAG_REGISTERED in blk_unregister_queue

2018-01-12 Thread Ming Lei
On Fri, Jan 12, 2018 at 07:53:40AM -0500, Mike Snitzer wrote:
> On Fri, Jan 12 2018 at  2:09am -0500,
> Ming Lei  wrote:
> 
> > On Thu, Jan 11, 2018 at 03:14:15PM -0500, Mike Snitzer wrote:
> > > blk_unregister_queue() must protect against any modifications of
> > > q->queue_flags (not just those performed in blk-sysfs.c).  Therefore
> > > q->queue_lock needs to be used rather than q->sysfs_lock.
> > > 
> > > Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as 
> > > request_queue is being removed")
> > > Cc: sta...@vger.kernel.org # 4.14+
> > > Reported-by: Bart Van Assche 
> > > Signed-off-by: Mike Snitzer 
> > > ---
> > >  block/blk-sysfs.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > > index 870484eaed1f..52f57539f1c7 100644
> > > --- a/block/blk-sysfs.c
> > > +++ b/block/blk-sysfs.c
> > > @@ -929,9 +929,9 @@ void blk_unregister_queue(struct gendisk *disk)
> > >   if (WARN_ON(!q))
> > >   return;
> > >  
> > > - mutex_lock(&q->sysfs_lock);
> > > + spin_lock_irq(q->queue_lock);
> > >   queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
> > > - mutex_unlock(&q->sysfs_lock);
> > > + spin_unlock_irq(q->queue_lock);
> > >  
> > >   wbt_exit(q);
> > 
> > Hi Mike,
> > 
> > This change may not be correct, since at least e9a823fb34a8b depends
> > on q->sysfs_lock to sync between testing the flag in __elevator_change()
> > and clearing it here.
> 
> The header for commit e9a823fb34a8b says:
> To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when
> changing the elevator and use the request_queue's sysfs_lock to
> serialize between clearing the flag and the elevator testing the flag.
> 
> The reality is sysfs_lock isn't needed to serialize between
> blk_unregister_queue() clearing the flag and __elevator_change() testing
> the flag.

Without holding sysfs_lock, __elevator_change() may just be called after
q->kobj is deleted from blk_unregister_queue(), then the warning of
'WARNING: CPU: 3 PID: 14075 at lib/kobject.c:244 
kobject_add_internal+0x103/0x2d0'
can be triggered again.

BTW, __elevator_change() is always run with holding sysfs_lock.

Thanks,
Ming


Re: [for-4.16 PATCH v4 2/4] block: use queue_lock when clearing QUEUE_FLAG_REGISTERED in blk_unregister_queue

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  2:09am -0500,
Ming Lei  wrote:

> On Thu, Jan 11, 2018 at 03:14:15PM -0500, Mike Snitzer wrote:
> > blk_unregister_queue() must protect against any modifications of
> > q->queue_flags (not just those performed in blk-sysfs.c).  Therefore
> > q->queue_lock needs to be used rather than q->sysfs_lock.
> > 
> > Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as 
> > request_queue is being removed")
> > Cc: sta...@vger.kernel.org # 4.14+
> > Reported-by: Bart Van Assche 
> > Signed-off-by: Mike Snitzer 
> > ---
> >  block/blk-sysfs.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index 870484eaed1f..52f57539f1c7 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -929,9 +929,9 @@ void blk_unregister_queue(struct gendisk *disk)
> > if (WARN_ON(!q))
> > return;
> >  
> > -   mutex_lock(&q->sysfs_lock);
> > +   spin_lock_irq(q->queue_lock);
> > queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
> > -   mutex_unlock(&q->sysfs_lock);
> > +   spin_unlock_irq(q->queue_lock);
> >  
> > wbt_exit(q);
> 
> Hi Mike,
> 
> This change may not be correct, since at least e9a823fb34a8b depends
> on q->sysfs_lock to sync between testing the flag in __elevator_change()
> and clearing it here.

The header for commit e9a823fb34a8b says:
To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when
changing the elevator and use the request_queue's sysfs_lock to
serialize between clearing the flag and the elevator testing the flag.

The reality is sysfs_lock isn't needed to serialize between
blk_unregister_queue() clearing the flag and __elevator_change() testing
the flag.

The original commit e9a823fb34a8b is pretty conflated.  "conflated" because
the resource being protected isn't the queue_flags (it is the 'queue'
kobj).

I'll respin v5 of this patchset to fix this up first, and then apply the
changes I _really_ need to land (DM queue initialization fix).

And then I'm going to slowly step away from block core and _not_ allow
myself to be tripped up, or baited, by historic block core issues for a
while... ;)

Thanks,
Mike


Re: [PATCH 0/2] blk-mq: support physical CPU hotplug

2018-01-12 Thread Johannes Thumshirn
On Fri, Jan 12, 2018 at 09:12:24AM +0100, Christian Borntraeger wrote:
> I think we also need cc stable for this. The bug was introduced with 
>  
> commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream.
> ("blk-mq: Create hctx for each present CPU")
> 
> and that was even backported into 4.12 stable.

Yes, and Ming (or Jens?), can you please add a Fixes tag as well? It helps
with identifying patches which fix backports.

Thanks,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de                                  +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH IMPROVEMENT] block, bfq: limit sectors served with interactive weight raising

2018-01-12 Thread Paolo Valente


> On 12 Jan 2018, at 11:15, Holger Hoffstätte 
>  wrote:
> 
> On 01/12/18 06:58, Paolo Valente wrote:
>> 
>> 
>>> On 28 Dec 2017, at 15:00, Holger Hoffstätte 
>>>  wrote:
>>> 
>>> 
>>> On 12/28/17 12:19, Paolo Valente wrote:
>>> (snip half a tech report ;)
>>> 
>>> So either this or the previous patch ("limit tags for writes and async I/O")
>>> can lead to a hard, unrecoverable hang with heavy writes. Since I couldn't
>>> log into the affected system anymore I couldn't get any stack traces, blk-mq
>>> debug output etc. but there was nothing in dmesg/on the console, so it
>>> wasn't a BUG/OOPS.
>>> 
>>> -h
>> 
>> Hi Holger,
>> if, as I guess, this problem hasn't gone away for you, I have two
>> requests:
>> 1) could you share your exact test
>> 2) if nothing happens in my systems with your test, would you be
>> willing to retry with the dev version of bfq?  It should be able to
>> tell us what takes to your hang.  If you are willing to do this test,
>> I'll prepare a branch with everything already configured for you.
> 
> Hi,
> 
> thanks for following up but there's no need for any of that; it turned out
> to be something else, since I got the same hang without those patches at
> least once (during a btrfs balance, even though it didn't look like btrfs'
> fault directly; more like block/mm/helpers).
> 
> So on January 7 I posted to linux-block et al. where I said
> "So this turned out to be something else, sorry for the false alarm."
> but apparently that didn't make it through since it's not in the
> archives either. Sorry.
> 
> Long story short, the good news is that I've been running with both patches
> since then without any issue. :)
> 

Wow, what a relief! :)

So, Jens, with the only reported issue now gone, can you please consider
queueing this patch and the other pending one [1]?  They are both
critical for bfq performance.

Thanks,
Paolo

[1] https://www.spinics.net/lists/kernel/msg2684463.html

> cheers
> Holger



Re: [PATCH IMPROVEMENT] block, bfq: limit sectors served with interactive weight raising

2018-01-12 Thread Holger Hoffstätte
On 01/12/18 06:58, Paolo Valente wrote:
> 
> 
>> On 28 Dec 2017, at 15:00, Holger Hoffstätte 
>>  wrote:
>>
>>
>> On 12/28/17 12:19, Paolo Valente wrote:
>> (snip half a tech report ;)
>>
>> So either this or the previous patch ("limit tags for writes and async I/O")
>> can lead to a hard, unrecoverable hang with heavy writes. Since I couldn't
>> log into the affected system anymore I couldn't get any stack traces, blk-mq
>> debug output etc. but there was nothing in dmesg/on the console, so it
>> wasn't a BUG/OOPS.
>>
>> -h
> 
> Hi Holger,
> if, as I guess, this problem hasn't gone away for you, I have two
> requests:
> 1) could you share your exact test
> 2) if nothing happens in my systems with your test, would you be
> willing to retry with the dev version of bfq?  It should be able to
> tell us what takes to your hang.  If you are willing to do this test,
> I'll prepare a branch with everything already configured for you.

Hi,

thanks for following up but there's no need for any of that; it turned out
to be something else, since I got the same hang without those patches at
least once (during a btrfs balance, even though it didn't look like btrfs'
fault directly; more like block/mm/helpers).

So on January 7 I posted to linux-block et al. where I said
"So this turned out to be something else, sorry for the false alarm."
but apparently that didn't make it through since it's not in the
archives either. Sorry.

Long story short, the good news is that I've been running with both patches
since then without any issue. :)

cheers
Holger


Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000436

2018-01-12 Thread Paolo Valente


> On 12 Jan 2018, at 05:18, Ming Lei  wrote:
> 
> On Thu, Jan 11, 2018 at 08:40:54AM -0700, Jens Axboe wrote:
>> On 1/11/18 2:41 AM, Paolo Valente wrote:
>>> 
>>> 
 On 11 Jan 2018, at 10:30, Paolo Valente 
  wrote:
 
 
 
> On 10 Jan 2018, at 05:58, Ming Lei  wrote:
> 
> Hi Paolo,
> 
> Looks this one is introduced in recent merge, and it is triggered
> in test of IO vs. removing device on the latest for-next of block
> tree:
> 
> [  296.151615] BUG: unable to handle kernel NULL pointer dereference at 
> 0436
> [  296.152302] IP: percpu_counter_add_batch+0x25/0x9d
> [  296.152698] PGD 0 P4D 0
> [  296.152916] Oops:  [#1] PREEMPT SMP
> [  296.153233] Dumping ftrace buffer:
> [  296.153517](ftrace buffer empty)
> [  296.153817] Modules linked in: scsi_debug(E) null_blk(E) isofs(E) 
> iTCO_wdt(E) iTCO_vendor_support(E) i2c_i801(E) i2c_core(E) lpc_ich(E) 
> mfd_core(E) ip_tables(E) sr_mod(E) cdrom(E) sd_mod(E) usb_storage(E) 
> ahci(E) libahci(E) libata(E) crc32c_intel(E) virtio_scsi(E) 
> qemu_fw_cfg(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last 
> unloaded: scsi_debug]
> [  296.156199] CPU: 2 PID: 2001 Comm: scsi_id Tainted: GE
> 4.15.0-rc7790529290ab3_my_v4.15-rc-block-for-next #4
> [  296.156961] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.9.3-1.fc25 04/01/2014
> [  296.157546] RIP: 0010:percpu_counter_add_batch+0x25/0x9d
> [  296.157920] RSP: 0018:c90001133b28 EFLAGS: 00010002
> [  296.158285] RAX: 0002 RBX: ff6e RCX: 
> 0001
> [  296.158779] RDX: 3fff RSI: 81e95429 RDI: 
> 81e5e64c
> [  296.159276] RBP: 0416 R08: 0001 R09: 
> 0001
> [  296.159770] R10: c90001133aa0 R11: 88007e96caf0 R12: 
> 3fff
> [  296.160273] R13: 0001 R14: 0001 R15: 
> e8d0b180
> [  296.160780] FS:  7f635988af80() GS:88007cf0() 
> knlGS:
> [  296.161325] CS:  0010 DS:  ES:  CR0: 80050033
> [  296.161744] CR2: 0436 CR3: 717f1005 CR4: 
> 003606e0
> [  296.162258] DR0:  DR1:  DR2: 
> 
> [  296.162764] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  296.163246] Call Trace:
> [  296.163445]  bfqg_stats_update_io_add+0x35/0xdc
> [  296.163754]  bfq_insert_requests+0xbb9/0xbf2
> [  296.164049]  ? blk_rq_bio_prep+0x51/0x5d
> [  296.164353]  ? blk_rq_append_bio+0x32/0x78
> [  296.164633]  ? blk_rq_map_user_iov+0x11b/0x1a0
> [  296.164940]  blk_mq_sched_insert_request+0xaf/0x12f
> [  296.165273]  blk_execute_rq+0x4b/0x93
> [  296.165586]  sg_io+0x236/0x38a
> [  296.165800]  scsi_cmd_ioctl+0x1d4/0x381
> [  296.166065]  ? seccomp_run_filters+0xee/0x12d
> [  296.166362]  ? preempt_count_add+0x6d/0x8c
> [  296.166643]  sd_ioctl+0xbb/0xde [sd_mod]
> [  296.166915]  blkdev_ioctl+0x7f2/0x850
> [  296.167167]  block_ioctl+0x39/0x3c
> [  296.167401]  vfs_ioctl+0x1b/0x28
> [  296.167622]  do_vfs_ioctl+0x4bc/0x542
> [  296.167877]  ? syscall_trace_enter+0x164/0x261
> [  296.168180]  SyS_ioctl+0x3e/0x5a
> [  296.168402]  do_syscall_64+0x71/0x168
> [  296.168654]  entry_SYSCALL64_slow_path+0x25/0x25
> [  296.168970] RIP: 0033:0x7f63593a1dc7
> [  296.169214] RSP: 002b:7fff87210878 EFLAGS: 0246 ORIG_RAX: 
> 0010
> [  296.169823] RAX: ffda RBX: 7fff872108b0 RCX: 
> 7f63593a1dc7
> [  296.170303] RDX: 7fff872108b0 RSI: 2285 RDI: 
> 0004
> [  296.170783] RBP: 7fff87210f00 R08: 2006 R09: 
> fe00
> [  296.171266] R10: 7fff87210910 R11: 0246 R12: 
> 
> [  296.171745] R13: 7fff872108b0 R14: 7fff872108ba R15: 
> 7fff872112b0
> [  296.172228] Code: 5d 41 5c 41 5d c3 41 55 41 54 49 89 f5 55 53 48 89 
> fd bf 01 00 00 00 41 89 d4 e8 8c 19 d4 ff 48 c7 c7 29 54 e9 81 e8 fd 3d 
> ff ff <48> 8b 45 20 49 63 d4 65 8b 18 48 63 db 4c 01 eb 48 39 d3 7d 0b
> [  296.173594] RIP: percpu_counter_add_batch+0x25/0x9d RSP: 
> c90001133b28
> [  296.174054] CR2: 0436
> [  296.174283] ---[ end trace 87b58f9235daac7e ]---
> [  296.174673] Kernel panic - not syncing: Fatal exception
> [  297.270432] Shutting down cpus with NMI
> [  297.271020] Dumping ftrace buffer:
> [  297.271262](ftrace buffer empty)
> [  297.271513] Kernel Offset: disabled
> [  

Re: [PATCH 0/2] blk-mq: support physical CPU hotplug

2018-01-12 Thread Christian Borntraeger
I think we also need cc stable for this. The bug was introduced with 
 
commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream.
("blk-mq: Create hctx for each present CPU")

and that was even backported into 4.12 stable.



On 01/12/2018 03:53 AM, Ming Lei wrote:
> Hi,
> 
> These two patches support physical CPU hotplug, so that blk-mq can
> scale well when a new physical CPU is added or removed; this use case
> is common in the VM world.
> 
> Also this patchset fixes the following warning reported by Christian
> Borntraeger:
> 
>   https://marc.info/?l=linux-block&m=151092973417143&w=2
> 
> Christoph Hellwig (2):
>   genirq/affinity: assign vectors to all possible CPUs
>   blk-mq: simplify queue mapping & schedule with each possisble CPU
> 
>  block/blk-mq.c| 19 ---
>  kernel/irq/affinity.c | 30 +++---
>  2 files changed, 23 insertions(+), 26 deletions(-)
> 



Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000436

2018-01-12 Thread Paolo Valente


> On 11 Jan 2018, at 16:40, Jens Axboe  wrote:
> 
> On 1/11/18 2:41 AM, Paolo Valente wrote:
>> 
>> 
>>> On 11 Jan 2018, at 10:30, Paolo Valente 
>>>  wrote:
>>> 
>>> 
>>> 
 On 10 Jan 2018, at 05:58, Ming Lei  wrote:
 
 Hi Paolo,
 
 Looks this one is introduced in recent merge, and it is triggered
 in test of IO vs. removing device on the latest for-next of block
 tree:
 
 [  296.151615] BUG: unable to handle kernel NULL pointer dereference at 
 0436
 [  296.152302] IP: percpu_counter_add_batch+0x25/0x9d
 [  296.152698] PGD 0 P4D 0
 [  296.152916] Oops:  [#1] PREEMPT SMP
 [  296.153233] Dumping ftrace buffer:
 [  296.153517](ftrace buffer empty)
 [  296.153817] Modules linked in: scsi_debug(E) null_blk(E) isofs(E) 
 iTCO_wdt(E) iTCO_vendor_support(E) i2c_i801(E) i2c_core(E) lpc_ich(E) 
 mfd_core(E) ip_tables(E) sr_mod(E) cdrom(E) sd_mod(E) usb_storage(E) 
 ahci(E) libahci(E) libata(E) crc32c_intel(E) virtio_scsi(E) qemu_fw_cfg(E) 
 dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: 
 scsi_debug]
 [  296.156199] CPU: 2 PID: 2001 Comm: scsi_id Tainted: GE
 4.15.0-rc7790529290ab3_my_v4.15-rc-block-for-next #4
 [  296.156961] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
 1.9.3-1.fc25 04/01/2014
 [  296.157546] RIP: 0010:percpu_counter_add_batch+0x25/0x9d
 [  296.157920] RSP: 0018:c90001133b28 EFLAGS: 00010002
 [  296.158285] RAX: 0002 RBX: ff6e RCX: 
 0001
 [  296.158779] RDX: 3fff RSI: 81e95429 RDI: 
 81e5e64c
 [  296.159276] RBP: 0416 R08: 0001 R09: 
 0001
 [  296.159770] R10: c90001133aa0 R11: 88007e96caf0 R12: 
 3fff
 [  296.160273] R13: 0001 R14: 0001 R15: 
 e8d0b180
 [  296.160780] FS:  7f635988af80() GS:88007cf0() 
 knlGS:
 [  296.161325] CS:  0010 DS:  ES:  CR0: 80050033
 [  296.161744] CR2: 0436 CR3: 717f1005 CR4: 
 003606e0
 [  296.162258] DR0:  DR1:  DR2: 
 
 [  296.162764] DR3:  DR6: fffe0ff0 DR7: 
 0400
 [  296.163246] Call Trace:
 [  296.163445]  bfqg_stats_update_io_add+0x35/0xdc
 [  296.163754]  bfq_insert_requests+0xbb9/0xbf2
 [  296.164049]  ? blk_rq_bio_prep+0x51/0x5d
 [  296.164353]  ? blk_rq_append_bio+0x32/0x78
 [  296.164633]  ? blk_rq_map_user_iov+0x11b/0x1a0
 [  296.164940]  blk_mq_sched_insert_request+0xaf/0x12f
 [  296.165273]  blk_execute_rq+0x4b/0x93
 [  296.165586]  sg_io+0x236/0x38a
 [  296.165800]  scsi_cmd_ioctl+0x1d4/0x381
 [  296.166065]  ? seccomp_run_filters+0xee/0x12d
 [  296.166362]  ? preempt_count_add+0x6d/0x8c
 [  296.166643]  sd_ioctl+0xbb/0xde [sd_mod]
 [  296.166915]  blkdev_ioctl+0x7f2/0x850
 [  296.167167]  block_ioctl+0x39/0x3c
 [  296.167401]  vfs_ioctl+0x1b/0x28
 [  296.167622]  do_vfs_ioctl+0x4bc/0x542
 [  296.167877]  ? syscall_trace_enter+0x164/0x261
 [  296.168180]  SyS_ioctl+0x3e/0x5a
 [  296.168402]  do_syscall_64+0x71/0x168
 [  296.168654]  entry_SYSCALL64_slow_path+0x25/0x25
 [  296.168970] RIP: 0033:0x7f63593a1dc7
 [  296.169214] RSP: 002b:7fff87210878 EFLAGS: 0246 ORIG_RAX: 
 0010
 [  296.169823] RAX: ffda RBX: 7fff872108b0 RCX: 
 7f63593a1dc7
 [  296.170303] RDX: 7fff872108b0 RSI: 2285 RDI: 
 0004
 [  296.170783] RBP: 7fff87210f00 R08: 2006 R09: 
 fe00
 [  296.171266] R10: 7fff87210910 R11: 0246 R12: 
 
 [  296.171745] R13: 7fff872108b0 R14: 7fff872108ba R15: 
 7fff872112b0
 [  296.172228] Code: 5d 41 5c 41 5d c3 41 55 41 54 49 89 f5 55 53 48 89 fd 
 bf 01 00 00 00 41 89 d4 e8 8c 19 d4 ff 48 c7 c7 29 54 e9 81 e8 fd 3d ff ff 
 <48> 8b 45 20 49 63 d4 65 8b 18 48 63 db 4c 01 eb 48 39 d3 7d 0b
 [  296.173594] RIP: percpu_counter_add_batch+0x25/0x9d RSP: 
 c90001133b28
 [  296.174054] CR2: 0436
 [  296.174283] ---[ end trace 87b58f9235daac7e ]---
 [  296.174673] Kernel panic - not syncing: Fatal exception
 [  297.270432] Shutting down cpus with NMI
 [  297.271020] Dumping ftrace buffer:
 [  297.271262](ftrace buffer empty)
 [  297.271513] Kernel Offset: disabled
 [  297.271760] ---[ end Kernel panic - not syncing: Fatal exception
 
>>> 
>>> Well, then don't do it! :)
>>> 
>>> Jokes aside, I've been trying to reproduce this