date:20180110

Re: [for-4.16 PATCH v3 3/3] dm: fix awkward and incomplete request_queue initialization

2018-01-10 Thread Hannes Reinecke

On 01/11/2018 03:12 AM, Mike Snitzer wrote:
> DM is now no longer prone to having its request_queue be improperly
> initialized.
> 
> Summary of changes:
> 
> - defer DM's blk_register_queue() from add_disk()-time until
>   dm_setup_md_queue() by setting QUEUE_FLAG_DEFER_REG in alloc_dev().
> 
> - dm_setup_md_queue() is updated to fully initialize DM's request_queue
>   (_after_ all table loads have occurred and the request_queue's type,
>   features and limits are known).
> 
> - various other small improvements that were noticed along the way.
> 
> A very welcome side-effect of these changes is DM no longer needs to:
> 1) backfill the "mq" sysfs entry (because historically DM didn't
> initialize the request_queue to use blk-mq until _after_
> register_queue() was called via add_disk()).
> 2) call elv_register_queue() to get .request_fn request-based DM
> device's "queue" exposed in syfs.
> 
> In addition, blk-mq debugfs support is now made available because
> request-based DM's blk-mq request_queue is now properly initialized
> before blk_register_queue() is called.
> 
> These changes also stave off the need to introduce new DM-specific
> workarounds in block core, e.g. this proposal:
> https://patchwork.kernel.org/patch/10067961/
> 
> In the end DM devices should be less unicorn in nature (relative to
> initialization and availability of block core infrastructure).
> 
> Signed-off-by: Mike Snitzer 
> ---
>  drivers/md/dm-core.h |  2 --
>  drivers/md/dm-rq.c   | 11 ---
>  drivers/md/dm.c  | 44 ++--
>  3 files changed, 26 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 6a14f945783c..f955123b4765 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -130,8 +130,6 @@ struct mapped_device {
>   struct srcu_struct io_barrier;
>  };
>  
> -void dm_init_md_queue(struct mapped_device *md);
> -void dm_init_normal_md_queue(struct mapped_device *md);
>  int md_in_flight(struct mapped_device *md);
>  void disable_write_same(struct mapped_device *md);
>  void disable_write_zeroes(struct mapped_device *md);
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 9d32f25489c2..3b319776d80c 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -700,7 +700,6 @@ int dm_old_init_request_queue(struct mapped_device *md, 
> struct dm_table *t)
>   /* disable dm_old_request_fn's merge heuristic by default */
>   md->seq_rq_merge_deadline_usecs = 0;
>  
> - dm_init_normal_md_queue(md);
>   blk_queue_softirq_done(md->queue, dm_softirq_done);
>  
>   /* Initialize the request-based DM worker thread */
> @@ -713,8 +712,6 @@ int dm_old_init_request_queue(struct mapped_device *md, 
> struct dm_table *t)
>   return error;
>   }
>  
> - elv_register_queue(md->queue);
> -
>   return 0;
>  }
>  
> @@ -810,17 +807,9 @@ int dm_mq_init_request_queue(struct mapped_device *md, 
> struct dm_table *t)
>   err = PTR_ERR(q);
>   goto out_tag_set;
>   }
> - dm_init_md_queue(md);
> -
> - /* backfill 'mq' sysfs registration normally done in blk_register_queue 
> */
> - err = blk_mq_register_dev(disk_to_dev(md->disk), q);
> - if (err)
> - goto out_cleanup_queue;
>  
>   return 0;
>  
> -out_cleanup_queue:
> - blk_cleanup_queue(q);
>  out_tag_set:
>   blk_mq_free_tag_set(md->tag_set);
>  out_kfree_tag_set:
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 7475739fee49..f5d61b6adaec 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1626,20 +1626,9 @@ static const struct dax_operations dm_dax_ops;
>  
>  static void dm_wq_work(struct work_struct *work);
>  
> -void dm_init_md_queue(struct mapped_device *md)
> -{
> - /*
> -  * Initialize data that will only be used by a non-blk-mq DM queue
> -  * - must do so here (in alloc_dev callchain) before queue is used
> -  */
> - md->queue->queuedata = md;
> - md->queue->backing_dev_info->congested_data = md;
> -}
> -
> -void dm_init_normal_md_queue(struct mapped_device *md)
> +static void dm_init_normal_md_queue(struct mapped_device *md)
>  {
>   md->use_blk_mq = false;
> - dm_init_md_queue(md);
>  
>   /*
>* Initialize aspects of queue that aren't relevant for blk-mq
> @@ -1734,10 +1723,15 @@ static struct mapped_device *alloc_dev(int minor)
>   md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id);
>   if (!md->queue)
>   goto bad;
> + md->queue->queuedata = md;
> + md->queue->backing_dev_info->congested_data = md;
> + /*
> +  * Do not allow add_disk() to blk_register_queue().
> +  * Defer blk_register_queue() until dm_setup_md_queue().
> +  */
> + queue_flag_set_unlocked(QUEUE_FLAG_DEFER_REG, md->queue);
>  
> - dm_init_md_queue(md);
> -
> - md->disk = alloc_disk_node(1, numa_node_id);
> + md->disk = alloc_disk_node(1, md->numa_node_

Re: [for-4.16 PATCH v3 2/3] block: allow gendisk's request_queue registration to be deferred

2018-01-10 Thread Hannes Reinecke

On 01/11/2018 03:12 AM, Mike Snitzer wrote:
> Since I can remember DM has forced the block layer to allow the
> allocation and initialization of the request_queue to be distinct
> operations.  Reason for this is block/genhd.c:add_disk() has requires
> that the request_queue (and associated bdi) be tied to the gendisk
> before add_disk() is called -- because add_disk() also deals with
> exposing the request_queue via blk_register_queue().
> 
> DM's dynamic creation of arbitrary device types (and associated
> request_queue types) requires the DM device's gendisk be available so
> that DM table loads can establish a master/slave relationship with
> subordinate devices that are referenced by loaded DM tables -- using
> bd_link_disk_holder().  But until these DM tables, and their associated
> subordinate devices, are known DM cannot know what type of request_queue
> it needs -- nor what its queue_limits should be.
> 
> This chicken and egg scenario has created all manner of problems for DM
> and, at times, the block layer.
> 
> Summary of changes:
> 
> - Add QUEUE_FLAG_DEFER_REG that a block driver can set to defer the
>   required blk_register_queue() until the block driver's request_queue
>   is fully initialized.
> 
> - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
>   is not set.  It won't be set if a request_queue with
>   QUEUE_FLAG_DEFER_REG set never called blk_register_queue() -- as can
>   happen if a driver encounters an error and must del_gendisk() before
>   calling blk_register_queue().
> 
> - Export blk_register_queue().
> 
> These changes allow DM to use device_add_disk() to anchor its gendisk as
> the "master" for master/slave relationships DM must establish with
> subordinate devices referenced in DM tables that get loaded.  Once all
> "slave" devices for a DM device are known a request_queue can be
> properly initialized and then advertised via sysfs -- important
> improvement being that no request_queue resource initialization is
> missed -- before these changes DM was known to be missing blk-mq debugfs
> and proper block throttle initialization.
> 
> Signed-off-by: Mike Snitzer 
> ---
>  block/blk-sysfs.c  | 4 
>  block/genhd.c  | 4 +++-
>  include/linux/blkdev.h | 1 +
>  3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 870484eaed1f..2395122875b4 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -921,6 +921,7 @@ int blk_register_queue(struct gendisk *disk)
>   mutex_unlock(&q->sysfs_lock);
>   return ret;
>  }
> +EXPORT_SYMBOL_GPL(blk_register_queue);
>  
>  void blk_unregister_queue(struct gendisk *disk)
>  {
> @@ -929,6 +930,9 @@ void blk_unregister_queue(struct gendisk *disk)
>   if (WARN_ON(!q))
>   return;
>  
> + if (!test_bit(QUEUE_FLAG_REGISTERED, &disk->queue->queue_flags))
> + return;
> +
>   mutex_lock(&q->sysfs_lock);
>   queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
>   mutex_unlock(&q->sysfs_lock);
> diff --git a/block/genhd.c b/block/genhd.c
> index 00620e01e043..3912a82ecc4f 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -682,7 +682,6 @@ void device_add_disk(struct device *parent, struct 
> gendisk *disk)
>   exact_match, exact_lock, disk);
>   }
>   register_disk(parent, disk);
> - blk_register_queue(disk);
>  
>   /*
>* Take an extra ref on queue which will be put on disk_release()
> @@ -692,6 +691,9 @@ void device_add_disk(struct device *parent, struct 
> gendisk *disk)
>  
>   disk_add_events(disk);
>   blk_integrity_add(disk);
> +
> + if (!test_bit(QUEUE_FLAG_DEFER_REG, &disk->queue->queue_flags))
> + blk_register_queue(disk);
>  }
>  EXPORT_SYMBOL(device_add_disk);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 71a9371c8182..84d144c7b025 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -681,6 +681,7 @@ struct request_queue {
>  #define QUEUE_FLAG_SCSI_PASSTHROUGH 27   /* queue supports SCSI commands 
> */
>  #define QUEUE_FLAG_QUIESCED28/* queue has been quiesced */
>  #define QUEUE_FLAG_PREEMPT_ONLY  29  /* only process REQ_PREEMPT 
> requests */
> +#define QUEUE_FLAG_DEFER_REG 30  /* defer registering queue to a disk */
>  
>  #define QUEUE_FLAG_DEFAULT   ((1 << QUEUE_FLAG_IO_STAT) |\
>(1 << QUEUE_FLAG_SAME_COMP)|   \
> 
Thinking of this a bit more, wouldn't it be better to modify add_disk()
(or, rather, device_add_disk()) to handle this case?
You already moved the call to 'blk_register_queue()' to the end of
device_add_disk(), so it would be trivial to make device_add_disk() a
wrapper function like

void device_add_disk(struct device *parent, struct gendisk *disk) {
  blk_add_disk(parent, disk);
  blk_register_queue(disk);
}

and then call blk_add_disk() from the dm-core.
That would

Re: [for-4.16 PATCH v3 2/3] block: allow gendisk's request_queue registration to be deferred

2018-01-10 Thread Hannes Reinecke

On 01/11/2018 03:12 AM, Mike Snitzer wrote:
> Since I can remember DM has forced the block layer to allow the
> allocation and initialization of the request_queue to be distinct
> operations.  Reason for this is block/genhd.c:add_disk() has requires
> that the request_queue (and associated bdi) be tied to the gendisk
> before add_disk() is called -- because add_disk() also deals with
> exposing the request_queue via blk_register_queue().
> 
> DM's dynamic creation of arbitrary device types (and associated
> request_queue types) requires the DM device's gendisk be available so
> that DM table loads can establish a master/slave relationship with
> subordinate devices that are referenced by loaded DM tables -- using
> bd_link_disk_holder().  But until these DM tables, and their associated
> subordinate devices, are known DM cannot know what type of request_queue
> it needs -- nor what its queue_limits should be.
> 
> This chicken and egg scenario has created all manner of problems for DM
> and, at times, the block layer.
> 
> Summary of changes:
> 
> - Add QUEUE_FLAG_DEFER_REG that a block driver can set to defer the
>   required blk_register_queue() until the block driver's request_queue
>   is fully initialized.
> 
> - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
>   is not set.  It won't be set if a request_queue with
>   QUEUE_FLAG_DEFER_REG set never called blk_register_queue() -- as can
>   happen if a driver encounters an error and must del_gendisk() before
>   calling blk_register_queue().
> 
> - Export blk_register_queue().
> 
> These changes allow DM to use device_add_disk() to anchor its gendisk as
> the "master" for master/slave relationships DM must establish with
> subordinate devices referenced in DM tables that get loaded.  Once all
> "slave" devices for a DM device are known a request_queue can be
> properly initialized and then advertised via sysfs -- important
> improvement being that no request_queue resource initialization is
> missed -- before these changes DM was known to be missing blk-mq debugfs
> and proper block throttle initialization.
> 
> Signed-off-by: Mike Snitzer 
> ---
>  block/blk-sysfs.c  | 4 
>  block/genhd.c  | 4 +++-
>  include/linux/blkdev.h | 1 +
>  3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 870484eaed1f..2395122875b4 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -921,6 +921,7 @@ int blk_register_queue(struct gendisk *disk)
>   mutex_unlock(&q->sysfs_lock);
>   return ret;
>  }
> +EXPORT_SYMBOL_GPL(blk_register_queue);
>  
>  void blk_unregister_queue(struct gendisk *disk)
>  {
> @@ -929,6 +930,9 @@ void blk_unregister_queue(struct gendisk *disk)
>   if (WARN_ON(!q))
>   return;
>  
> + if (!test_bit(QUEUE_FLAG_REGISTERED, &disk->queue->queue_flags))
> + return;
> +
>   mutex_lock(&q->sysfs_lock);
>   queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
>   mutex_unlock(&q->sysfs_lock);
Why can't we use test_and_clear_bit() here?
Shouldn't that relieve the need for the mutex_lock() here, too?

> diff --git a/block/genhd.c b/block/genhd.c
> index 00620e01e043..3912a82ecc4f 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -682,7 +682,6 @@ void device_add_disk(struct device *parent, struct 
> gendisk *disk)
>   exact_match, exact_lock, disk);
>   }
>   register_disk(parent, disk);
> - blk_register_queue(disk);
>  
>   /*
>* Take an extra ref on queue which will be put on disk_release()
> @@ -692,6 +691,9 @@ void device_add_disk(struct device *parent, struct 
> gendisk *disk)
>  
>   disk_add_events(disk);
>   blk_integrity_add(disk);
> +
> + if (!test_bit(QUEUE_FLAG_DEFER_REG, &disk->queue->queue_flags))
> + blk_register_queue(disk);
>  }
>  EXPORT_SYMBOL(device_add_disk);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 71a9371c8182..84d144c7b025 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -681,6 +681,7 @@ struct request_queue {
>  #define QUEUE_FLAG_SCSI_PASSTHROUGH 27   /* queue supports SCSI commands 
> */
>  #define QUEUE_FLAG_QUIESCED28/* queue has been quiesced */
>  #define QUEUE_FLAG_PREEMPT_ONLY  29  /* only process REQ_PREEMPT 
> requests */
> +#define QUEUE_FLAG_DEFER_REG 30  /* defer registering queue to a disk */
>  
>  #define QUEUE_FLAG_DEFAULT   ((1 << QUEUE_FLAG_IO_STAT) |\
>(1 << QUEUE_FLAG_SAME_COMP)|   \
> 
Where is this flag used?

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [for-4.16 PATCH v3 1/3] block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN

2018-01-10 Thread Hannes Reinecke

On 01/11/2018 03:12 AM, Mike Snitzer wrote:
> device_add_disk() will only call bdi_register_owner() if
> !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
> bdi_unregister() if !GENHD_FL_HIDDEN.
> 
> Found with code inspection.  bdi_unregister() won't do any harm if
> bdi_register_owner() wasn't used but best to avoid the unnecessary
> call to bdi_unregister().
> 
> Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
> Signed-off-by: Mike Snitzer 
> ---
>  block/genhd.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 96a66f671720..00620e01e043 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -725,7 +725,8 @@ void del_gendisk(struct gendisk *disk)
>* Unregister bdi before releasing device numbers (as they can
>* get reused and we'd get clashes in sysfs).
>*/
> - bdi_unregister(disk->queue->backing_dev_info);
> + if (!(disk->flags & GENHD_FL_HIDDEN))
> + bdi_unregister(disk->queue->backing_dev_info);
>   blk_unregister_queue(disk);
>   } else {
>   WARN_ON(1);
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [PATCH v2 2/2] blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

2018-01-10 Thread Hannes Reinecke

On 01/10/2018 08:39 PM, Bart Van Assche wrote:
> Both add_wait_queue() and blk_mq_dispatch_wake() protect wait queue
> manipulations with the wait queue lock. Hence also protect the
> !list_empty(&wait->entry) test with the wait queue lock instead of
> the hctx lock.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---
>  block/blk-mq.c | 19 +--
>  1 file changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e770e8814f60..d5313ce60836 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1184,7 +1184,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
> **hctx,
>   bool shared_tags = (this_hctx->flags & BLK_MQ_F_TAG_SHARED) != 0;
>   struct sbq_wait_state *ws;
>   wait_queue_entry_t *wait;
> - bool ret;
> + bool on_wait_list, ret;
>  
>   if (!shared_tags) {
>   if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
> @@ -1204,13 +1204,15 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
> **hctx,
>   if (!list_empty_careful(&wait->entry))
>   return false;
>  
> - spin_lock(&this_hctx->lock);
> - if (!list_empty(&wait->entry)) {
> - spin_unlock(&this_hctx->lock);
> + ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
> +
> + spin_lock_irq(&ws->wait.lock);
> + on_wait_list = !list_empty(&wait->entry);
> + spin_unlock_irq(&ws->wait.lock);
> +
> + if (on_wait_list)
>   return false;
> - }
>  
> - ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
>   add_wait_queue(&ws->wait, wait);
>   /*
>* It's possible that a tag was freed in the window between the
I'm actually not that convinced with this change; originally we had been
checking if it's on the wait list, and only _then_ call bt_wait_ptr().
Now we call bt_wait_ptr() always, meaning we run a chance of increasing
the bitmap_tags wait pointer without actually using it.
Looking at the code I'm not sure this is the correct way of using it ...

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [PATCH v2 1/2] blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()

2018-01-10 Thread Hannes Reinecke

On 01/10/2018 08:39 PM, Bart Van Assche wrote:
> This patch does not change any functionality but makes the
> blk_mq_mark_tag_wait() code slightly easier to read.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---
>  block/blk-mq.c | 30 +++---
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: scsi: memory leak in sg_start_req

2018-01-10 Thread Douglas Gilbert


On 2018-01-09 11:05 AM, Dmitry Vyukov wrote:

Hello,

syzkaller has found the following memory leak:

unreferenced object 0x88004c19 (size 8328):
   comm "syz-executor", pid 4627, jiffies 4294749150 (age 45.507s)
   hex dump (first 32 bytes):
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
 20 00 00 00 22 01 00 00 00 00 00 00 04 00 00 00   ..."...
   backtrace:
 [<5955b5a9>] kmalloc_order+0x59/0x80 mm/slab_common.c:1124
 [<43ae006e>] kmalloc_order_trace+0x1f/0x160 mm/slab_common.c:1133
 [] kmalloc_large include/linux/slab.h:433 [inline]
 [] __kmalloc+0x2c4/0x340 mm/slub.c:3751
 [] kmalloc include/linux/slab.h:504 [inline]
 [] bio_alloc_bioset+0x4d5/0x7e0 block/bio.c:450
 [] bio_kmalloc include/linux/bio.h:410 [inline]
 [] bio_copy_user_iov+0x2be/0xcb0 block/bio.c:1226
 [<1d0b79ed>] __blk_rq_map_user_iov block/blk-map.c:67 [inline]
 [<1d0b79ed>] blk_rq_map_user_iov+0x2b6/0x7d0 block/blk-map.c:136
 [<4200a869>] blk_rq_map_user+0x11e/0x170 block/blk-map.c:166
 [<8f21739e>] sg_start_req drivers/scsi/sg.c:1794 [inline]
 [<8f21739e>] sg_common_write.isra.16+0x14df/0x1ed0
drivers/scsi/sg.c:777
 [<093f61e3>] sg_write+0x8a7/0xd7b drivers/scsi/sg.c:677
 [] __vfs_write+0x10d/0x8f0 fs/read_write.c:480
 [<0638f16f>] vfs_write+0x1fd/0x570 fs/read_write.c:544
 [<6a7e6867>] SYSC_write fs/read_write.c:589 [inline]
 [<6a7e6867>] SyS_write+0xfa/0x250 fs/read_write.c:581

can be reproduced with the following program:

// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include 
#include 
#include 
#include 

int main()
{
   int fd = open("/dev/sg1", O_RDWR);
   const char *data =
  "\xb6\x3d\xb8\x5e\x1e\x8d\x22\x00\x00\x00\x00\x00\x00\x08\xaf\xd6\x1d"
  "\xcc\x43\x6a\xed\x5e\xd2\xbc\x70\x18\xce\xbc\x9b\x97\xae\x21\x91\x4d"
  "\x87\x2c\x67\x8c\xe2\x2c\x9b\x16\x0e\x96\xaa\x1f\xae\x1a";
   write(fd, data, 0x30);
   return 0;
}

if executed in a loop, memory consumption grows infinitely.

on upstream commit b2cd1df66037e7c4697c7e40496bf7e4a5e16a2d


The seemingly random data that program is sending is asking for
a buffer of 2,264,314 bytes which the sg driver procures and waits
for the caller to either issue a read() or close() the file
or shutdown the program. The test program does none of those
expected operations, it simply asks for the same resources again.

In my version of your test code (attached), that happens 1,021 times
at which point the file handles in that process are exhausted and
all subsequent open()s fail with EBADF (as do the write()s). The
output from my program was this on one run:

# ./sg_syzk_grow
First errno=9 [Bad file descriptor] index=1021
done_count=5, err_count=48979, last_errno=9 [Bad file descriptor]

# lsscsi -gs
[0:0:0:0]  disk  Linux  scsi_debug  0186  /dev/sda  /dev/sg0  2.14GB

Monitoring that program with 'free' from another terminal I see
about 2.5 GBytes of ram "swallowed" almost immediately when the test
program runs. When the program exits (about 50 seconds later) as far
as I can see all that ram is given back.


If you used the same program and wrote to a regular file rather than
a sg device, then that program would eventually fill any file system,
at the rate of 48 bytes per iteration (given enough file descriptor
resources). The sg driver, using its original 1994 interface,
deprecated for around 18 years, just gets a system to resource
exhaustion quicker.

Doug Gilbert




// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static const char * sg_dev = "/dev/sg0";

int main()
{
  const char *data =
 "\xb6\x3d\xb8\x5e\x1e\x8d\x22\x00\x00\x00\x00\x00\x00\x08\xaf\xd6\x1d"
 "\xcc\x43\x6a\xed\x5e\xd2\xbc\x70\x18\xce\xbc\x9b\x97\xae\x21\x91\x4d"
 "\x87\x2c\x67\x8c\xe2\x2c\x9b\x16\x0e\x96\xaa\x1f\xae\x1a";
int k, res, first_errno, prev_errno, first_err_index;
int err_count = 0;
int done_count = 0;
bool got_1st_errno = false;

for (k = 0; k < 1; ++k) {
int fd = open(sg_dev, O_RDWR);

  	res = write(fd, data, 0x30);
	if (res < 0) {
	if (! got_1st_errno) {
		got_1st_errno = true;
		first_errno = errno;
		first_err_index = done_count + k;
		printf("First errno=%d [%s] index=%d\n", first_errno,
		   strerror(first_errno), first_err_index);
	}
	++err_count;
	}
}
done_count += k;
sleep(10);
for (k = 0; k < 1; ++k) {
int fd = open(sg_dev, O_RDWR);

  	res = write(fd, data, 0x30);
	if (res < 0) {
	if (! got_1st_errno) {
		got_1st_errno = true;
		first_errno = errno;
		printf("First errno=%d [%s] index=%d\n", first_errno,
		   strerror(f

[PATCH V3 4/5] blk-mq: return dispatch result to caller in blk_mq_try_issue_directly

2018-01-10 Thread Ming Lei

In the following patch, we will use blk_mq_try_issue_directly() for DM
to return the dispatch result, and DM need this informatin to improve
IO merge.

Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d27c1a265d54..444a4bf9705f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1711,13 +1711,13 @@ static bool __blk_mq_issue_req(struct blk_mq_hw_ctx 
*hctx,
return false;
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq,
-   blk_qc_t *cookie)
+static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   blk_qc_t *cookie)
 {
struct request_queue *q = rq->q;
blk_qc_t new_cookie;
-   blk_status_t ret;
+   blk_status_t ret = BLK_STS_OK;
bool run_queue = true;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -1740,31 +1740,36 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
switch (ret) {
case BLK_STS_OK:
*cookie = new_cookie;
-   return;
+   return ret;
case BLK_STS_RESOURCE:
__blk_mq_requeue_request(rq);
goto insert;
default:
*cookie = BLK_QC_T_NONE;
blk_mq_end_request(rq, ret);
-   return;
+   return ret;
}
 
 insert:
blk_mq_sched_insert_request(rq, false, run_queue, false,
hctx->flags & BLK_MQ_F_BLOCKING);
+   return ret;
 }
 
-static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq, blk_qc_t *cookie)
+static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+ struct request *rq,
+ blk_qc_t *cookie)
 {
int srcu_idx;
+   blk_status_t ret;
 
might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
hctx_lock(hctx, &srcu_idx);
-   __blk_mq_try_issue_directly(hctx, rq, cookie);
+   ret = __blk_mq_try_issue_directly(hctx, rq, cookie);
hctx_unlock(hctx, srcu_idx);
+
+   return ret;
 }
 
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
-- 
2.9.5

[PATCH V3 5/5] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-10 Thread Ming Lei

blk_insert_cloned_request() is called in fast path of dm-rq driver, and
in this function we append request to hctx->dispatch_list of the underlying
queue directly.

1) This way isn't efficient enough because hctx lock is always required

2) With blk_insert_cloned_request(), we bypass underlying queue's IO scheduler
totally, and depend on DM rq driver to do IO schedule completely. But DM
rq driver can't get underlying queue's dispatch feedback at all, and this
information is extreamly useful for IO merge. Without that IO merge can't
be done basically by blk-mq, and causes very bad sequential IO performance.

This patch makes use of blk_mq_try_issue_directly() to dispatch rq to
underlying queue and provides disptch result to dm-rq and blk-mq, and
improves the above situations very much.

With this patch, seqential IO is improved by 3X in my test over
mpath/virtio-scsi.

Signed-off-by: Ming Lei 
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c | 34 ++
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 ---
 4 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f843ae4f858d..30759fcc93e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   blk_mq_request_bypass_insert(rq, true);
-   return BLK_STS_OK;
+   return blk_mq_request_direct_issue(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 444a4bf9705f..9bfca039d526 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1719,6 +1719,12 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
blk_qc_t new_cookie;
blk_status_t ret = BLK_STS_OK;
bool run_queue = true;
+   /*
+* This function is called from blk_insert_cloned_request() if
+* 'cookie' is NULL, and for dispatching this request only.
+*/
+   bool dispatch_only = !cookie;
+   bool need_insert;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
@@ -1726,10 +1732,19 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (q->elevator)
+   if (q->elevator && !dispatch_only)
goto insert;
 
-   if (__blk_mq_issue_req(hctx, rq, &new_cookie, &ret))
+   need_insert = __blk_mq_issue_req(hctx, rq, &new_cookie, &ret);
+   if (dispatch_only) {
+   if (need_insert)
+   return BLK_STS_RESOURCE;
+   if (ret == BLK_STS_RESOURCE)
+   __blk_mq_requeue_request(rq);
+   return ret;
+   }
+
+   if (need_insert)
goto insert;
 
/*
@@ -1751,8 +1766,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
}
 
 insert:
-   blk_mq_sched_insert_request(rq, false, run_queue, false,
-   hctx->flags & BLK_MQ_F_BLOCKING);
+   if (!dispatch_only)
+   blk_mq_sched_insert_request(rq, false, run_queue, false,
+   hctx->flags & BLK_MQ_F_BLOCKING);
+   else
+   blk_mq_request_bypass_insert(rq, run_queue);
return ret;
 }
 
@@ -1772,6 +1790,14 @@ static blk_status_t blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
return ret;
 }
 
+blk_status_t blk_mq_request_direct_issue(struct request *rq)
+{
+   struct blk_mq_ctx *ctx = rq->mq_ctx;
+   struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
+
+   return blk_mq_try_issue_directly(hctx, rq, NULL);
+}
+
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
const int is_sync = op_is_sync(bio->bi_opf);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8591a54d989b..bd98864d8e38 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -74,6 +74,9 @@ void blk_mq_request_bypass_insert(struct request *rq, bool 
run_queue);
 void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
struct list_head *list);
 
+/* Used by DM for issuing req directly */
+blk_status_t blk_mq_request_direct_issue(struct request *rq);
+
 /*
  * CPU -> queue mappings
  */
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 4d157b14d302..6b01298fba06 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -395,7 +395,7 @@ static void end_clone_request(struct request *clone, 
blk_status_t error)
dm_complete_request(tio->orig, error);
 }
 
-static void dm_dispatch_clone_request(struct request *clone, struct request 
*rq)
+static blk_status

[PATCH V3 3/5] blk-mq: move actual issue into one helper

2018-01-10 Thread Ming Lei

No functional change, just to clean up code a bit, so that the following
change of using direct issue for blk_mq_request_bypass_insert() which is
needed by DM can be easier to do.

Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 39 +++
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9aa24c9508f9..d27c1a265d54 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1684,15 +1684,38 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx 
*hctx, struct request *rq)
return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq,
-   blk_qc_t *cookie)
+/* return true if insert is needed */
+static bool __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
+  struct request *rq,
+  blk_qc_t *new_cookie,
+  blk_status_t *ret)
 {
struct request_queue *q = rq->q;
struct blk_mq_queue_data bd = {
.rq = rq,
.last = true,
};
+
+   if (!blk_mq_get_driver_tag(rq, NULL, false))
+   return true;
+
+   if (!blk_mq_get_dispatch_budget(hctx)) {
+   blk_mq_put_driver_tag(rq);
+   return true;
+   }
+
+   *new_cookie = request_to_qc_t(hctx, rq);
+
+   *ret = q->mq_ops->queue_rq(hctx, &bd);
+
+   return false;
+}
+
+static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   blk_qc_t *cookie)
+{
+   struct request_queue *q = rq->q;
blk_qc_t new_cookie;
blk_status_t ret;
bool run_queue = true;
@@ -1706,22 +1729,14 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
if (q->elevator)
goto insert;
 
-   if (!blk_mq_get_driver_tag(rq, NULL, false))
+   if (__blk_mq_issue_req(hctx, rq, &new_cookie, &ret))
goto insert;
 
-   if (!blk_mq_get_dispatch_budget(hctx)) {
-   blk_mq_put_driver_tag(rq);
-   goto insert;
-   }
-
-   new_cookie = request_to_qc_t(hctx, rq);
-
/*
 * For OK queue, we are done. For error, kill it. Any other
 * error (busy), just add it to our list as we previously
 * would have done
 */
-   ret = q->mq_ops->queue_rq(hctx, &bd);
switch (ret) {
case BLK_STS_OK:
*cookie = new_cookie;
-- 
2.9.5

[PATCH V3 2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure

2018-01-10 Thread Ming Lei

blk-mq will rerun queue via RESTART or dispatch wake after one request
is completed, so not necessary to wait random time for requeuing, we
should trust blk-mq to do it.

More importantly, we need return BLK_STS_RESOURCE to blk-mq so that
dequeue from I/O scheduler can be stopped, then I/O merge gets improved.

Signed-off-by: Ming Lei 
---
 drivers/md/dm-mpath.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 86bf502a8e51..fcddf5a62581 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, 
struct request *rq,
if (queue_dying) {
atomic_inc(&m->pg_init_in_progress);
activate_or_offline_path(pgpath);
+   return DM_MAPIO_DELAY_REQUEUE;
}
-   return DM_MAPIO_DELAY_REQUEUE;
+
+   /*
+* blk-mq's SCHED_RESTART can cover this requeue, so
+* we needn't to deal with it by DELAY_REQUEUE. More
+* importantly, we have to return DM_MAPIO_REQUEUE
+* so that blk-mq can get the queue busy feedback,
+* otherwise I/O merge can be hurt.
+*/
+   if (q->mq_ops)
+   return DM_MAPIO_REQUEUE;
+   else
+   return DM_MAPIO_DELAY_REQUEUE;
}
clone->bio = clone->biotail = NULL;
clone->rq_disk = bdev->bd_disk;
-- 
2.9.5

[PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-10 Thread Ming Lei

Hi Guys,

The 1st patch removes the workaround of blk_mq_delay_run_hw_queue() in
case of requeue, this way isn't necessary, and more worse, it makes
BLK_MQ_S_SCHED_RESTART not working, and degarde I/O performance.

The 2nd patch return DM_MAPIO_REQUEUE to dm-rq if underlying request
allocation fails, then we can return BLK_STS_RESOURCE from dm-rq to
blk-mq, so that blk-mq can hold the requests to be dequeued.

The other 3 paches changes the blk-mq part of blk_insert_cloned_request(),
in which we switch to blk_mq_try_issue_directly(), so that both dm-rq
and blk-mq can get the dispatch result of underlying queue, and with
this information, blk-mq can handle IO merge much better, then
sequential I/O performance is improved much.

In my dm-mpath over virtio-scsi test, this whole patchset improves
sequential IO by 3X ~ 5X.

V3:
- rebase on the latest for-4.16/block of block tree
- add missed pg_init_all_paths() in patch 1, according to Bart's review

V2:
- drop 'dm-mpath: cache ti->clone during requeue', which is a bit
too complicated, and not see obvious performance improvement.
- make change on blk-mq part cleaner

Ming Lei (5):
  dm-mpath: don't call blk_mq_delay_run_hw_queue() in case of
BLK_STS_RESOURCE
  dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure
  blk-mq: move actual issue into one helper
  blk-mq: return dispatch result to caller in blk_mq_try_issue_directly
  blk-mq: issue request directly for blk_insert_cloned_request

 block/blk-core.c  |  3 +-
 block/blk-mq.c| 86 +++
 block/blk-mq.h|  3 ++
 drivers/md/dm-mpath.c | 19 +---
 drivers/md/dm-rq.c| 20 +---
 5 files changed, 101 insertions(+), 30 deletions(-)

-- 
2.9.5

[PATCH V3 1/5] dm-mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2018-01-10 Thread Ming Lei

If .queue_rq() returns BLK_STS_RESOURCE, blk-mq will rerun the queue in
the three situations:

1) if BLK_MQ_S_SCHED_RESTART is set
- queue is rerun after one rq is completed, see blk_mq_sched_restart()
which is run from blk_mq_free_request()

2) run out of driver tag
- queue is rerun after one tag is freed

3) otherwise
- queue is run immediately in blk_mq_dispatch_rq_list()

This random dealy of running hw queue is introduced by commit 6077c2d706097c0
(dm rq: Avoid that request processing stalls sporadically), which claimed
one request processing stalling is fixed, but never explained the behind
idea, and it is a workaound at most. Even the question isn't explained by
anyone in recent discussion.

Also calling blk_mq_delay_run_hw_queue() inside .queue_rq() is a horrible
hack because it makes BLK_MQ_S_SCHED_RESTART not working, and degrades I/O
peformance a lot.

Finally this patch makes sure that dm-rq returns BLK_STS_RESOURCE to blk-mq
only when underlying queue is out of resource, so we switch to return
DM_MAPIO_DELAY_REQUEU if either MPATHF_QUEUE_IO or MPATHF_PG_INIT_REQUIRED
is set in multipath_clone_and_map().

Signed-off-by: Ming Lei 
---
 drivers/md/dm-mpath.c | 5 ++---
 drivers/md/dm-rq.c| 1 -
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index f7810cc869ac..86bf502a8e51 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -516,9 +516,8 @@ static int multipath_clone_and_map(struct dm_target *ti, 
struct request *rq,
return DM_MAPIO_KILL;
} else if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
   test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
-   if (pg_init_all_paths(m))
-   return DM_MAPIO_DELAY_REQUEUE;
-   return DM_MAPIO_REQUEUE;
+   pg_init_all_paths(m);
+   return DM_MAPIO_DELAY_REQUEUE;
}
 
memset(mpio, 0, sizeof(*mpio));
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3b319776d80c..4d157b14d302 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -755,7 +755,6 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx 
*hctx,
/* Undo dm_start_request() before requeuing */
rq_end_stats(md, rq);
rq_completed(md, rq_data_dir(rq), false);
-   blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
return BLK_STS_RESOURCE;
}
 
-- 
2.9.5

[PATCH] blktrace: support enabling cgroup info per-device

2018-01-10 Thread Hou Tao

Now blktrace supports outputting cgroup info for trace action and
trace message, however, it can only be enabled globally by writing
"blk_cgroup" to trace_options file, and there is no per-device API
for the new functionality.

Adding a new field (enable_cg_info) by using the pad after act_mask
in struct blk_user_trace_setup and a new attr file (cgroup_info) under
/sys/block/$dev/trace dir, so BLKTRACESETUP ioctl and sysfs file
can be used to enable cgroup info for selected block devices.

Signed-off-by: Hou Tao 
---
 include/linux/blktrace_api.h  |  2 ++
 include/uapi/linux/blktrace_api.h |  1 +
 kernel/trace/blktrace.c   | 14 --
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index 8804753..f120c6a 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -18,6 +18,7 @@ struct blk_trace {
unsigned long __percpu *sequence;
unsigned char __percpu *msg_data;
u16 act_mask;
+   bool enable_cg_info;
u64 start_lba;
u64 end_lba;
u32 pid;
@@ -102,6 +103,7 @@ static inline int blk_trace_init_sysfs(struct device *dev)
 struct compat_blk_user_trace_setup {
char name[BLKTRACE_BDEV_SIZE];
u16 act_mask;
+   u8  enable_cg_info;
u32 buf_size;
u32 buf_nr;
compat_u64 start_lba;
diff --git a/include/uapi/linux/blktrace_api.h 
b/include/uapi/linux/blktrace_api.h
index 20d1490d..d9d9fca 100644
--- a/include/uapi/linux/blktrace_api.h
+++ b/include/uapi/linux/blktrace_api.h
@@ -136,6 +136,7 @@ enum {
 struct blk_user_trace_setup {
char name[BLKTRACE_BDEV_SIZE];  /* output */
__u16 act_mask; /* input */
+   __u8  enable_cg_info;   /* input */
__u32 buf_size; /* input */
__u32 buf_nr;   /* input */
__u64 start_lba;
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 987d9a9a..f420400 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -180,7 +180,8 @@ void __trace_note_message(struct blk_trace *bt, struct 
blkcg *blkcg,
n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
va_end(args);
 
-   if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
+   if (!((blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP) ||
+ bt->enable_cg_info))
blkcg = NULL;
 #ifdef CONFIG_BLK_CGROUP
trace_note(bt, 0, BLK_TN_MESSAGE, buf, n,
@@ -549,6 +550,7 @@ static int do_blk_trace_setup(struct request_queue *q, char 
*name, dev_t dev,
bt->act_mask = buts->act_mask;
if (!bt->act_mask)
bt->act_mask = (u16) -1;
+   bt->enable_cg_info = buts->enable_cg_info;
 
blk_trace_setup_lba(bt, bdev);
 
@@ -625,6 +627,7 @@ static int compat_blk_trace_setup(struct request_queue *q, 
char *name,
 
buts = (struct blk_user_trace_setup) {
.act_mask = cbuts.act_mask,
+   .enable_cg_info = cbuts.enable_cg_info,
.buf_size = cbuts.buf_size,
.buf_nr = cbuts.buf_nr,
.start_lba = cbuts.start_lba,
@@ -773,7 +776,8 @@ blk_trace_bio_get_cgid(struct request_queue *q, struct bio 
*bio)
 {
struct blk_trace *bt = q->blk_trace;
 
-   if (!bt || !(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
+   if (!(bt && (bt->enable_cg_info ||
+(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP
return NULL;
 
if (!bio->bi_css)
@@ -1664,6 +1668,7 @@ static BLK_TRACE_DEVICE_ATTR(act_mask);
 static BLK_TRACE_DEVICE_ATTR(pid);
 static BLK_TRACE_DEVICE_ATTR(start_lba);
 static BLK_TRACE_DEVICE_ATTR(end_lba);
+static BLK_TRACE_DEVICE_ATTR(cgroup_info);
 
 static struct attribute *blk_trace_attrs[] = {
&dev_attr_enable.attr,
@@ -1671,6 +1676,7 @@ static struct attribute *blk_trace_attrs[] = {
&dev_attr_pid.attr,
&dev_attr_start_lba.attr,
&dev_attr_end_lba.attr,
+   &dev_attr_cgroup_info.attr,
NULL
 };
 
@@ -1794,6 +1800,8 @@ static ssize_t sysfs_blk_trace_attr_show(struct device 
*dev,
ret = sprintf(buf, "%llu\n", q->blk_trace->start_lba);
else if (attr == &dev_attr_end_lba)
ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba);
+   else if (attr == &dev_attr_cgroup_info)
+   ret = sprintf(buf, "%u\n", q->blk_trace->enable_cg_info);
 
 out_unlock_bdev:
mutex_unlock(&q->blk_trace_mutex);
@@ -1861,6 +1869,8 @@ static ssize_t sysfs_blk_trace_attr_store(struct device 
*dev,
q->blk_trace->start_lba = value;
else if (attr == &dev_attr_end_lba)
q->blk_trace->end_lba = value;
+   else if (attr == &dev_attr_cgroup_info)
+   q->blk_trace->enable_cg_info = !!value;
}
 
 out_unlock_bdev:
-- 
2.9.5

Re: [for-4.16 PATCH v3 3/3] dm: fix awkward and incomplete request_queue initialization

2018-01-10 Thread Ming Lei

On Wed, Jan 10, 2018 at 09:12:56PM -0500, Mike Snitzer wrote:
> DM is now no longer prone to having its request_queue be improperly
> initialized.
> 
> Summary of changes:
> 
> - defer DM's blk_register_queue() from add_disk()-time until
>   dm_setup_md_queue() by setting QUEUE_FLAG_DEFER_REG in alloc_dev().
> 
> - dm_setup_md_queue() is updated to fully initialize DM's request_queue
>   (_after_ all table loads have occurred and the request_queue's type,
>   features and limits are known).
> 
> - various other small improvements that were noticed along the way.
> 
> A very welcome side-effect of these changes is DM no longer needs to:
> 1) backfill the "mq" sysfs entry (because historically DM didn't
> initialize the request_queue to use blk-mq until _after_
> register_queue() was called via add_disk()).
> 2) call elv_register_queue() to get .request_fn request-based DM
> device's "queue" exposed in syfs.
> 
> In addition, blk-mq debugfs support is now made available because
> request-based DM's blk-mq request_queue is now properly initialized
> before blk_register_queue() is called.
> 
> These changes also stave off the need to introduce new DM-specific
> workarounds in block core, e.g. this proposal:
> https://patchwork.kernel.org/patch/10067961/
> 
> In the end DM devices should be less unicorn in nature (relative to
> initialization and availability of block core infrastructure).
> 
> Signed-off-by: Mike Snitzer 
> ---
>  drivers/md/dm-core.h |  2 --
>  drivers/md/dm-rq.c   | 11 ---
>  drivers/md/dm.c  | 44 ++--
>  3 files changed, 26 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 6a14f945783c..f955123b4765 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -130,8 +130,6 @@ struct mapped_device {
>   struct srcu_struct io_barrier;
>  };
>  
> -void dm_init_md_queue(struct mapped_device *md);
> -void dm_init_normal_md_queue(struct mapped_device *md);
>  int md_in_flight(struct mapped_device *md);
>  void disable_write_same(struct mapped_device *md);
>  void disable_write_zeroes(struct mapped_device *md);
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 9d32f25489c2..3b319776d80c 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -700,7 +700,6 @@ int dm_old_init_request_queue(struct mapped_device *md, 
> struct dm_table *t)
>   /* disable dm_old_request_fn's merge heuristic by default */
>   md->seq_rq_merge_deadline_usecs = 0;
>  
> - dm_init_normal_md_queue(md);
>   blk_queue_softirq_done(md->queue, dm_softirq_done);
>  
>   /* Initialize the request-based DM worker thread */
> @@ -713,8 +712,6 @@ int dm_old_init_request_queue(struct mapped_device *md, 
> struct dm_table *t)
>   return error;
>   }
>  
> - elv_register_queue(md->queue);
> -
>   return 0;
>  }
>  
> @@ -810,17 +807,9 @@ int dm_mq_init_request_queue(struct mapped_device *md, 
> struct dm_table *t)
>   err = PTR_ERR(q);
>   goto out_tag_set;
>   }
> - dm_init_md_queue(md);
> -
> - /* backfill 'mq' sysfs registration normally done in blk_register_queue 
> */
> - err = blk_mq_register_dev(disk_to_dev(md->disk), q);
> - if (err)
> - goto out_cleanup_queue;
>  
>   return 0;
>  
> -out_cleanup_queue:
> - blk_cleanup_queue(q);
>  out_tag_set:
>   blk_mq_free_tag_set(md->tag_set);
>  out_kfree_tag_set:
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 7475739fee49..f5d61b6adaec 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1626,20 +1626,9 @@ static const struct dax_operations dm_dax_ops;
>  
>  static void dm_wq_work(struct work_struct *work);
>  
> -void dm_init_md_queue(struct mapped_device *md)
> -{
> - /*
> -  * Initialize data that will only be used by a non-blk-mq DM queue
> -  * - must do so here (in alloc_dev callchain) before queue is used
> -  */
> - md->queue->queuedata = md;
> - md->queue->backing_dev_info->congested_data = md;
> -}
> -
> -void dm_init_normal_md_queue(struct mapped_device *md)
> +static void dm_init_normal_md_queue(struct mapped_device *md)
>  {
>   md->use_blk_mq = false;
> - dm_init_md_queue(md);
>  
>   /*
>* Initialize aspects of queue that aren't relevant for blk-mq
> @@ -1734,10 +1723,15 @@ static struct mapped_device *alloc_dev(int minor)
>   md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id);
>   if (!md->queue)
>   goto bad;
> + md->queue->queuedata = md;
> + md->queue->backing_dev_info->congested_data = md;
> + /*
> +  * Do not allow add_disk() to blk_register_queue().
> +  * Defer blk_register_queue() until dm_setup_md_queue().
> +  */
> + queue_flag_set_unlocked(QUEUE_FLAG_DEFER_REG, md->queue);
>  
> - dm_init_md_queue(md);
> -
> - md->disk = alloc_disk_node(1, numa_node_id);
> + md->disk = alloc_disk_node

Re: [for-4.16 PATCH v3 2/3] block: allow gendisk's request_queue registration to be deferred

2018-01-10 Thread Ming Lei

On Wed, Jan 10, 2018 at 09:12:55PM -0500, Mike Snitzer wrote:
> Since I can remember DM has forced the block layer to allow the
> allocation and initialization of the request_queue to be distinct
> operations.  Reason for this is block/genhd.c:add_disk() has requires
> that the request_queue (and associated bdi) be tied to the gendisk
> before add_disk() is called -- because add_disk() also deals with
> exposing the request_queue via blk_register_queue().
> 
> DM's dynamic creation of arbitrary device types (and associated
> request_queue types) requires the DM device's gendisk be available so
> that DM table loads can establish a master/slave relationship with
> subordinate devices that are referenced by loaded DM tables -- using
> bd_link_disk_holder().  But until these DM tables, and their associated
> subordinate devices, are known DM cannot know what type of request_queue
> it needs -- nor what its queue_limits should be.
> 
> This chicken and egg scenario has created all manner of problems for DM
> and, at times, the block layer.
> 
> Summary of changes:
> 
> - Add QUEUE_FLAG_DEFER_REG that a block driver can set to defer the
>   required blk_register_queue() until the block driver's request_queue
>   is fully initialized.
> 
> - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
>   is not set.  It won't be set if a request_queue with
>   QUEUE_FLAG_DEFER_REG set never called blk_register_queue() -- as can
>   happen if a driver encounters an error and must del_gendisk() before
>   calling blk_register_queue().
> 
> - Export blk_register_queue().
> 
> These changes allow DM to use device_add_disk() to anchor its gendisk as
> the "master" for master/slave relationships DM must establish with
> subordinate devices referenced in DM tables that get loaded.  Once all
> "slave" devices for a DM device are known a request_queue can be
> properly initialized and then advertised via sysfs -- important
> improvement being that no request_queue resource initialization is
> missed -- before these changes DM was known to be missing blk-mq debugfs
> and proper block throttle initialization.
> 
> Signed-off-by: Mike Snitzer 
> ---
>  block/blk-sysfs.c  | 4 
>  block/genhd.c  | 4 +++-
>  include/linux/blkdev.h | 1 +
>  3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 870484eaed1f..2395122875b4 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -921,6 +921,7 @@ int blk_register_queue(struct gendisk *disk)
>   mutex_unlock(&q->sysfs_lock);
>   return ret;
>  }
> +EXPORT_SYMBOL_GPL(blk_register_queue);
>  
>  void blk_unregister_queue(struct gendisk *disk)
>  {
> @@ -929,6 +930,9 @@ void blk_unregister_queue(struct gendisk *disk)
>   if (WARN_ON(!q))
>   return;
>  
> + if (!test_bit(QUEUE_FLAG_REGISTERED, &disk->queue->queue_flags))
> + return;
> +
>   mutex_lock(&q->sysfs_lock);
>   queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
>   mutex_unlock(&q->sysfs_lock);
> diff --git a/block/genhd.c b/block/genhd.c
> index 00620e01e043..3912a82ecc4f 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -682,7 +682,6 @@ void device_add_disk(struct device *parent, struct 
> gendisk *disk)
>   exact_match, exact_lock, disk);
>   }
>   register_disk(parent, disk);
> - blk_register_queue(disk);
>  
>   /*
>* Take an extra ref on queue which will be put on disk_release()
> @@ -692,6 +691,9 @@ void device_add_disk(struct device *parent, struct 
> gendisk *disk)
>  
>   disk_add_events(disk);
>   blk_integrity_add(disk);
> +
> + if (!test_bit(QUEUE_FLAG_DEFER_REG, &disk->queue->queue_flags))
> + blk_register_queue(disk);
>  }
>  EXPORT_SYMBOL(device_add_disk);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 71a9371c8182..84d144c7b025 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -681,6 +681,7 @@ struct request_queue {
>  #define QUEUE_FLAG_SCSI_PASSTHROUGH 27   /* queue supports SCSI commands 
> */
>  #define QUEUE_FLAG_QUIESCED28/* queue has been quiesced */
>  #define QUEUE_FLAG_PREEMPT_ONLY  29  /* only process REQ_PREEMPT 
> requests */
> +#define QUEUE_FLAG_DEFER_REG 30  /* defer registering queue to a disk */
>  
>  #define QUEUE_FLAG_DEFAULT   ((1 << QUEUE_FLAG_IO_STAT) |\
>(1 << QUEUE_FLAG_SAME_COMP)|   \
> -- 
> 2.15.0

This way is safe for all other devices, even for DM it is safe too since
block does deal with NULL .make_request_fn well for legacy path.

Maybe QUEUE_FLAG_DEFER_REG can be renamed as QUEUE_FLAG_REG_BY_DRIVER,
but not a big deal, so

Reviewed-by: Ming Lei 

-- 
Ming

Re: [for-4.16 PATCH v3 1/3] block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN

2018-01-10 Thread Ming Lei

On Wed, Jan 10, 2018 at 09:12:54PM -0500, Mike Snitzer wrote:
> device_add_disk() will only call bdi_register_owner() if
> !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
> bdi_unregister() if !GENHD_FL_HIDDEN.
> 
> Found with code inspection.  bdi_unregister() won't do any harm if
> bdi_register_owner() wasn't used but best to avoid the unnecessary
> call to bdi_unregister().
> 
> Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
> Signed-off-by: Mike Snitzer 
> ---
>  block/genhd.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 96a66f671720..00620e01e043 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -725,7 +725,8 @@ void del_gendisk(struct gendisk *disk)
>* Unregister bdi before releasing device numbers (as they can
>* get reused and we'd get clashes in sysfs).
>*/
> - bdi_unregister(disk->queue->backing_dev_info);
> + if (!(disk->flags & GENHD_FL_HIDDEN))
> + bdi_unregister(disk->queue->backing_dev_info);
>   blk_unregister_queue(disk);
>   } else {
>   WARN_ON(1);
> -- 
> 2.15.0
> 

Reviewed-by: Ming Lei 

-- 
Ming

Re: [PATCH v2 3/4] scsi: Avoid that .queuecommand() gets called for a quiesced SCSI device

2018-01-10 Thread Ming Lei

On Wed, Jan 10, 2018 at 10:18:16AM -0800, Bart Van Assche wrote:
> Several SCSI transport and LLD drivers surround code that does not
> tolerate concurrent calls of .queuecommand() with scsi_target_block() /
> scsi_target_unblock(). These last two functions use
> blk_mq_quiesce_queue() / blk_mq_unquiesce_queue() for scsi-mq request
> queues to prevent concurrent .queuecommand() calls. However, that is

Actually blk_mq_quiesce_queue() is supposed to disable and drain dispatch,
not for prevent concurrent .queuecommand() calls.

> not sufficient to prevent .queuecommand() calls from scsi_send_eh_cmnd().

Given it is error handling, do we need to prevent the .queuecommand() call
in scsi_send_eh_cmnd()? Could you share us what the actual issue
observed is from user view?

> Hence surround the .queuecommand() call from the SCSI error handler with
> blk_start_wait_if_quiesced() / blk_finish_wait_if_quiesced().
> 
> Note: converting the .queuecommand() call in scsi_send_eh_cmnd() into
> code that calls blk_get_request(), e.g. scsi_execute_req(), is not an
> option since scsi_send_eh_cmnd() can be called if all requests are
> allocated and if no requests will make progress without aborting any
> of these requests.

If we need to prevent the .queuecommand() in scsi_send_eh_cmnd(), what
do you think of the approach by requeuing the EH command via
scsi_mq_requeue_cmd()/scsi_requeue_cmd()? And the EH request will be
dispatched finally when the queue becomes unquiesced or the STOPPED
is cleared.

-- 
Ming

[for-4.16 PATCH v3 0/3] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-10 Thread Mike Snitzer

Hi Jens,

I eliminated my implementation that set disk->queue = NULL before
calling add_disk().  As we discussed it left way too much potential
for NULL pointer crashes and I agree it was too fragile.

This v3's approach is much simpler.  It adjusts block core so that
blk_register_queue() can be deferred (so add_disk()'s call is avoided
if QUEUE_FLAG_DEFER_REG is set, and a driver must then call it once it
has completed initializing its request_queue).

PATCH 1 is just an unrelated fix.  Christoph agreed with it in reply
to my v2 submission (but he didn't provide a Reviewed-by).  Anyway,
I've revised the header to have it make more sense.

If these changes look reasonable I'd prefer that you pick them all up
for 4.16 (last DM patch included because it'll save me an awkward
dm-4.16 rebase, etc).

Thanks!
Mike

Mike Snitzer (3):
  block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN
  block: allow gendisk's request_queue registration to be deferred
  dm: fix awkward and incomplete request_queue initialization

 block/blk-sysfs.c  |  4 
 block/genhd.c  |  7 +--
 drivers/md/dm-core.h   |  2 --
 drivers/md/dm-rq.c | 11 ---
 drivers/md/dm.c| 44 ++--
 include/linux/blkdev.h |  1 +
 6 files changed, 36 insertions(+), 33 deletions(-)

-- 
2.15.0

[for-4.16 PATCH v3 3/3] dm: fix awkward and incomplete request_queue initialization

2018-01-10 Thread Mike Snitzer

DM is now no longer prone to having its request_queue be improperly
initialized.

Summary of changes:

- defer DM's blk_register_queue() from add_disk()-time until
  dm_setup_md_queue() by setting QUEUE_FLAG_DEFER_REG in alloc_dev().

- dm_setup_md_queue() is updated to fully initialize DM's request_queue
  (_after_ all table loads have occurred and the request_queue's type,
  features and limits are known).

- various other small improvements that were noticed along the way.

A very welcome side-effect of these changes is DM no longer needs to:
1) backfill the "mq" sysfs entry (because historically DM didn't
initialize the request_queue to use blk-mq until _after_
register_queue() was called via add_disk()).
2) call elv_register_queue() to get .request_fn request-based DM
device's "queue" exposed in syfs.

In addition, blk-mq debugfs support is now made available because
request-based DM's blk-mq request_queue is now properly initialized
before blk_register_queue() is called.

These changes also stave off the need to introduce new DM-specific
workarounds in block core, e.g. this proposal:
https://patchwork.kernel.org/patch/10067961/

In the end DM devices should be less unicorn in nature (relative to
initialization and availability of block core infrastructure).

Signed-off-by: Mike Snitzer 
---
 drivers/md/dm-core.h |  2 --
 drivers/md/dm-rq.c   | 11 ---
 drivers/md/dm.c  | 44 ++--
 3 files changed, 26 insertions(+), 31 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 6a14f945783c..f955123b4765 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -130,8 +130,6 @@ struct mapped_device {
struct srcu_struct io_barrier;
 };
 
-void dm_init_md_queue(struct mapped_device *md);
-void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
 void disable_write_same(struct mapped_device *md);
 void disable_write_zeroes(struct mapped_device *md);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 9d32f25489c2..3b319776d80c 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -700,7 +700,6 @@ int dm_old_init_request_queue(struct mapped_device *md, 
struct dm_table *t)
/* disable dm_old_request_fn's merge heuristic by default */
md->seq_rq_merge_deadline_usecs = 0;
 
-   dm_init_normal_md_queue(md);
blk_queue_softirq_done(md->queue, dm_softirq_done);
 
/* Initialize the request-based DM worker thread */
@@ -713,8 +712,6 @@ int dm_old_init_request_queue(struct mapped_device *md, 
struct dm_table *t)
return error;
}
 
-   elv_register_queue(md->queue);
-
return 0;
 }
 
@@ -810,17 +807,9 @@ int dm_mq_init_request_queue(struct mapped_device *md, 
struct dm_table *t)
err = PTR_ERR(q);
goto out_tag_set;
}
-   dm_init_md_queue(md);
-
-   /* backfill 'mq' sysfs registration normally done in blk_register_queue 
*/
-   err = blk_mq_register_dev(disk_to_dev(md->disk), q);
-   if (err)
-   goto out_cleanup_queue;
 
return 0;
 
-out_cleanup_queue:
-   blk_cleanup_queue(q);
 out_tag_set:
blk_mq_free_tag_set(md->tag_set);
 out_kfree_tag_set:
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7475739fee49..f5d61b6adaec 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1626,20 +1626,9 @@ static const struct dax_operations dm_dax_ops;
 
 static void dm_wq_work(struct work_struct *work);
 
-void dm_init_md_queue(struct mapped_device *md)
-{
-   /*
-* Initialize data that will only be used by a non-blk-mq DM queue
-* - must do so here (in alloc_dev callchain) before queue is used
-*/
-   md->queue->queuedata = md;
-   md->queue->backing_dev_info->congested_data = md;
-}
-
-void dm_init_normal_md_queue(struct mapped_device *md)
+static void dm_init_normal_md_queue(struct mapped_device *md)
 {
md->use_blk_mq = false;
-   dm_init_md_queue(md);
 
/*
 * Initialize aspects of queue that aren't relevant for blk-mq
@@ -1734,10 +1723,15 @@ static struct mapped_device *alloc_dev(int minor)
md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id);
if (!md->queue)
goto bad;
+   md->queue->queuedata = md;
+   md->queue->backing_dev_info->congested_data = md;
+   /*
+* Do not allow add_disk() to blk_register_queue().
+* Defer blk_register_queue() until dm_setup_md_queue().
+*/
+   queue_flag_set_unlocked(QUEUE_FLAG_DEFER_REG, md->queue);
 
-   dm_init_md_queue(md);
-
-   md->disk = alloc_disk_node(1, numa_node_id);
+   md->disk = alloc_disk_node(1, md->numa_node_id);
if (!md->disk)
goto bad;
 
@@ -1962,13 +1956,18 @@ static struct dm_table *__unbind(struct mapped_device 
*md)
  */
 int dm_create(int minor, struct mapped_device **result)
 {
+   int r;
st

[for-4.16 PATCH v3 2/3] block: allow gendisk's request_queue registration to be deferred

2018-01-10 Thread Mike Snitzer

Since I can remember DM has forced the block layer to allow the
allocation and initialization of the request_queue to be distinct
operations.  Reason for this is block/genhd.c:add_disk() has requires
that the request_queue (and associated bdi) be tied to the gendisk
before add_disk() is called -- because add_disk() also deals with
exposing the request_queue via blk_register_queue().

DM's dynamic creation of arbitrary device types (and associated
request_queue types) requires the DM device's gendisk be available so
that DM table loads can establish a master/slave relationship with
subordinate devices that are referenced by loaded DM tables -- using
bd_link_disk_holder().  But until these DM tables, and their associated
subordinate devices, are known DM cannot know what type of request_queue
it needs -- nor what its queue_limits should be.

This chicken and egg scenario has created all manner of problems for DM
and, at times, the block layer.

Summary of changes:

- Add QUEUE_FLAG_DEFER_REG that a block driver can set to defer the
  required blk_register_queue() until the block driver's request_queue
  is fully initialized.

- Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
  is not set.  It won't be set if a request_queue with
  QUEUE_FLAG_DEFER_REG set never called blk_register_queue() -- as can
  happen if a driver encounters an error and must del_gendisk() before
  calling blk_register_queue().

- Export blk_register_queue().

These changes allow DM to use device_add_disk() to anchor its gendisk as
the "master" for master/slave relationships DM must establish with
subordinate devices referenced in DM tables that get loaded.  Once all
"slave" devices for a DM device are known a request_queue can be
properly initialized and then advertised via sysfs -- important
improvement being that no request_queue resource initialization is
missed -- before these changes DM was known to be missing blk-mq debugfs
and proper block throttle initialization.

Signed-off-by: Mike Snitzer 
---
 block/blk-sysfs.c  | 4 
 block/genhd.c  | 4 +++-
 include/linux/blkdev.h | 1 +
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 870484eaed1f..2395122875b4 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -921,6 +921,7 @@ int blk_register_queue(struct gendisk *disk)
mutex_unlock(&q->sysfs_lock);
return ret;
 }
+EXPORT_SYMBOL_GPL(blk_register_queue);
 
 void blk_unregister_queue(struct gendisk *disk)
 {
@@ -929,6 +930,9 @@ void blk_unregister_queue(struct gendisk *disk)
if (WARN_ON(!q))
return;
 
+   if (!test_bit(QUEUE_FLAG_REGISTERED, &disk->queue->queue_flags))
+   return;
+
mutex_lock(&q->sysfs_lock);
queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
mutex_unlock(&q->sysfs_lock);
diff --git a/block/genhd.c b/block/genhd.c
index 00620e01e043..3912a82ecc4f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -682,7 +682,6 @@ void device_add_disk(struct device *parent, struct gendisk 
*disk)
exact_match, exact_lock, disk);
}
register_disk(parent, disk);
-   blk_register_queue(disk);
 
/*
 * Take an extra ref on queue which will be put on disk_release()
@@ -692,6 +691,9 @@ void device_add_disk(struct device *parent, struct gendisk 
*disk)
 
disk_add_events(disk);
blk_integrity_add(disk);
+
+   if (!test_bit(QUEUE_FLAG_DEFER_REG, &disk->queue->queue_flags))
+   blk_register_queue(disk);
 }
 EXPORT_SYMBOL(device_add_disk);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 71a9371c8182..84d144c7b025 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -681,6 +681,7 @@ struct request_queue {
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED28  /* queue has been quiesced */
 #define QUEUE_FLAG_PREEMPT_ONLY29  /* only process REQ_PREEMPT 
requests */
+#define QUEUE_FLAG_DEFER_REG   30  /* defer registering queue to a disk */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_SAME_COMP)|   \
-- 
2.15.0

[for-4.16 PATCH v3 1/3] block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN

2018-01-10 Thread Mike Snitzer

device_add_disk() will only call bdi_register_owner() if
!GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
bdi_unregister() if !GENHD_FL_HIDDEN.

Found with code inspection.  bdi_unregister() won't do any harm if
bdi_register_owner() wasn't used but best to avoid the unnecessary
call to bdi_unregister().

Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
Signed-off-by: Mike Snitzer 
---
 block/genhd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 96a66f671720..00620e01e043 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -725,7 +725,8 @@ void del_gendisk(struct gendisk *disk)
 * Unregister bdi before releasing device numbers (as they can
 * get reused and we'd get clashes in sysfs).
 */
-   bdi_unregister(disk->queue->backing_dev_info);
+   if (!(disk->flags & GENHD_FL_HIDDEN))
+   bdi_unregister(disk->queue->backing_dev_info);
blk_unregister_queue(disk);
} else {
WARN_ON(1);
-- 
2.15.0

[PATCH v3 0/2] Rework blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche

Hello Jens,

This patch series reworks the blk_mq_mark_tag_wait() implementation and also
fixes a race condition in that function. Please consider these two patches for
kernel v4.16.

Thanks,

Bart.

Changes compared to v3:
- Reworked patch 1/2 such that it uses if (...) ...; ... instead of
  if (...) ... else ... as proposed by Jens.

Changes compared to v1:
- Split a single patch into two patches to make reviewing easier.
- The race fix does no longer use prepare_to_wait() / finish_wait().

Bart Van Assche (2):
  blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()
  blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

 block/blk-mq.c | 72 +-
 1 file changed, 36 insertions(+), 36 deletions(-)

-- 
2.15.1

[PATCH v3 1/2] blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche

This patch does not change any functionality but makes the
blk_mq_mark_tag_wait() code slightly easier to read.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq.c | 69 +-
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1d705e25852e..f27bcb6f6751 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1181,58 +1181,59 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
 struct request *rq)
 {
struct blk_mq_hw_ctx *this_hctx = *hctx;
-   bool shared_tags = (this_hctx->flags & BLK_MQ_F_TAG_SHARED) != 0;
struct sbq_wait_state *ws;
wait_queue_entry_t *wait;
bool ret;
 
-   if (!shared_tags) {
+   if (!(this_hctx->flags & BLK_MQ_F_TAG_SHARED)) {
if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
set_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state);
-   } else {
-   wait = &this_hctx->dispatch_wait;
-   if (!list_empty_careful(&wait->entry))
-   return false;
 
-   spin_lock(&this_hctx->lock);
-   if (!list_empty(&wait->entry)) {
-   spin_unlock(&this_hctx->lock);
-   return false;
-   }
+   /*
+* It's possible that a tag was freed in the window between the
+* allocation failure and adding the hardware queue to the wait
+* queue.
+*
+* Don't clear RESTART here, someone else could have set it.
+* At most this will cost an extra queue run.
+*/
+   return blk_mq_get_driver_tag(rq, hctx, false);
+   }
+
+   wait = &this_hctx->dispatch_wait;
+   if (!list_empty_careful(&wait->entry))
+   return false;
 
-   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
-   add_wait_queue(&ws->wait, wait);
+   spin_lock(&this_hctx->lock);
+   if (!list_empty(&wait->entry)) {
+   spin_unlock(&this_hctx->lock);
+   return false;
}
 
+   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
+   add_wait_queue(&ws->wait, wait);
+
/*
 * It's possible that a tag was freed in the window between the
 * allocation failure and adding the hardware queue to the wait
 * queue.
 */
ret = blk_mq_get_driver_tag(rq, hctx, false);
-
-   if (!shared_tags) {
-   /*
-* Don't clear RESTART here, someone else could have set it.
-* At most this will cost an extra queue run.
-*/
-   return ret;
-   } else {
-   if (!ret) {
-   spin_unlock(&this_hctx->lock);
-   return false;
-   }
-
-   /*
-* We got a tag, remove ourselves from the wait queue to ensure
-* someone else gets the wakeup.
-*/
-   spin_lock_irq(&ws->wait.lock);
-   list_del_init(&wait->entry);
-   spin_unlock_irq(&ws->wait.lock);
+   if (!ret) {
spin_unlock(&this_hctx->lock);
-   return true;
+   return false;
}
+
+   /*
+* We got a tag, remove ourselves from the wait queue to ensure
+* someone else gets the wakeup.
+*/
+   spin_lock_irq(&ws->wait.lock);
+   list_del_init(&wait->entry);
+   spin_unlock_irq(&ws->wait.lock);
+   spin_unlock(&this_hctx->lock);
+
+   return true;
 }
 
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
-- 
2.15.1

[PATCH v3 2/2] blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche

Both add_wait_queue() and blk_mq_dispatch_wake() protect wait queue
manipulations with the wait queue lock. Hence also protect the
!list_empty(&wait->entry) test with the wait queue lock.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f27bcb6f6751..bbadbd5c8003 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1183,7 +1183,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
struct blk_mq_hw_ctx *this_hctx = *hctx;
struct sbq_wait_state *ws;
wait_queue_entry_t *wait;
-   bool ret;
+   bool on_wait_list, ret;
 
if (!(this_hctx->flags & BLK_MQ_F_TAG_SHARED)) {
if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
@@ -1204,13 +1204,15 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
if (!list_empty_careful(&wait->entry))
return false;
 
-   spin_lock(&this_hctx->lock);
-   if (!list_empty(&wait->entry)) {
-   spin_unlock(&this_hctx->lock);
+   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
+
+   spin_lock_irq(&ws->wait.lock);
+   on_wait_list = !list_empty(&wait->entry);
+   spin_unlock_irq(&ws->wait.lock);
+
+   if (on_wait_list)
return false;
-   }
 
-   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
add_wait_queue(&ws->wait, wait);
 
/*
@@ -1219,10 +1221,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
 * queue.
 */
ret = blk_mq_get_driver_tag(rq, hctx, false);
-   if (!ret) {
-   spin_unlock(&this_hctx->lock);
+   if (!ret)
return false;
-   }
 
/*
 * We got a tag, remove ourselves from the wait queue to ensure
@@ -1231,7 +1231,6 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
spin_lock_irq(&ws->wait.lock);
list_del_init(&wait->entry);
spin_unlock_irq(&ws->wait.lock);
-   spin_unlock(&this_hctx->lock);
 
return true;
 }
-- 
2.15.1

Re: [PATCH v2 1/2] blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche

On Wed, 2018-01-10 at 13:30 -0700, Jens Axboe wrote:
> On 1/10/18 12:39 PM, Bart Van Assche wrote:
> > This patch does not change any functionality but makes the
> > blk_mq_mark_tag_wait() code slightly easier to read.
> 
> I agree it could do with a cleanup, but how about something like the
> below? I think that's easier to read.
> 
> [ ... ]

Hello Jens,

That sounds like a good idea to me. I will update the patch series, retest
it and repost.

Bart.

Re: [PATCH v2 1/2] blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()

2018-01-10 Thread Jens Axboe

On 1/10/18 12:39 PM, Bart Van Assche wrote:
> This patch does not change any functionality but makes the
> blk_mq_mark_tag_wait() code slightly easier to read.

I agree it could do with a cleanup, but how about something like the
below? I think that's easier to read.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8000ba6db07d..afccd0848d6f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1104,58 +1104,59 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
 struct request *rq)
 {
struct blk_mq_hw_ctx *this_hctx = *hctx;
-   bool shared_tags = (this_hctx->flags & BLK_MQ_F_TAG_SHARED) != 0;
struct sbq_wait_state *ws;
wait_queue_entry_t *wait;
bool ret;
 
-   if (!shared_tags) {
+   if (!(this_hctx->flags & BLK_MQ_F_TAG_SHARED)) {
if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
set_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state);
-   } else {
-   wait = &this_hctx->dispatch_wait;
-   if (!list_empty_careful(&wait->entry))
-   return false;
 
-   spin_lock(&this_hctx->lock);
-   if (!list_empty(&wait->entry)) {
-   spin_unlock(&this_hctx->lock);
-   return false;
-   }
+   /*
+* It's possible that a tag was freed in the window between the
+* allocation failure and adding the hardware queue to the wait
+* queue.
+*
+* Don't clear RESTART here, someone else could have set it.
+* At most this will cost an extra queue run.
+*/
+   return blk_mq_get_driver_tag(rq, hctx, false);
+   }
+
+   wait = &this_hctx->dispatch_wait;
+   if (!list_empty_careful(&wait->entry))
+   return false;
 
-   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
-   add_wait_queue(&ws->wait, wait);
+   spin_lock(&this_hctx->lock);
+   if (!list_empty(&wait->entry)) {
+   spin_unlock(&this_hctx->lock);
+   return false;
}
 
+   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
+   add_wait_queue(&ws->wait, wait);
+
/*
 * It's possible that a tag was freed in the window between the
 * allocation failure and adding the hardware queue to the wait
 * queue.
 */
ret = blk_mq_get_driver_tag(rq, hctx, false);
-
-   if (!shared_tags) {
-   /*
-* Don't clear RESTART here, someone else could have set it.
-* At most this will cost an extra queue run.
-*/
-   return ret;
-   } else {
-   if (!ret) {
-   spin_unlock(&this_hctx->lock);
-   return false;
-   }
-
-   /*
-* We got a tag, remove ourselves from the wait queue to ensure
-* someone else gets the wakeup.
-*/
-   spin_lock_irq(&ws->wait.lock);
-   list_del_init(&wait->entry);
-   spin_unlock_irq(&ws->wait.lock);
+   if (!ret) {
spin_unlock(&this_hctx->lock);
-   return true;
+   return false;
}
+
+   /*
+* We got a tag, remove ourselves from the wait queue to ensure
+* someone else gets the wakeup.
+*/
+   spin_lock_irq(&ws->wait.lock);
+   list_del_init(&wait->entry);
+   spin_unlock_irq(&ws->wait.lock);
+   spin_unlock(&this_hctx->lock);
+
+   return true;
 }
 
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,

-- 
Jens Axboe

[PATCH v2 0/2] Rework blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche

Hello Jens,

This patch series reworks the blk_mq_mark_tag_wait() implementation and also
fixes a race condition in that function. Please consider these two patches for
kernel v4.16.

Thanks,

Bart.

Changes compared to v1:
- Split a single patch into two patches to make reviewing easier.
- The race fix does no longer use prepare_to_wait() / finish_wait().

Bart Van Assche (2):
  blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()
  blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

 block/blk-mq.c | 49 -
 1 file changed, 24 insertions(+), 25 deletions(-)

-- 
2.15.1

[PATCH v2 1/2] blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche

This patch does not change any functionality but makes the
blk_mq_mark_tag_wait() code slightly easier to read.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1d705e25852e..e770e8814f60 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1189,6 +1189,16 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
if (!shared_tags) {
if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
set_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state);
+   /*
+* It's possible that a tag was freed in the window between the
+* allocation failure and adding the hardware queue to the wait
+* queue.
+*/
+   ret = blk_mq_get_driver_tag(rq, hctx, false);
+   /*
+* Don't clear RESTART here, someone else could have set it.
+* At most this will cost an extra queue run.
+*/
} else {
wait = &this_hctx->dispatch_wait;
if (!list_empty_careful(&wait->entry))
@@ -1202,22 +1212,12 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
 
ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
add_wait_queue(&ws->wait, wait);
-   }
-
-   /*
-* It's possible that a tag was freed in the window between the
-* allocation failure and adding the hardware queue to the wait
-* queue.
-*/
-   ret = blk_mq_get_driver_tag(rq, hctx, false);
-
-   if (!shared_tags) {
/*
-* Don't clear RESTART here, someone else could have set it.
-* At most this will cost an extra queue run.
+* It's possible that a tag was freed in the window between the
+* allocation failure and adding the hardware queue to the wait
+* queue.
 */
-   return ret;
-   } else {
+   ret = blk_mq_get_driver_tag(rq, hctx, false);
if (!ret) {
spin_unlock(&this_hctx->lock);
return false;
@@ -1231,8 +1231,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
list_del_init(&wait->entry);
spin_unlock_irq(&ws->wait.lock);
spin_unlock(&this_hctx->lock);
-   return true;
}
+   return ret;
 }
 
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
-- 
2.15.1

[PATCH v2 2/2] blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche

Both add_wait_queue() and blk_mq_dispatch_wake() protect wait queue
manipulations with the wait queue lock. Hence also protect the
!list_empty(&wait->entry) test with the wait queue lock instead of
the hctx lock.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e770e8814f60..d5313ce60836 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1184,7 +1184,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
bool shared_tags = (this_hctx->flags & BLK_MQ_F_TAG_SHARED) != 0;
struct sbq_wait_state *ws;
wait_queue_entry_t *wait;
-   bool ret;
+   bool on_wait_list, ret;
 
if (!shared_tags) {
if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
@@ -1204,13 +1204,15 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
if (!list_empty_careful(&wait->entry))
return false;
 
-   spin_lock(&this_hctx->lock);
-   if (!list_empty(&wait->entry)) {
-   spin_unlock(&this_hctx->lock);
+   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
+
+   spin_lock_irq(&ws->wait.lock);
+   on_wait_list = !list_empty(&wait->entry);
+   spin_unlock_irq(&ws->wait.lock);
+
+   if (on_wait_list)
return false;
-   }
 
-   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
add_wait_queue(&ws->wait, wait);
/*
 * It's possible that a tag was freed in the window between the
@@ -1218,10 +1220,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
 * queue.
 */
ret = blk_mq_get_driver_tag(rq, hctx, false);
-   if (!ret) {
-   spin_unlock(&this_hctx->lock);
+   if (!ret)
return false;
-   }
 
/*
 * We got a tag, remove ourselves from the wait queue to ensure
@@ -1230,7 +1230,6 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
spin_lock_irq(&ws->wait.lock);
list_del_init(&wait->entry);
spin_unlock_irq(&ws->wait.lock);
-   spin_unlock(&this_hctx->lock);
}
return ret;
 }
-- 
2.15.1

Re: [PATCH] blk-mq: Add locking annotations to hctx_lock() and hctx_unlock()

2018-01-10 Thread Jens Axboe

On 1/10/18 12:34 PM, Bart Van Assche wrote:
> This patch avoids that sparse reports the following:
> 
> block/blk-mq.c:637:33: warning: context imbalance in 'hctx_unlock' - 
> unexpected unlock
> block/blk-mq.c:642:9: warning: context imbalance in 'hctx_lock' - wrong count 
> at exit

Thanks Bart, applied.

-- 
Jens Axboe

[PATCH] blk-mq: Add locking annotations to hctx_lock() and hctx_unlock()

2018-01-10 Thread Bart Van Assche

This patch avoids that sparse reports the following:

block/blk-mq.c:637:33: warning: context imbalance in 'hctx_unlock' - unexpected 
unlock
block/blk-mq.c:642:9: warning: context imbalance in 'hctx_lock' - wrong count 
at exit

Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8139d88e78f6..1d705e25852e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -630,6 +630,7 @@ static void __blk_mq_complete_request(struct request *rq)
 }
 
 static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
+   __releases(hctx->srcu)
 {
if (!(hctx->flags & BLK_MQ_F_BLOCKING))
rcu_read_unlock();
@@ -638,6 +639,7 @@ static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int 
srcu_idx)
 }
 
 static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
+   __acquires(hctx->srcu)
 {
if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
/* shut up gcc false positive */
-- 
2.15.1

Re: [PATCH v2] block: silently forbid sending any ioctl to a partition

2018-01-10 Thread Jens Axboe

On 1/10/18 8:54 AM, Paolo Bonzini wrote:
> After the first few months, the message has not led to many bug reports.
> It's been almost five years now, and in practice the main source of
> it seems to be MTIOCGET that someone is using to detect tape devices.
> While we could whitelist it just like CDROM_GET_CAPABILITY, this patch
> just removes the message altogether.
> 
> The patch also removes the "safe but not very useful" ioctl whitelist,
> as suggested by Christoph.  I doubt anything is using most of those
> ioctls _in general_, let alone on a partition.

Applied, thanks.

-- 
Jens Axboe

Re: [PATCH] blk-mq: Rework blk_mq_mark_tag_wait()

2018-01-10 Thread Omar Sandoval

On Wed, Jan 10, 2018 at 10:04:28AM -0800, Bart Van Assche wrote:
> Use prepare_to_wait() and finish_wait() instead of open-coding these
> functions. Reduce the number of if-statements to make
> blk_mq_mark_tag_wait() easier to read. Both add_wait_queue() and
> blk_mq_dispatch_wake() protect wait queue manipulations with the wait
> queue lock. Hence also protect the !list_empty(&wait->entry) test with
> the wait queue lock instead of the hctx lock.
> 
> Note: a side effect of this patch is that the task state is changed
> for shared queues before and after the blk_mq_get_driver_tag().
> Since blk_mq_dispatch_wake() ignores the task state these task state
> changes do not affect which task gets woken up.

I'm not a fan of this. The reason it's open-coded is exactly because we
don't want to mess around with the task state.

> See also commit f906a6a0f426 ("blk-mq: improve tag waiting setup for
> non-shared tags").
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---
>  block/blk-mq.c | 49 +
>  1 file changed, 17 insertions(+), 32 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 61bdbf45c3be..29f140b4dbf7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1121,50 +1121,35 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
> **hctx,
>   if (!shared_tags) {
>   if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
>   set_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state);
> + ret = blk_mq_get_driver_tag(rq, hctx, false);
> + /*
> +  * Don't clear RESTART here, someone else could have set it.
> +  * At most this will cost an extra queue run.
> +  */
>   } else {
>   wait = &this_hctx->dispatch_wait;
>   if (!list_empty_careful(&wait->entry))
>   return false;
>  
> - spin_lock(&this_hctx->lock);
> - if (!list_empty(&wait->entry)) {
> - spin_unlock(&this_hctx->lock);
> - return false;
> - }
> -
>   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
> - add_wait_queue(&ws->wait, wait);
> - }
> -
> - /*
> -  * It's possible that a tag was freed in the window between the
> -  * allocation failure and adding the hardware queue to the wait
> -  * queue.
> -  */
> - ret = blk_mq_get_driver_tag(rq, hctx, false);
> + prepare_to_wait(&ws->wait, wait, TASK_UNINTERRUPTIBLE);
>  
> - if (!shared_tags) {
>   /*
> -  * Don't clear RESTART here, someone else could have set it.
> -  * At most this will cost an extra queue run.
> +  * It's possible that a tag was freed in the window between the
> +  * allocation failure and adding the hardware queue to the wait
> +  * queue.
>*/
> - return ret;
> - } else {
> - if (!ret) {
> - spin_unlock(&this_hctx->lock);
> - return false;
> - }
> -
> + ret = blk_mq_get_driver_tag(rq, hctx, false);
>   /*
> -  * We got a tag, remove ourselves from the wait queue to ensure
> -  * someone else gets the wakeup.
> +  * If we got a tag, remove ourselves from the wait queue to
> +  * ensure someone else gets the wakeup.
>*/
> - spin_lock_irq(&ws->wait.lock);
> - list_del_init(&wait->entry);
> - spin_unlock_irq(&ws->wait.lock);
> - spin_unlock(&this_hctx->lock);
> - return true;
> + if (ret)
> + finish_wait(&ws->wait, wait);
> + else
> + __set_current_state(TASK_RUNNING);
>   }
> + return ret;
>  }
>  
>  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> -- 
> 2.15.1
>

Re: [PATCH 3/4] block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit

2018-01-10 Thread Omar Sandoval

On Wed, Jan 10, 2018 at 06:42:17PM +, Bart Van Assche wrote:
> On Wed, 2018-01-10 at 11:35 -0700, Jens Axboe wrote:
> > On 1/10/18 11:33 AM, Bart Van Assche wrote:
> > > On Wed, 2018-01-10 at 11:32 -0700, Jens Axboe wrote:
> > > > On 1/10/18 11:29 AM, Bart Van Assche wrote:
> > > > > On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> > > > > > @@ -313,8 +307,6 @@ int __blk_mq_debugfs_rq_show(struct seq_file 
> > > > > > *m, struct request *rq)
> > > > > > seq_puts(m, ", .rq_flags=");
> > > > > > blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
> > > > > >ARRAY_SIZE(rqf_name));
> > > > > > -   seq_puts(m, ", .atomic_flags=");
> > > > > > -   blk_flags_show(m, rq->atomic_flags, rqaf_name, 
> > > > > > ARRAY_SIZE(rqaf_name));
> > > > > > seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
> > > > > >rq->internal_tag);
> > > > > > if (mq_ops->show_rq)
> > > > > 
> > > > > Whether or not a request has been marked complete is very useful to 
> > > > > know. Have you
> > > > > considered to show the return value of blk_rq_is_complete() in the 
> > > > > debugfs output?
> > > > 
> > > > Yeah, that's a good point. Let me add that in lieu of the atomic flags 
> > > > that
> > > > are being killed. Are you fine with the patch then?
> > > 
> > > The rest of the patch looks fine to me. This is the only comment I had 
> > > about this patch.
> > 
> > http://git.kernel.dk/cgit/linux-block/commit/?h=blk-kill-atomic-flags&id=3d34009082f5e72137d6bb38cbc2ff6047f03fd1
> > 
> > Added a complete= entry for it.
> 
> For that patch: Reviewed-by: Bart Van Assche 

Also Reviewed-by: Omar Sandoval

Re: [PATCH 4/4] block: rearrange a few request fields for better cache layout

2018-01-10 Thread Jens Axboe

On 1/10/18 11:43 AM, Omar Sandoval wrote:
>> -INIT_LIST_HEAD(&rq->queuelist);
>>  /* csd/requeue_work/fifo_time is initialized before use */
>>  rq->q = data->q;
>>  rq->mq_ctx = data->ctx;
>> +rq->rq_flags = 0;
>> +rq->cpu = -1;
>>  rq->cmd_flags = op;
>>  if (data->flags & BLK_MQ_REQ_PREEMPT)
>>  rq->rq_flags |= RQF_PREEMPT;
>>  if (blk_queue_io_stat(data->q))
>>  rq->rq_flags |= RQF_IO_STAT;
>> -rq->cpu = -1;
>> +/* do not touch atomic flags, it needs atomic ops against the timer */
> 
> This comment was just removed in a previous patch but it snuck back in.

Eagle eyes - thanks, I will kill it.

-- 
Jens Axboe

Re: [PATCH 4/4] block: rearrange a few request fields for better cache layout

2018-01-10 Thread Omar Sandoval

On Tue, Jan 09, 2018 at 05:29:27PM -0700, Jens Axboe wrote:
> Move completion related items (like the call single data) near the
> end of the struct, instead of mixing them in with the initial
> queueing related fields.
> 
> Move queuelist below the bio structures. Then we have all
> queueing related bits in the first cache line.
> 
> This yields a 1.5-2% increase in IOPS for a null_blk test, both for
> sync and for high thread count access. Sync test goes form 975K to
> 992K, 32-thread case from 20.8M to 21.2M IOPS.

One nit below, otherwise

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  block/blk-mq.c | 19 ++-
>  include/linux/blkdev.h | 28 +++-
>  2 files changed, 25 insertions(+), 22 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7248ee043651..ec128001ea8b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -270,8 +270,6 @@ static struct request *blk_mq_rq_ctx_init(struct 
> blk_mq_alloc_data *data,
>   struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
>   struct request *rq = tags->static_rqs[tag];
>  
> - rq->rq_flags = 0;
> -
>   if (data->flags & BLK_MQ_REQ_INTERNAL) {
>   rq->tag = -1;
>   rq->internal_tag = tag;
> @@ -285,26 +283,23 @@ static struct request *blk_mq_rq_ctx_init(struct 
> blk_mq_alloc_data *data,
>   data->hctx->tags->rqs[rq->tag] = rq;
>   }
>  
> - INIT_LIST_HEAD(&rq->queuelist);
>   /* csd/requeue_work/fifo_time is initialized before use */
>   rq->q = data->q;
>   rq->mq_ctx = data->ctx;
> + rq->rq_flags = 0;
> + rq->cpu = -1;
>   rq->cmd_flags = op;
>   if (data->flags & BLK_MQ_REQ_PREEMPT)
>   rq->rq_flags |= RQF_PREEMPT;
>   if (blk_queue_io_stat(data->q))
>   rq->rq_flags |= RQF_IO_STAT;
> - rq->cpu = -1;
> + /* do not touch atomic flags, it needs atomic ops against the timer */

This comment was just removed in a previous patch but it snuck back in.

Re: [PATCH 3/4] block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit

2018-01-10 Thread Bart Van Assche

On Wed, 2018-01-10 at 11:35 -0700, Jens Axboe wrote:
> On 1/10/18 11:33 AM, Bart Van Assche wrote:
> > On Wed, 2018-01-10 at 11:32 -0700, Jens Axboe wrote:
> > > On 1/10/18 11:29 AM, Bart Van Assche wrote:
> > > > On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> > > > > @@ -313,8 +307,6 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, 
> > > > > struct request *rq)
> > > > >   seq_puts(m, ", .rq_flags=");
> > > > >   blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
> > > > >  ARRAY_SIZE(rqf_name));
> > > > > - seq_puts(m, ", .atomic_flags=");
> > > > > - blk_flags_show(m, rq->atomic_flags, rqaf_name, 
> > > > > ARRAY_SIZE(rqaf_name));
> > > > >   seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
> > > > >  rq->internal_tag);
> > > > >   if (mq_ops->show_rq)
> > > > 
> > > > Whether or not a request has been marked complete is very useful to 
> > > > know. Have you
> > > > considered to show the return value of blk_rq_is_complete() in the 
> > > > debugfs output?
> > > 
> > > Yeah, that's a good point. Let me add that in lieu of the atomic flags 
> > > that
> > > are being killed. Are you fine with the patch then?
> > 
> > The rest of the patch looks fine to me. This is the only comment I had 
> > about this patch.
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=blk-kill-atomic-flags&id=3d34009082f5e72137d6bb38cbc2ff6047f03fd1
> 
> Added a complete= entry for it.

For that patch: Reviewed-by: Bart Van Assche 

Thanks,

Bart.

Re: [PATCH 2/4] block: add accessors for setting/querying request deadline

2018-01-10 Thread Omar Sandoval

On Tue, Jan 09, 2018 at 05:29:25PM -0700, Jens Axboe wrote:
> We reduce the resolution of request expiry, but since we're already
> using jiffies for this where resolution depends on the kernel
> configuration and since the timeout resolution is coarse anyway,
> that should be fine.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  block/blk-mq.c |  2 +-
>  block/blk-timeout.c| 14 --
>  block/blk.h| 15 +++
>  include/linux/blkdev.h |  4 +++-
>  4 files changed, 27 insertions(+), 8 deletions(-)
>

Re: [PATCH 1/4] block: remove REQ_ATOM_POLL_SLEPT

2018-01-10 Thread Omar Sandoval

On Tue, Jan 09, 2018 at 05:29:24PM -0700, Jens Axboe wrote:
> We don't need this to be an atomic flag, it can be a regular
> flag. We either end up on the same CPU for the polling, in which
> case the state is sane, or we did the sleep which would imply
> the needed barrier to ensure we see the right state.

Second Bart's comment, looks good otherwise.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe

Re: [PATCH 3/4] block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit

2018-01-10 Thread Jens Axboe

On 1/10/18 11:33 AM, Bart Van Assche wrote:
> On Wed, 2018-01-10 at 11:32 -0700, Jens Axboe wrote:
>> On 1/10/18 11:29 AM, Bart Van Assche wrote:
>>> On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
 @@ -313,8 +307,6 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, 
 struct request *rq)
seq_puts(m, ", .rq_flags=");
blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
   ARRAY_SIZE(rqf_name));
 -  seq_puts(m, ", .atomic_flags=");
 -  blk_flags_show(m, rq->atomic_flags, rqaf_name, ARRAY_SIZE(rqaf_name));
seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
   rq->internal_tag);
if (mq_ops->show_rq)
>>>
>>> Whether or not a request has been marked complete is very useful to know. 
>>> Have you
>>> considered to show the return value of blk_rq_is_complete() in the debugfs 
>>> output?
>>
>> Yeah, that's a good point. Let me add that in lieu of the atomic flags that
>> are being killed. Are you fine with the patch then?
> 
> The rest of the patch looks fine to me. This is the only comment I had about 
> this patch.

http://git.kernel.dk/cgit/linux-block/commit/?h=blk-kill-atomic-flags&id=3d34009082f5e72137d6bb38cbc2ff6047f03fd1

Added a complete= entry for it.

-- 
Jens Axboe

Re: [PATCH 4/4] block: rearrange a few request fields for better cache layout

2018-01-10 Thread Bart Van Assche

On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> Move completion related items (like the call single data) near the
> end of the struct, instead of mixing them in with the initial
> queueing related fields.
> 
> Move queuelist below the bio structures. Then we have all
> queueing related bits in the first cache line.
> 
> This yields a 1.5-2% increase in IOPS for a null_blk test, both for
> sync and for high thread count access. Sync test goes form 975K to
> 992K, 32-thread case from 20.8M to 21.2M IOPS.

That's a nice result!

Reviewed-by: Bart Van Assche

Re: [PATCH 3/4] block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit

2018-01-10 Thread Bart Van Assche

On Wed, 2018-01-10 at 11:32 -0700, Jens Axboe wrote:
> On 1/10/18 11:29 AM, Bart Van Assche wrote:
> > On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> > > @@ -313,8 +307,6 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, 
> > > struct request *rq)
> > >   seq_puts(m, ", .rq_flags=");
> > >   blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
> > >  ARRAY_SIZE(rqf_name));
> > > - seq_puts(m, ", .atomic_flags=");
> > > - blk_flags_show(m, rq->atomic_flags, rqaf_name, ARRAY_SIZE(rqaf_name));
> > >   seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
> > >  rq->internal_tag);
> > >   if (mq_ops->show_rq)
> > 
> > Whether or not a request has been marked complete is very useful to know. 
> > Have you
> > considered to show the return value of blk_rq_is_complete() in the debugfs 
> > output?
> 
> Yeah, that's a good point. Let me add that in lieu of the atomic flags that
> are being killed. Are you fine with the patch then?

The rest of the patch looks fine to me. This is the only comment I had about 
this patch.

Bart.

Re: [PATCH 1/4] block: remove REQ_ATOM_POLL_SLEPT

2018-01-10 Thread Jens Axboe

On 1/10/18 11:25 AM, Bart Van Assche wrote:
> On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
>> We don't need this to be an atomic flag, it can be a regular
>> flag. We either end up on the same CPU for the polling, in which
>> case the state is sane, or we did the sleep which would imply
>> the needed barrier to ensure we see the right state.
> 
> Please consider updating rqf_name[] in blk-mq-debugfs.c.

Good catch - I'll do that, and also add a small prep patch that syncs
with the current situation, looks like we're missing the timeout
and zone locked flags.

-- 
Jens Axboe

Re: [PATCH 3/4] block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit

2018-01-10 Thread Jens Axboe

On 1/10/18 11:29 AM, Bart Van Assche wrote:
> On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
>> @@ -313,8 +307,6 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, struct 
>> request *rq)
>>  seq_puts(m, ", .rq_flags=");
>>  blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
>> ARRAY_SIZE(rqf_name));
>> -seq_puts(m, ", .atomic_flags=");
>> -blk_flags_show(m, rq->atomic_flags, rqaf_name, ARRAY_SIZE(rqaf_name));
>>  seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
>> rq->internal_tag);
>>  if (mq_ops->show_rq)
> 
> Whether or not a request has been marked complete is very useful to know. 
> Have you
> considered to show the return value of blk_rq_is_complete() in the debugfs 
> output?

Yeah, that's a good point. Let me add that in lieu of the atomic flags that
are being killed. Are you fine with the patch then?

-- 
Jens Axboe

Re: [PATCH 3/4] block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit

2018-01-10 Thread Bart Van Assche

On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> @@ -313,8 +307,6 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, struct 
> request *rq)
>   seq_puts(m, ", .rq_flags=");
>   blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
>  ARRAY_SIZE(rqf_name));
> - seq_puts(m, ", .atomic_flags=");
> - blk_flags_show(m, rq->atomic_flags, rqaf_name, ARRAY_SIZE(rqaf_name));
>   seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
>  rq->internal_tag);
>   if (mq_ops->show_rq)

Whether or not a request has been marked complete is very useful to know. Have 
you
considered to show the return value of blk_rq_is_complete() in the debugfs 
output?

Thanks,

Bart.

Re: [PATCH 2/4] block: add accessors for setting/querying request deadline

2018-01-10 Thread Bart Van Assche

On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> We reduce the resolution of request expiry, but since we're already
> using jiffies for this where resolution depends on the kernel
> configuration and since the timeout resolution is coarse anyway,
> that should be fine.

Reviewed-by: Bart Van Assche

Re: [PATCH 1/4] block: remove REQ_ATOM_POLL_SLEPT

2018-01-10 Thread Bart Van Assche

On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> We don't need this to be an atomic flag, it can be a regular
> flag. We either end up on the same CPU for the polling, in which
> case the state is sane, or we did the sleep which would imply
> the needed barrier to ensure we see the right state.

Please consider updating rqf_name[] in blk-mq-debugfs.c. Anyway:

Reviewed-by: Bart Van Assche

[PATCH v2 2/4] block: Introduce blk_start_wait_if_quiesced() and blk_finish_wait_if_quiesced()

2018-01-10 Thread Bart Van Assche

Introduce functions that allow block drivers to wait while a request
queue is in the quiesced state (blk-mq) or in the stopped state (legacy
block layer). The next patch will add calls to these functions in the
SCSI core.

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Ming Lei 
---
 block/blk-core.c   |  1 +
 block/blk-mq.c | 64 ++
 include/linux/blk-mq.h |  2 ++
 3 files changed, 67 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index c10b4ce95248..06eaea15bae9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -287,6 +287,7 @@ void blk_start_queue(struct request_queue *q)
WARN_ON_ONCE(q->mq_ops);
 
queue_flag_clear(QUEUE_FLAG_STOPPED, q);
+   wake_up_all(&q->mq_wq);
__blk_run_queue(q);
 }
 EXPORT_SYMBOL(blk_start_queue);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a05ea7e9b415..87455977ad34 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -247,11 +247,75 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
spin_unlock_irqrestore(q->queue_lock, flags);
 
+   wake_up_all(&q->mq_wq);
+
/* dispatch requests which are inserted during quiescing */
blk_mq_run_hw_queues(q, true);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
 
+/**
+ * blk_start_wait_if_quiesced() - wait if a queue is quiesced (blk-mq) or 
stopped (legacy block layer)
+ * @q: Request queue pointer.
+ *
+ * Some block drivers, e.g. the SCSI core, can bypass the block layer core
+ * request submission mechanism. Surround such code with
+ * blk_start_wait_if_quiesced() / blk_finish_wait_if_quiesced() to avoid that
+ * request submission can happen while a queue is quiesced or stopped.
+ *
+ * Returns with the RCU read lock held (blk-mq) or with q->queue_lock held
+ * (legacy block layer).
+ *
+ * Notes:
+ * - Every call of this function must be followed by a call of
+ *   blk_finish_wait_if_quiesced().
+ * - This function does not support block drivers whose .queue_rq()
+ *   implementation can sleep (BLK_MQ_F_BLOCKING).
+ */
+int blk_start_wait_if_quiesced(struct request_queue *q)
+{
+   struct blk_mq_hw_ctx *hctx;
+   unsigned int i;
+
+   might_sleep();
+
+   if (q->mq_ops) {
+   queue_for_each_hw_ctx(q, hctx, i)
+   WARN_ON(hctx->flags & BLK_MQ_F_BLOCKING);
+
+   rcu_read_lock();
+   while (!blk_queue_dying(q) && blk_queue_quiesced(q)) {
+   rcu_read_unlock();
+   wait_event(q->mq_wq, blk_queue_dying(q) ||
+  !blk_queue_quiesced(q));
+   rcu_read_lock();
+   }
+   } else {
+   spin_lock_irq(q->queue_lock);
+   wait_event_lock_irq(q->mq_wq,
+   blk_queue_dying(q) || !blk_queue_stopped(q),
+   *q->queue_lock);
+   q->request_fn_active++;
+   }
+   return blk_queue_dying(q) ? -ENODEV : 0;
+}
+EXPORT_SYMBOL(blk_start_wait_if_quiesced);
+
+/**
+ * blk_finish_wait_if_quiesced() - counterpart of blk_start_wait_if_quiesced()
+ * @q: Request queue pointer.
+ */
+void blk_finish_wait_if_quiesced(struct request_queue *q)
+{
+   if (q->mq_ops) {
+   rcu_read_unlock();
+   } else {
+   q->request_fn_active--;
+   spin_unlock_irq(q->queue_lock);
+   }
+}
+EXPORT_SYMBOL(blk_finish_wait_if_quiesced);
+
 void blk_mq_wake_waiters(struct request_queue *q)
 {
struct blk_mq_hw_ctx *hctx;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49796a3..15912cd348b5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -267,6 +267,8 @@ void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx 
*hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_quiesce_queue(struct request_queue *q);
 void blk_mq_unquiesce_queue(struct request_queue *q);
+int blk_start_wait_if_quiesced(struct request_queue *q);
+void blk_finish_wait_if_quiesced(struct request_queue *q);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long 
msecs);
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
-- 
2.15.1

[PATCH v2 3/4] scsi: Avoid that .queuecommand() gets called for a quiesced SCSI device

2018-01-10 Thread Bart Van Assche

Several SCSI transport and LLD drivers surround code that does not
tolerate concurrent calls of .queuecommand() with scsi_target_block() /
scsi_target_unblock(). These last two functions use
blk_mq_quiesce_queue() / blk_mq_unquiesce_queue() for scsi-mq request
queues to prevent concurrent .queuecommand() calls. However, that is
not sufficient to prevent .queuecommand() calls from scsi_send_eh_cmnd().
Hence surround the .queuecommand() call from the SCSI error handler with
blk_start_wait_if_quiesced() / blk_finish_wait_if_quiesced().

Note: converting the .queuecommand() call in scsi_send_eh_cmnd() into
code that calls blk_get_request(), e.g. scsi_execute_req(), is not an
option since scsi_send_eh_cmnd() can be called if all requests are
allocated and if no requests will make progress without aborting any
of these requests.

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Cc: Martin K. Petersen 
Cc: Christoph Hellwig 
Cc: Johannes Thumshirn 
Cc: Ming Lei 
---
 drivers/scsi/scsi_error.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 62b56de38ae8..f7154ea86715 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1016,6 +1016,7 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, 
unsigned char *cmnd,
 {
struct scsi_device *sdev = scmd->device;
struct Scsi_Host *shost = sdev->host;
+   struct request_queue *q = sdev->request_queue;
DECLARE_COMPLETION_ONSTACK(done);
unsigned long timeleft = timeout;
struct scsi_eh_save ses;
@@ -1028,7 +1029,9 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, 
unsigned char *cmnd,
 
scsi_log_send(scmd);
scmd->scsi_done = scsi_eh_done;
+   blk_start_wait_if_quiesced(q);
rtn = shost->hostt->queuecommand(shost, scmd);
+   blk_finish_wait_if_quiesced(q);
if (rtn) {
if (timeleft > stall_for) {
scsi_eh_restore_cmnd(scmd, &ses);
-- 
2.15.1

[PATCH v2 0/4] Make SCSI transport recovery more robust

2018-01-10 Thread Bart Van Assche

Hello Jens,

A longstanding issue with the SCSI core is that several SCSI transport drivers
use scsi_target_block() and scsi_target_unblock() to avoid concurrent
.queuecommand() calls during e.g. transport recovery but that this is not
sufficient to protect from such calls. Hence this patch series. An additional
benefit of this patch series is that it allows to remove an ugly hack from
the SRP initiator driver. Please consider this patch series for kernel v4.16.

Thanks,

Bart.

Changes compared to v1:
- Renamed blk_wait_if_quiesced() into blk_start_wait_if_quiesced().
- Mentioned in the comment above blk_start_wait_if_quiesced() that every call
  of this function has to be followed by a call of
  blk_finish_wait_if_quiesced().

Bart Van Assche (4):
  blk-mq: Rename request_queue.mq_freeze_wq into mq_wq
  block: Introduce blk_start_wait_if_quiesced() and
blk_finish_wait_if_quiesced()
  scsi: Avoid that .queuecommand() gets called for a quiesced SCSI
device
  IB/srp: Fix a sleep-in-invalid-context bug

 block/blk-core.c| 11 +++---
 block/blk-mq.c  | 74 ++---
 drivers/infiniband/ulp/srp/ib_srp.c | 21 +--
 drivers/scsi/scsi_error.c   |  3 ++
 include/linux/blk-mq.h  |  2 +
 include/linux/blkdev.h  |  2 +-
 6 files changed, 83 insertions(+), 30 deletions(-)

-- 
2.15.1

[PATCH v2 4/4] IB/srp: Fix a sleep-in-invalid-context bug

2018-01-10 Thread Bart Van Assche

The previous two patches guarantee that srp_queuecommand() does not get
invoked while reconnecting occurs. Hence remove the code from
srp_queuecommand() that prevents command queueing while reconnecting.
This patch avoids that the following can appear in the kernel log:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 5600, name: scsi_eh_9
1 lock held by scsi_eh_9/5600:
 #0:  (rcu_read_lock){}, at: [] 
__blk_mq_run_hw_queue+0xf1/0x1e0
Preemption disabled at:
[<139badf2>] __blk_mq_delay_run_hw_queue+0x78/0xf0
CPU: 9 PID: 5600 Comm: scsi_eh_9 Tainted: GW4.15.0-rc4-dbg+ #1
Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 2.5.4 01/22/2016
Call Trace:
 dump_stack+0x67/0x99
 ___might_sleep+0x16a/0x250 [ib_srp]
 __mutex_lock+0x46/0x9d0
 srp_queuecommand+0x356/0x420 [ib_srp]
 scsi_dispatch_cmd+0xf6/0x3f0
 scsi_queue_rq+0x4a8/0x5f0
 blk_mq_dispatch_rq_list+0x73/0x440
 blk_mq_sched_dispatch_requests+0x109/0x1a0
 __blk_mq_run_hw_queue+0x131/0x1e0
 __blk_mq_delay_run_hw_queue+0x9a/0xf0
 blk_mq_run_hw_queue+0xc0/0x1e0
 blk_mq_start_hw_queues+0x2c/0x40
 scsi_run_queue+0x18e/0x2d0
 scsi_run_host_queues+0x22/0x40
 scsi_error_handler+0x18d/0x5f0
 kthread+0x11c/0x140
 ret_from_fork+0x24/0x30

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Cc: Jason Gunthorpe 
Cc: Doug Ledford 
---
 drivers/infiniband/ulp/srp/ib_srp.c | 21 ++---
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index 972d4b3c5223..670f187ecb91 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -2149,7 +2149,6 @@ static void srp_handle_qp_err(struct ib_cq *cq, struct 
ib_wc *wc,
 static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
 {
struct srp_target_port *target = host_to_target(shost);
-   struct srp_rport *rport = target->rport;
struct srp_rdma_ch *ch;
struct srp_request *req;
struct srp_iu *iu;
@@ -2159,16 +2158,6 @@ static int srp_queuecommand(struct Scsi_Host *shost, 
struct scsi_cmnd *scmnd)
u32 tag;
u16 idx;
int len, ret;
-   const bool in_scsi_eh = !in_interrupt() && current == shost->ehandler;
-
-   /*
-* The SCSI EH thread is the only context from which srp_queuecommand()
-* can get invoked for blocked devices (SDEV_BLOCK /
-* SDEV_CREATED_BLOCK). Avoid racing with srp_reconnect_rport() by
-* locking the rport mutex if invoked from inside the SCSI EH.
-*/
-   if (in_scsi_eh)
-   mutex_lock(&rport->mutex);
 
scmnd->result = srp_chkready(target->rport);
if (unlikely(scmnd->result))
@@ -2230,13 +2219,7 @@ static int srp_queuecommand(struct Scsi_Host *shost, 
struct scsi_cmnd *scmnd)
goto err_unmap;
}
 
-   ret = 0;
-
-unlock_rport:
-   if (in_scsi_eh)
-   mutex_unlock(&rport->mutex);
-
-   return ret;
+   return 0;
 
 err_unmap:
srp_unmap_data(scmnd, ch, req);
@@ -2258,7 +2241,7 @@ static int srp_queuecommand(struct Scsi_Host *shost, 
struct scsi_cmnd *scmnd)
ret = SCSI_MLQUEUE_HOST_BUSY;
}
 
-   goto unlock_rport;
+   return ret;
 }
 
 /*
-- 
2.15.1

[PATCH v2 1/4] blk-mq: Rename request_queue.mq_freeze_wq into mq_wq

2018-01-10 Thread Bart Van Assche

Rename a waitqueue in struct request_queue since the next patch will
add code that uses this waitqueue outside the request queue freezing
implementation.

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Johannes Thumshirn 
Cc: Ming Lei 
---
 block/blk-core.c   | 10 +-
 block/blk-mq.c | 10 +-
 include/linux/blkdev.h |  2 +-
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fa1cb95f7f6a..c10b4ce95248 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -377,7 +377,7 @@ void blk_clear_preempt_only(struct request_queue *q)
 
spin_lock_irqsave(q->queue_lock, flags);
queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
@@ -648,7 +648,7 @@ void blk_set_queue_dying(struct request_queue *q)
}
 
/* Make blk_queue_enter() reexamine the DYING flag. */
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -851,7 +851,7 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 */
smp_rmb();
 
-   ret = wait_event_interruptible(q->mq_freeze_wq,
+   ret = wait_event_interruptible(q->mq_wq,
(atomic_read(&q->mq_freeze_depth) == 0 &&
 (preempt || !blk_queue_preempt_only(q))) ||
blk_queue_dying(q));
@@ -872,7 +872,7 @@ static void blk_queue_usage_counter_release(struct 
percpu_ref *ref)
struct request_queue *q =
container_of(ref, struct request_queue, q_usage_counter);
 
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
 }
 
 static void blk_rq_timed_out_timer(struct timer_list *t)
@@ -948,7 +948,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
q->bypass_depth = 1;
__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
-   init_waitqueue_head(&q->mq_freeze_wq);
+   init_waitqueue_head(&q->mq_wq);
 
/*
 * Init percpu_ref in atomic mode so that it's faster to shutdown.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 29f140b4dbf7..a05ea7e9b415 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -137,16 +137,16 @@ EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
 void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-   wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
+   wait_event(q->mq_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout)
 {
-   return wait_event_timeout(q->mq_freeze_wq,
-   percpu_ref_is_zero(&q->q_usage_counter),
-   timeout);
+   return wait_event_timeout(q->mq_wq,
+ percpu_ref_is_zero(&q->q_usage_counter),
+ timeout);
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
 
@@ -185,7 +185,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
WARN_ON_ONCE(freeze_depth < 0);
if (!freeze_depth) {
percpu_ref_reinit(&q->q_usage_counter);
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
}
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50fb1c18ec54..2c74c03a9d5f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -638,7 +638,7 @@ struct request_queue {
struct throtl_data *td;
 #endif
struct rcu_head rcu_head;
-   wait_queue_head_t   mq_freeze_wq;
+   wait_queue_head_t   mq_wq;
struct percpu_ref   q_usage_counter;
struct list_headall_q_node;
 
-- 
2.15.1

[PATCH] blk-mq: Rework blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche

Use prepare_to_wait() and finish_wait() instead of open-coding these
functions. Reduce the number of if-statements to make
blk_mq_mark_tag_wait() easier to read. Both add_wait_queue() and
blk_mq_dispatch_wake() protect wait queue manipulations with the wait
queue lock. Hence also protect the !list_empty(&wait->entry) test with
the wait queue lock instead of the hctx lock.

Note: a side effect of this patch is that the task state is changed
for shared queues before and after the blk_mq_get_driver_tag().
Since blk_mq_dispatch_wake() ignores the task state these task state
changes do not affect which task gets woken up.

See also commit f906a6a0f426 ("blk-mq: improve tag waiting setup for
non-shared tags").

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq.c | 49 +
 1 file changed, 17 insertions(+), 32 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 61bdbf45c3be..29f140b4dbf7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1121,50 +1121,35 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
if (!shared_tags) {
if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
set_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state);
+   ret = blk_mq_get_driver_tag(rq, hctx, false);
+   /*
+* Don't clear RESTART here, someone else could have set it.
+* At most this will cost an extra queue run.
+*/
} else {
wait = &this_hctx->dispatch_wait;
if (!list_empty_careful(&wait->entry))
return false;
 
-   spin_lock(&this_hctx->lock);
-   if (!list_empty(&wait->entry)) {
-   spin_unlock(&this_hctx->lock);
-   return false;
-   }
-
ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
-   add_wait_queue(&ws->wait, wait);
-   }
-
-   /*
-* It's possible that a tag was freed in the window between the
-* allocation failure and adding the hardware queue to the wait
-* queue.
-*/
-   ret = blk_mq_get_driver_tag(rq, hctx, false);
+   prepare_to_wait(&ws->wait, wait, TASK_UNINTERRUPTIBLE);
 
-   if (!shared_tags) {
/*
-* Don't clear RESTART here, someone else could have set it.
-* At most this will cost an extra queue run.
+* It's possible that a tag was freed in the window between the
+* allocation failure and adding the hardware queue to the wait
+* queue.
 */
-   return ret;
-   } else {
-   if (!ret) {
-   spin_unlock(&this_hctx->lock);
-   return false;
-   }
-
+   ret = blk_mq_get_driver_tag(rq, hctx, false);
/*
-* We got a tag, remove ourselves from the wait queue to ensure
-* someone else gets the wakeup.
+* If we got a tag, remove ourselves from the wait queue to
+* ensure someone else gets the wakeup.
 */
-   spin_lock_irq(&ws->wait.lock);
-   list_del_init(&wait->entry);
-   spin_unlock_irq(&ws->wait.lock);
-   spin_unlock(&this_hctx->lock);
-   return true;
+   if (ret)
+   finish_wait(&ws->wait, wait);
+   else
+   __set_current_state(TASK_RUNNING);
}
+   return ret;
 }
 
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
-- 
2.15.1

Re: [PATCHv2 0/5] nvme/dm failover unification

2018-01-10 Thread Jens Axboe

On 1/10/18 1:28 AM, Christoph Hellwig wrote:
> The whole series looks fine to me:
> 
> Reviewed-by: Christoph Hellwig 
> 
> Jens, do you want me to apply this to the nvme tree, or pick it up
> directly?

I queued it up, thanks.

-- 
Jens Axboe

Re: [PATCH v2 1/2] block: fail op_is_write() requests to read-only partitions

2018-01-10 Thread Christoph Hellwig

On Wed, Jan 10, 2018 at 09:34:02AM -0700, Jens Axboe wrote:
> It's yet another check that adds part lookup and rcu lock/unlock in that
> path. Can we combine some of them? Make this part of the remap?  This
> overhead impacts every IO, let's not bloat it more than absolutely
> necessary.

Yes, we should be able to integrate this with the partition remap.

Re: [PATCH v2] block: silently forbid sending any ioctl to a partition

2018-01-10 Thread Christoph Hellwig

On Wed, Jan 10, 2018 at 04:54:52PM +0100, Paolo Bonzini wrote:
> After the first few months, the message has not led to many bug reports.
> It's been almost five years now, and in practice the main source of
> it seems to be MTIOCGET that someone is using to detect tape devices.
> While we could whitelist it just like CDROM_GET_CAPABILITY, this patch
> just removes the message altogether.
> 
> The patch also removes the "safe but not very useful" ioctl whitelist,
> as suggested by Christoph.  I doubt anything is using most of those
> ioctls _in general_, let alone on a partition.
> 
> Signed-off-by: Paolo Bonzini 

Nice, thanks!

Reviewed-by: Christoph Hellwig

Re: [PATCH v2 1/2] block: fail op_is_write() requests to read-only partitions

2018-01-10 Thread Ilya Dryomov

On Wed, Jan 10, 2018 at 5:34 PM, Jens Axboe  wrote:
> On 1/10/18 9:18 AM, Ilya Dryomov wrote:
>> Regular block device writes go through blkdev_write_iter(), which does
>> bdev_read_only(), while zeroout/discard/etc requests are never checked,
>> both userspace- and kernel-triggered.  Add a generic catch-all check to
>> generic_make_request_checks() to actually enforce ioctl(BLKROSET) and
>> set_disk_ro(), which is used by quite a few drivers for things like
>> snapshots, read-only backing files/images, etc.
>>
>> Signed-off-by: Ilya Dryomov 
>> Reviewed-by: Sagi Grimberg 
>> ---
>>  block/blk-core.c | 23 ++-
>>  1 file changed, 22 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index f843ae4f858d..d132bec4a266 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2123,6 +2123,20 @@ static inline int bio_check_eod(struct bio *bio, 
>> unsigned int nr_sectors)
>>   return 0;
>>  }
>>
>> +static inline bool bio_check_ro(struct bio *bio)
>> +{
>> + struct hd_struct *p;
>> + bool ret = false;
>> +
>> + rcu_read_lock();
>> + p = __disk_get_part(bio->bi_disk, bio->bi_partno);
>> + if (!p || (p->policy && op_is_write(bio_op(bio
>> + ret = true;
>> + rcu_read_unlock();
>> +
>> + return ret;
>> +}> +
>>  static noinline_for_stack bool
>>  generic_make_request_checks(struct bio *bio)
>>  {
>> @@ -2145,11 +2159,18 @@ generic_make_request_checks(struct bio *bio)
>>   goto end_io;
>>   }
>>
>> + if (unlikely(bio_check_ro(bio))) {
>> + printk(KERN_ERR
>> +"generic_make_request: Trying to write "
>> + "to read-only block-device %s (partno %d)\n",
>> + bio_devname(bio, b), bio->bi_partno);
>> + goto end_io;
>> + }
>
> It's yet another check that adds part lookup and rcu lock/unlock in that
> path. Can we combine some of them? Make this part of the remap?  This
> overhead impacts every IO, let's not bloat it more than absolutely
> necessary.

Yes, combining with should_fail_request check in remap should be easy
enough.  I considered it, but opted for the less invasive patch.  I'll
re-spin.

Thanks,

Ilya

Re: [PATCH] blk-mq: Explain when 'active_queues' is decremented

2018-01-10 Thread Jens Axboe

On 1/10/18 9:33 AM, Bart Van Assche wrote:
> It is nontrivial to derive from the blk-mq source code when
> blk_mq_tags.active_queues is decremented. Hence add a comment that
> explains this.

That is how it works, applied the patch, thanks Bart.

-- 
Jens Axboe

[PATCH v2] bdi: show error log when fail to create bdi debugfs entry

2018-01-10 Thread weiping zhang

bdi debugfs dir/file may create fail, add error log here.

Signed-off-by: weiping zhang 
---
V1->V2:
fix indentation and make log message more clear

 mm/backing-dev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b5f940c..0a49665 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -885,7 +885,9 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
char *fmt, va_list args)
cgwb_bdi_register(bdi);
bdi->dev = dev;
 
-   bdi_debug_register(bdi, dev_name(dev));
+   if (bdi_debug_register(bdi, dev_name(dev)))
+   pr_warn("blkdev %s: creation of bdi debugfs entries failed.\n",
+   dev_name(dev));
set_bit(WB_registered, &bdi->wb.state);
 
spin_lock_bh(&bdi_lock);
-- 
2.9.4

Re: [PATCH v2 1/2] block: fail op_is_write() requests to read-only partitions

2018-01-10 Thread Jens Axboe

On 1/10/18 9:18 AM, Ilya Dryomov wrote:
> Regular block device writes go through blkdev_write_iter(), which does
> bdev_read_only(), while zeroout/discard/etc requests are never checked,
> both userspace- and kernel-triggered.  Add a generic catch-all check to
> generic_make_request_checks() to actually enforce ioctl(BLKROSET) and
> set_disk_ro(), which is used by quite a few drivers for things like
> snapshots, read-only backing files/images, etc.
> 
> Signed-off-by: Ilya Dryomov 
> Reviewed-by: Sagi Grimberg 
> ---
>  block/blk-core.c | 23 ++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index f843ae4f858d..d132bec4a266 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2123,6 +2123,20 @@ static inline int bio_check_eod(struct bio *bio, 
> unsigned int nr_sectors)
>   return 0;
>  }
>  
> +static inline bool bio_check_ro(struct bio *bio)
> +{
> + struct hd_struct *p;
> + bool ret = false;
> +
> + rcu_read_lock();
> + p = __disk_get_part(bio->bi_disk, bio->bi_partno);
> + if (!p || (p->policy && op_is_write(bio_op(bio
> + ret = true;
> + rcu_read_unlock();
> +
> + return ret;
> +}> +
>  static noinline_for_stack bool
>  generic_make_request_checks(struct bio *bio)
>  {
> @@ -2145,11 +2159,18 @@ generic_make_request_checks(struct bio *bio)
>   goto end_io;
>   }
>  
> + if (unlikely(bio_check_ro(bio))) {
> + printk(KERN_ERR
> +"generic_make_request: Trying to write "
> + "to read-only block-device %s (partno %d)\n",
> + bio_devname(bio, b), bio->bi_partno);
> + goto end_io;
> + }

It's yet another check that adds part lookup and rcu lock/unlock in that
path. Can we combine some of them? Make this part of the remap?  This
overhead impacts every IO, let's not bloat it more than absolutely
necessary.

-- 
Jens Axboe


-- 
Jens Axboe

[PATCH] blk-mq: Explain when 'active_queues' is decremented

2018-01-10 Thread Bart Van Assche

It is nontrivial to derive from the blk-mq source code when
blk_mq_tags.active_queues is decremented. Hence add a comment that
explains this.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9aa24c9508f9..266fc4f6b046 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -954,6 +954,12 @@ static void blk_mq_timeout_work(struct work_struct *work)
data.next = blk_rq_timeout(round_jiffies_up(data.next));
mod_timer(&q->timeout, data.next);
} else {
+   /*
+* Request timeouts are handled as a forward rolling timer. If
+* we end up here it means that no requests are pending and
+* also that no request has been pending for a while. Mark
+* each hctx as idle.
+*/
queue_for_each_hw_ctx(q, hctx, i) {
/* the hctx may be unmapped, so check it here */
if (blk_mq_hw_queue_mapped(hctx))
-- 
2.15.1

Re: [PATCH] bdi: show error log when fail to create bdi debugfs entry

2018-01-10 Thread weiping zhang

2018-01-11 0:10 GMT+08:00 Bart Van Assche :
> On Wed, 2018-01-10 at 23:18 +0800, weiping zhang wrote:
>> bdi debugfs dir/file may create fail, add error log here.
>>
>> Signed-off-by: weiping zhang 
>> ---
>>  mm/backing-dev.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>> index b5f940c..9117c21 100644
>> --- a/mm/backing-dev.c
>> +++ b/mm/backing-dev.c
>> @@ -885,7 +885,8 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
>> char *fmt, va_list args)
>>   cgwb_bdi_register(bdi);
>>   bdi->dev = dev;
>>
>> - bdi_debug_register(bdi, dev_name(dev));
>> +  if (bdi_debug_register(bdi, dev_name(dev)))
>> +  pr_warn("blkdev %s create bdi debugfs failed\n", 
>> dev_name(dev));
>>   set_bit(WB_registered, &bdi->wb.state);
>>
>>   spin_lock_bh(&bdi_lock);
>
> The indentation of the if-statement is inconsistent. Have you verified this
> patch with checkpatch before submitting it?
>
> Additionally, the error message is hard to understand. Please make it more 
> clear,
> e.g. as follows: "%s: creation of bdi debugfs entries failed.\n".
>
OK, fix them at V2, thanks a lot
> Thanks,
>
> Bart.

[PATCH v2 2/2] block: add bdev_read_only() checks to common helpers

2018-01-10 Thread Ilya Dryomov

Similar to blkdev_write_iter(), return -EPERM if the partition is
read-only.  This covers ioctl(), fallocate() and most in-kernel users
but isn't meant to be exhaustive -- everything else will be caught in
generic_make_request_checks(), fail with -EIO and can be fixed later.

Signed-off-by: Ilya Dryomov 
---
 block/blk-lib.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2bc544ce3d2e..a676084d4740 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -37,6 +37,9 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
if (!q)
return -ENXIO;
 
+   if (bdev_read_only(bdev))
+   return -EPERM;
+
if (flags & BLKDEV_DISCARD_SECURE) {
if (!blk_queue_secure_erase(q))
return -EOPNOTSUPP;
@@ -156,6 +159,9 @@ static int __blkdev_issue_write_same(struct block_device 
*bdev, sector_t sector,
if (!q)
return -ENXIO;
 
+   if (bdev_read_only(bdev))
+   return -EPERM;
+
bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
if ((sector | nr_sects) & bs_mask)
return -EINVAL;
@@ -233,6 +239,9 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
if (!q)
return -ENXIO;
 
+   if (bdev_read_only(bdev))
+   return -EPERM;
+
/* Ensure that max_write_zeroes_sectors doesn't overflow bi_size */
max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev);
 
@@ -287,6 +296,9 @@ static int __blkdev_issue_zero_pages(struct block_device 
*bdev,
if (!q)
return -ENXIO;
 
+   if (bdev_read_only(bdev))
+   return -EPERM;
+
while (nr_sects != 0) {
bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
   gfp_mask);
-- 
2.4.3

[PATCH v2 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()

2018-01-10 Thread Ilya Dryomov

Hello,

I was doing some cleanup work on rbd BLKROSET handler and discovered
that we ignore partition rw/ro setting (hd_struct->policy) for pretty
much everything but straight writes.

David (CCed) has blktests patches standing by.

(Another aspect of this is that we don't enforce open(2) mode.  Tejun
took a stab at this a few years ago, but his patch had to be reverted:

  75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()")
  e51900f7d38c ("block: revert block_dev read-only check")

It is a separate issue and refusing writes to read-only devices is
obviously more important, but perhaps it's time to revisit that as
well?)

v1 -> v2:
- added unlikely() per Sagi's suggestion

Thanks,

Ilya


Ilya Dryomov (2):
  block: fail op_is_write() requests to read-only partitions
  block: add bdev_read_only() checks to common helpers

 block/blk-core.c | 23 ++-
 block/blk-lib.c  | 12 
 2 files changed, 34 insertions(+), 1 deletion(-)

-- 
2.4.3

[PATCH v2 1/2] block: fail op_is_write() requests to read-only partitions

2018-01-10 Thread Ilya Dryomov

Regular block device writes go through blkdev_write_iter(), which does
bdev_read_only(), while zeroout/discard/etc requests are never checked,
both userspace- and kernel-triggered.  Add a generic catch-all check to
generic_make_request_checks() to actually enforce ioctl(BLKROSET) and
set_disk_ro(), which is used by quite a few drivers for things like
snapshots, read-only backing files/images, etc.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Sagi Grimberg 
---
 block/blk-core.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f843ae4f858d..d132bec4a266 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2123,6 +2123,20 @@ static inline int bio_check_eod(struct bio *bio, 
unsigned int nr_sectors)
return 0;
 }
 
+static inline bool bio_check_ro(struct bio *bio)
+{
+   struct hd_struct *p;
+   bool ret = false;
+
+   rcu_read_lock();
+   p = __disk_get_part(bio->bi_disk, bio->bi_partno);
+   if (!p || (p->policy && op_is_write(bio_op(bio
+   ret = true;
+   rcu_read_unlock();
+
+   return ret;
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -2145,11 +2159,18 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
+   if (unlikely(bio_check_ro(bio))) {
+   printk(KERN_ERR
+  "generic_make_request: Trying to write "
+   "to read-only block-device %s (partno %d)\n",
+   bio_devname(bio, b), bio->bi_partno);
+   goto end_io;
+   }
+
/*
 * For a REQ_NOWAIT based request, return -EOPNOTSUPP
 * if queue is not a request based queue.
 */
-
if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
goto not_supported;
 
-- 
2.4.3

Re: [PATCH] bdi: show error log when fail to create bdi debugfs entry

2018-01-10 Thread Bart Van Assche

On Wed, 2018-01-10 at 23:18 +0800, weiping zhang wrote:
> bdi debugfs dir/file may create fail, add error log here.
> 
> Signed-off-by: weiping zhang 
> ---
>  mm/backing-dev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index b5f940c..9117c21 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -885,7 +885,8 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
> char *fmt, va_list args)
>   cgwb_bdi_register(bdi);
>   bdi->dev = dev;
>  
> - bdi_debug_register(bdi, dev_name(dev));
> +  if (bdi_debug_register(bdi, dev_name(dev)))
> +  pr_warn("blkdev %s create bdi debugfs failed\n", 
> dev_name(dev));
>   set_bit(WB_registered, &bdi->wb.state);
>  
>   spin_lock_bh(&bdi_lock);

The indentation of the if-statement is inconsistent. Have you verified this
patch with checkpatch before submitting it?

Additionally, the error message is hard to understand. Please make it more 
clear,
e.g. as follows: "%s: creation of bdi debugfs entries failed.\n".

Thanks,

Bart.

[PATCH v2] block: silently forbid sending any ioctl to a partition

2018-01-10 Thread Paolo Bonzini

After the first few months, the message has not led to many bug reports.
It's been almost five years now, and in practice the main source of
it seems to be MTIOCGET that someone is using to detect tape devices.
While we could whitelist it just like CDROM_GET_CAPABILITY, this patch
just removes the message altogether.

The patch also removes the "safe but not very useful" ioctl whitelist,
as suggested by Christoph.  I doubt anything is using most of those
ioctls _in general_, let alone on a partition.

Signed-off-by: Paolo Bonzini 
---
v1->v2: disable the legacy ioctls too, hence changing the patch subject

 block/scsi_ioctl.c | 29 -
 1 file changed, 29 deletions(-)

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index edcfff974527..07988eebb66a 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -692,38 +692,9 @@ int scsi_verify_blk_ioctl(struct block_device *bd, 
unsigned int cmd)
if (bd && bd == bd->bd_contains)
return 0;
 
-   /* Actually none of these is particularly useful on a partition,
-* but they are safe.
-*/
-   switch (cmd) {
-   case SCSI_IOCTL_GET_IDLUN:
-   case SCSI_IOCTL_GET_BUS_NUMBER:
-   case SCSI_IOCTL_GET_PCI:
-   case SCSI_IOCTL_PROBE_HOST:
-   case SG_GET_VERSION_NUM:
-   case SG_SET_TIMEOUT:
-   case SG_GET_TIMEOUT:
-   case SG_GET_RESERVED_SIZE:
-   case SG_SET_RESERVED_SIZE:
-   case SG_EMULATED_HOST:
-   return 0;
-   case CDROM_GET_CAPABILITY:
-   /* Keep this until we remove the printk below.  udev sends it
-* and we do not want to spam dmesg about it.   CD-ROMs do
-* not have partitions, so we get here only for disks.
-*/
-   return -ENOIOCTLCMD;
-   default:
-   break;
-   }
-
if (capable(CAP_SYS_RAWIO))
return 0;
 
-   /* In particular, rule out all resets and host-specific ioctls.  */
-   printk_ratelimited(KERN_WARNING
-  "%s: sending ioctl %x to a partition!\n", 
current->comm, cmd);
-
return -ENOIOCTLCMD;
 }
 EXPORT_SYMBOL(scsi_verify_blk_ioctl);
-- 
2.14.3

Re: [PATCH V4 13/45] block: blk-merge: try to make front segments in full size

2018-01-10 Thread Dmitry Osipenko

On 10.01.2018 05:40, Ming Lei wrote:
> On Tue, Jan 09, 2018 at 08:02:53PM +0300, Dmitry Osipenko wrote:
>> On 09.01.2018 17:33, Ming Lei wrote:
>>> On Tue, Jan 09, 2018 at 04:18:39PM +0300, Dmitry Osipenko wrote:
 On 09.01.2018 05:34, Ming Lei wrote:
> On Tue, Jan 09, 2018 at 12:09:27AM +0300, Dmitry Osipenko wrote:
>> On 18.12.2017 15:22, Ming Lei wrote:
>>> When merging one bvec into segment, if the bvec is too big
>>> to merge, current policy is to move the whole bvec into another
>>> new segment.
>>>
>>> This patchset changes the policy into trying to maximize size of
>>> front segments, that means in above situation, part of bvec
>>> is merged into current segment, and the remainder is put
>>> into next segment.
>>>
>>> This patch prepares for support multipage bvec because
>>> it can be quite common to see this case and we should try
>>> to make front segments in full size.
>>>
>>> Signed-off-by: Ming Lei 
>>> ---
>>>  block/blk-merge.c | 54 
>>> +-
>>>  1 file changed, 49 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>>> index a476337a8ff4..42ceb89bc566 100644
>>> --- a/block/blk-merge.c
>>> +++ b/block/blk-merge.c
>>> @@ -109,6 +109,7 @@ static struct bio *blk_bio_segment_split(struct 
>>> request_queue *q,
>>> bool do_split = true;
>>> struct bio *new = NULL;
>>> const unsigned max_sectors = get_max_io_size(q, bio);
>>> +   unsigned advance = 0;
>>>  
>>> bio_for_each_segment(bv, bio, iter) {
>>> /*
>>> @@ -134,12 +135,32 @@ static struct bio *blk_bio_segment_split(struct 
>>> request_queue *q,
>>> }
>>>  
>>> if (bvprvp && blk_queue_cluster(q)) {
>>> -   if (seg_size + bv.bv_len > 
>>> queue_max_segment_size(q))
>>> -   goto new_segment;
>>> if (!BIOVEC_PHYS_MERGEABLE(bvprvp, &bv))
>>> goto new_segment;
>>> if (!BIOVEC_SEG_BOUNDARY(q, bvprvp, &bv))
>>> goto new_segment;
>>> +   if (seg_size + bv.bv_len > 
>>> queue_max_segment_size(q)) {
>>> +   /*
>>> +* On assumption is that initial value 
>>> of
>>> +* @seg_size(equals to bv.bv_len) won't 
>>> be
>>> +* bigger than max segment size, but 
>>> will
>>> +* becomes false after multipage bvec 
>>> comes.
>>> +*/
>>> +   advance = queue_max_segment_size(q) - 
>>> seg_size;
>>> +
>>> +   if (advance > 0) {
>>> +   seg_size += advance;
>>> +   sectors += advance >> 9;
>>> +   bv.bv_len -= advance;
>>> +   bv.bv_offset += advance;
>>> +   }
>>> +
>>> +   /*
>>> +* Still need to put remainder of 
>>> current
>>> +* bvec into a new segment.
>>> +*/
>>> +   goto new_segment;
>>> +   }
>>>  
>>> seg_size += bv.bv_len;
>>> bvprv = bv;
>>> @@ -161,6 +182,12 @@ static struct bio *blk_bio_segment_split(struct 
>>> request_queue *q,
>>> seg_size = bv.bv_len;
>>> sectors += bv.bv_len >> 9;
>>>  
>>> +   /* restore the bvec for iterator */
>>> +   if (advance) {
>>> +   bv.bv_len += advance;
>>> +   bv.bv_offset -= advance;
>>> +   advance = 0;
>>> +   }
>>> }
>>>  
>>> do_split = false;
>>> @@ -361,16 +388,29 @@ __blk_segment_map_sg(struct request_queue *q, 
>>> struct bio_vec *bvec,
>>>  {
>>>  
>>> int nbytes = bvec->bv_len;
>>> +   unsigned advance = 0;
>>>  
>>> if (*sg && *cluster) {
>>> -   if ((*sg)->length + nbytes > queue_max_segment_size(q))
>>> -   goto new_segment;
>>> -
>>> if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
>>> goto new_segment;
>>> if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
>>>

Re: [PATCH BUGFIX 1/1] block, bfq: compile group put for oom queue only if BFQ_GROUP_IOSCHED is set

2018-01-10 Thread Jens Axboe

On 1/10/18 7:58 AM, Paolo Valente wrote:
> Commit ("block, bfq: release oom-queue ref to root group on exit")
> added a missing put of the root bfq group for the oom queue. That put
> has to be, and can be, performed only if CONFIG_BFQ_GROUP_IOSCHED is
> defined: the function doing the put is even not defined at all if
> CONFIG_BFQ_GROUP_IOSCHED is not defined. But that commit makes that
> put be invoked regardless of whether CONFIG_BFQ_GROUP_IOSCHED is
> defined. This commit fixes this mistake, by making that invocation be
> compiled only if CONFIG_BFQ_GROUP_IOSCHED is actually defined.

I already fixed that up yesterday:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block&id=8abef10b3de1144cfe968f454946f13eb1ac3d0a

-- 
Jens Axboe

[PATCH] bdi: show error log when fail to create bdi debugfs entry

2018-01-10 Thread weiping zhang

bdi debugfs dir/file may create fail, add error log here.

Signed-off-by: weiping zhang 
---
 mm/backing-dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b5f940c..9117c21 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -885,7 +885,8 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
char *fmt, va_list args)
cgwb_bdi_register(bdi);
bdi->dev = dev;
 
-   bdi_debug_register(bdi, dev_name(dev));
+if (bdi_debug_register(bdi, dev_name(dev)))
+pr_warn("blkdev %s create bdi debugfs failed\n", 
dev_name(dev));
set_bit(WB_registered, &bdi->wb.state);
 
spin_lock_bh(&bdi_lock);
-- 
2.9.4

Re: unify the interface of the proportional-share policy in blkio/io

2018-01-10 Thread Tejun Heo

Hello,

On Tue, Jan 09, 2018 at 11:28:55PM +0100, Paolo Valente wrote:
> Yep.  So, do you guys think that our proposal may be ok?  We are
> waiting just for the green light to start implementing it.

Yeah, sounds reasonable to me.  Non-debug stats are already in blkcg
core anyway and I wouldn't worry too much about debug stats.  They can
have different names (they'll mean different things anyway) and better
be hidden behind debug option anyway.

Thanks.

-- 
tejun

[PATCH BUGFIX 0/1] bloc, bfq: fix of a bug introduced by my last patch

2018-01-10 Thread Paolo Valente

Hi Jens,
unfortunately, the patch of mine that you applied yesterday ("block,
bfq: release oom-queue ref to root group on exit"), does not compile
if CONFIG_BFQ_GROUP_IOSCHED is not defined. I forgot to test that patch
with that option disabled. Honestly, I'm more and more uncertain
about how useful (disabling) that option is ...

In any case, one question about the present patch: I didn't add any
hash for the commit that this patch fixes, because, if I'm not
mistaken, the hash that that commit will have in mainline is not yet
defined. I hope this is the right way to go.

Thanks,
Paolo

Paolo Valente (1):
  block, bfq: compile group put for oom queue only if BFQ_GROUP_IOSCHED
is set

 block/bfq-iosched.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--
2.15.1

[PATCH BUGFIX 1/1] block, bfq: compile group put for oom queue only if BFQ_GROUP_IOSCHED is set

2018-01-10 Thread Paolo Valente

Commit ("block, bfq: release oom-queue ref to root group on exit")
added a missing put of the root bfq group for the oom queue. That put
has to be, and can be, performed only if CONFIG_BFQ_GROUP_IOSCHED is
defined: the function doing the put is even not defined at all if
CONFIG_BFQ_GROUP_IOSCHED is not defined. But that commit makes that
put be invoked regardless of whether CONFIG_BFQ_GROUP_IOSCHED is
defined. This commit fixes this mistake, by making that invocation be
compiled only if CONFIG_BFQ_GROUP_IOSCHED is actually defined.

Fixes ("block, bfq: release oom-queue ref to root group on exit")
Reported-by: Jan Alexander Steffens 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c56a495af2e8..93a97a7fe519 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5015,10 +5015,9 @@ static void bfq_exit_queue(struct elevator_queue *e)
 
hrtimer_cancel(&bfqd->idle_slice_timer);
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
/* release oom-queue reference to root group */
bfqg_and_blkg_put(bfqd->root_group);
-
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
blkcg_deactivate_policy(bfqd->queue, &blkcg_policy_bfq);
 #else
spin_lock_irq(&bfqd->lock);
-- 
2.15.1

Re: [PATCH] block, bfq: fix occurrences of request finish method's old name

2018-01-10 Thread Jens Axboe

On 1/10/18 7:24 AM, Chiara Bruschi wrote:
> Hi Jens,
> have you had time to look into this?

I missed that Paolo had already reviewed it, I will queue it up.

-- 
Jens Axboe

Re: [PATCH] block, bfq: fix occurrences of request finish method's old name

2018-01-10 Thread Chiara Bruschi

Hi Jens,
have you had time to look into this?

Thank you,
Chiara Bruschi


On 12/18/17 5:21 PM, Chiara Bruschi wrote:

Commit '7b9e93616399' ("blk-mq-sched: unify request finished methods")
changed the old name of current bfq_finish_request method, but left it
unchanged elsewhere in the code (related comments, part of function
name bfq_put_rq_priv_body).

This commit fixes all occurrences of the old name of this method by
changing them into the current name.

Fixes: 7b9e93616399 ("blk-mq-sched: unify request finished methods")
Reviewed-by: Paolo Valente 
Signed-off-by: Federico Motta 
Signed-off-by: Chiara Bruschi 
---
 block/bfq-iosched.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bcb6d21..6da7f71 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3630,8 +3630,8 @@ static struct request *__bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
 }
 
 /*
-    * We exploit the put_rq_private hook to decrement
-    * rq_in_driver, but put_rq_private will not be
+    * We exploit the bfq_finish_request hook to decrement
+    * rq_in_driver, but bfq_finish_request will not be
  * invoked on this request. So, to avoid unbalance,
  * just start this request, without incrementing
  * rq_in_driver. As a negative consequence,
@@ -3640,14 +3640,14 @@ static struct request *__bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
  * bfq_schedule_dispatch to be invoked uselessly.
  *
  * As for implementing an exact solution, the
-    * put_request hook, if defined, is probably invoked
-    * also on this request. So, by exploiting this hook,
-    * we could 1) increment rq_in_driver here, and 2)
-    * decrement it in put_request. Such a solution would
-    * let the value of the counter be always accurate,
-    * but it would entail using an extra interface
-    * function. This cost seems higher than the benefit,
-    * being the frequency of non-elevator-private
+    * bfq_finish_request hook, if defined, is probably
+    * invoked also on this request. So, by exploiting
+    * this hook, we could 1) increment rq_in_driver here,
+    * and 2) decrement it in bfq_finish_request. Such a
+    * solution would let the value of the counter be
+    * always accurate, but it would entail using an extra
+    * interface function. This cost seems higher than the
+    * benefit, being the frequency of non-elevator-private
  * requests very low.
  */
 goto start_rq;
@@ -4482,7 +4482,7 @@ static void bfq_completed_request(struct bfq_queue *bfqq, 
struct bfq_data *bfqd)
 bfq_schedule_dispatch(bfqd);
 }
 
-static void bfq_put_rq_priv_body(struct bfq_queue *bfqq)
+static void bfq_finish_request_body(struct bfq_queue *bfqq)
 {
 bfqq->allocated--;
 
@@ -4512,7 +4512,7 @@ static void bfq_finish_request(struct request *rq)
 spin_lock_irqsave(&bfqd->lock, flags);
 
 bfq_completed_request(bfqq, bfqd);
-   bfq_put_rq_priv_body(bfqq);
+   bfq_finish_request_body(bfqq);
 
 spin_unlock_irqrestore(&bfqd->lock, flags);
 } else {
@@ -4533,7 +4533,7 @@ static void bfq_finish_request(struct request *rq)
 bfqg_stats_update_io_remove(bfqq_group(bfqq),
 rq->cmd_flags);
 }
-   bfq_put_rq_priv_body(bfqq);
+   bfq_finish_request_body(bfqq);
 }
 
 rq->elv.priv[0] = NULL;
-- 
2.1.4

Re: [for-4.16 PATCH v2 2/3] block: cope with gendisk's 'queue' being added later

2018-01-10 Thread Mike Snitzer

On Wed, Jan 10 2018 at  2:55am -0500,
Ming Lei  wrote:

> On Wed, Jan 10, 2018 at 12:21 PM, Mike Snitzer  wrote:
> > On Tue, Jan 09 2018 at 10:46pm -0500,
> > Ming Lei  wrote:
> >
> >> Another related issue is that blk-mq debugfs can't work yet for dm-mpath.
> >
> > Not sure how that is a related issue to this thread.
> > But can you expand on that?  I'm not familiar with "blk-mq debugfs".
> > Any pointer to code that activates it?  Could be that my changes make it
> > work now...
> 
> Hi,
> 
> blk_mq_debugfs_register() is called in blk_register_queue() from add_disk(),
> and blk_mq_debugfs_register() need queue's information to setup mq debugfs
> stuff.
> 
> But your patch does fix this issue, cool!

Great, thought it might.
 
> >> So I am just wondering if it is possible to follow the usual way to add 
> >> disk
> >> after setting up queue for DM? If it is possible, it may avoid lots of 
> >> trouble
> >> for us.
> >
> > As I explained in the header (quoted above actually) it is _not_
> > possible.  It is the gendisk that enables the device stack to be
> > constructed (at least how DM's stacking is currently implemented with
> > bd_link_disk_holder() and taking references on devices found in a DM
> > device's table).  And from that gendisk I can then walk the devices to
> > establish the request_queue as needed, its stacked limits, etc.
> >
> > The approach I've taken in these patches is the closest I've gotten to
> > make DM _much_ more sane about how its request_queue is established.
> > But its still off the beaten path due to needing "block: cope with
> > gendisk's 'queue' being added later".
> 
> OK, thanks for your explanation.
> 
> After taking close look at your patches, I think this approach is clean.
> But there are still corner cases which need to be addressed:
> 
> 1) in the following disk attributes, queue is referenced, so we have
> to add check in their show/store operations since the attributes
> are shown up just after disk is added.
> 
>  disk_alignment_offset_show
>  disk_discard_alignment_show
>  part_timeout_show/part_timeout_store
> 
> 2) same issue on /proc/diskstats: see diskstats_show()
> 
> 3) in IO path, looks we already checked disk->queue in
> generic_make_request_checks(), so it should be fine, and you set
> disk->queue after the queue is fully initialized in the 3rd patch.

I'll work through these and see what I can do.

Will likely deal with each independent of the 2/3 patch (otherwise
if I just keep folding changes in to the original patch review will get
too hard).

So hopefully I'll have a v3 to share later today.

Thanks,
Mike

Re: [for-4.16 PATCH v2 2/3] block: cope with gendisk's 'queue' being added later

2018-01-10 Thread Hannes Reinecke

On 01/10/2018 09:32 AM, Christoph Hellwig wrote:
> On Tue, Jan 09, 2018 at 09:41:03PM -0500, Mike Snitzer wrote:
>> Since I can remember DM has forced the block layer to allow the
>> allocation and initialization of the request_queue to be distinct
>> operations.  Reason for this was block/genhd.c:add_disk() has required
>> that the request_queue (and associated bdi) be tied to the gendisk
>> before add_disk() is called -- because add_disk() also deals with
>> exposing the request_queue via blk_register_queue().
> 
> Hmm.  I don't even know how that could be safe given that the disk
> is live and visible to userspace once added..
> 
We're getting an additional 'change' event once the tables are loaded
and setup properly. That's not a problem, and in fact the DASD driver
has the very same issue/feature/design.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

[PATCH] bcache: fix for allocator and register thread race

2018-01-10 Thread tang . junhui

From: Tang Junhui 

After long time run of random small IO writing,
reboot the machine, and after the machine power on,
bcache got stuck, the stack is:
[root@ceph153 ~]# cat /proc/2510/task/*/stack
[] closure_sync+0x25/0x90 [bcache]
[] bch_journal+0x118/0x2b0 [bcache]
[] bch_journal_meta+0x47/0x70 [bcache]
[] bch_prio_write+0x237/0x340 [bcache]
[] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[] kthread+0xcf/0xe0
[] ret_from_fork+0x58/0x90
[] 0x
[root@ceph153 ~]# cat /proc/2038/task/*/stack
[] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[] bch_btree_insert+0xf1/0x170 [bcache]
[] bch_journal_replay+0x13f/0x230 [bcache]
[] run_cache_set+0x79a/0x7c2 [bcache]
[] register_bcache+0xd48/0x1310 [bcache]
[] kobj_attr_store+0xf/0x20
[] sysfs_write_file+0xc6/0x140
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1
The stack shows the register thread and allocator thread
were getting stuck when registering cache device.

we reboot the machine several times, the issue always
exsit in this machine.

We debug the code, and found the call trace as bellow:
register_bcache()
  ==>run_cache_set()
 ==>bch_journal_replay()
==>bch_btree_insert()
   ==>__bch_btree_map_nodes()
  ==>btree_insert_fn()
 ==>btree_split() //node need split
==>btree_check_reserve()
In btree_check_reserve(), It will check if there is enough buckets
of RESERVE_BTREE type, since allocator thread did not work yet, so
no buckets of RESERVE_BTREE type allocated, so the register thread
waits on c->btree_cache_wait, and goes to sleep.

Then the allocator thread initialized, and goes to work,
the call trace is bellow:
bch_allocator_thread()
  ==>bch_prio_write()
 ==>bch_journal_meta()
==>bch_journal()
   ==>journal_wait_for_write()
In journal_wait_for_write(), It will check if journal is full by
journal_full(), but the long time random small IO writing
causes the exhaustion of journal buckets(journal.blocks_free=0),
In order to release the journal buckets,
the allocator calls btree_flush_write() to flush keys to
btree nodes, and waits on c->journal.wait until btree nodes writing over
or there has already some journal buckets space, then the allocator
thread goes to sleep. but in btree_flush_write(), since
bch_journal_replay() is not finished, so no btree nodes have journal
(condition "if (btree_current_write(b)->journal)" never satisfied),
so we got no btree node to flush, no journal bucket released,
and allocator sleep all the times.

Through the above analysis, we can see that:
1) Register thread wait for allocator thread to allocate buckets of
   RESERVE_BTREE type;
2) Alloctor thread wait for register thread to replay journal, so it
   can flush btree nodes and get journal bucket.
   then they are all got stuck by waiting for each other.

Hua Rui provided a patch for me, by allocating some buckets of
RESERVE_BTREE type in advance, so the register thread can get bucket
when btree node splitting and no need to waiting for the allocator thread.
I tested it, it has effect, and register thread run a step forward, but
finally are still got stuck, the reason is only 8 bucket of RESERVE_BTREE
type were allocated, and in bch_journal_replay(), after 2 btree nodes
splitting, only 4 bucket of RESERVE_BTREE type left, then
btree_check_reserve() is not satisfied anymore, so it goes to sleep again,
and in the same time, alloctor thread did not flush enough btree nodes to
release a journal bucket, so they all got stuck again.

So we need to allocate more buckets of RESERVE_BTREE type in advance,
but how much is enough?  By experience and test, I think it should be
as much as journal buckets. Then I modify the code as this patch,
and test in the machine, and it works.

This patch modified base on Hua Rui’s patch, and allocate more buckets
of RESERVE_BTREE type in advance to avoid register thread and allocate
thread going to wait for each other.

Signed-off-by: Hua Rui 
Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/btree.c |  9 ++---
 drivers/md/bcache/super.c | 12 +++-
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 658c54b..d222a9b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1865,14 +1865,17 @@ void bch_initial_gc_finish(struct cache_set *c)
 */
for_each_cache(ca, c, i) {
for_each_bucket(b, ca) {
-   if (fifo_full(&ca->free[RESERVE_PRIO]))
+   if (fifo_full(&ca->free[RESERVE_PRIO]) &&
+   fifo_full(&ca->free[RESERVE_BTREE]))
break;
 
if (bch_can_invalidate_bucket(ca, b) &&
!GC_MARK(b)) {
__bch_invalidate_one_bucket(ca, b);
-   fifo_push(&ca->free[RESERVE_PRIO],
-

Re: [for-4.16 PATCH v2 2/3] block: cope with gendisk's 'queue' being added later

2018-01-10 Thread Christoph Hellwig

On Tue, Jan 09, 2018 at 09:41:03PM -0500, Mike Snitzer wrote:
> Since I can remember DM has forced the block layer to allow the
> allocation and initialization of the request_queue to be distinct
> operations.  Reason for this was block/genhd.c:add_disk() has required
> that the request_queue (and associated bdi) be tied to the gendisk
> before add_disk() is called -- because add_disk() also deals with
> exposing the request_queue via blk_register_queue().

Hmm.  I don't even know how that could be safe given that the disk
is live and visible to userspace once added..

Re: [for-4.16 PATCH v2 1/3] block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN

2018-01-10 Thread Christoph Hellwig

It's in fact completely harmless :)  But not even calling it
obviously is just as fine.

Re: [PATCHv2 0/5] nvme/dm failover unification

2018-01-10 Thread Christoph Hellwig

The whole series looks fine to me:

Reviewed-by: Christoph Hellwig 

Jens, do you want me to apply this to the nvme tree, or pick it up
directly?

Re: [dm-devel] [PATCHv2 5/5] dm mpath: Use blk_path_error

2018-01-10 Thread Johannes Thumshirn

Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [dm-devel] [PATCHv2 4/5] nvme/multipath: Use blk_path_error

2018-01-10 Thread Johannes Thumshirn

Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [dm-devel] [PATCHv2 3/5] block: Provide blk_status_t decoding for path errors

2018-01-10 Thread Johannes Thumshirn

Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [dm-devel] [PATCHv2 2/5] nvme/multipath: Consult blk_status_t for failover

2018-01-10 Thread Johannes Thumshirn

Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

86 matches

Mail list logo