Re: [PATCH -next] lightnvm: pblk: fix error handling of pblk_lines_init()

2018-09-25 Thread Hans Holmberg
On Wed, Sep 26, 2018 at 7:47 AM Wei Yongjun  wrote:
>
> In the "too many bad blocks" error handling case, we should release all
> the allocated resources instead of returning directly; otherwise it will
> cause a memory leak.
>
> Fixes: 2deeefc02dff ("lightnvm: pblk: fail gracefully on line alloc. failure")
> Signed-off-by: Wei Yongjun 
> ---
>  drivers/lightnvm/pblk-init.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
> index 537e98f..a26f4e6 100644
> --- a/drivers/lightnvm/pblk-init.c
> +++ b/drivers/lightnvm/pblk-init.c
> @@ -1083,7 +1083,8 @@ static int pblk_lines_init(struct pblk *pblk)
>
> if (!nr_free_chks) {
> pblk_err(pblk, "too many bad blocks prevent for sane 
> instance\n");
> -   return -EINTR;
> +   ret = -EINTR;
> +   goto fail_free_lines;
> }
>
> pblk_set_provision(pblk, nr_free_chks);
>

Looks good to me.

Reviewed-by: Hans Holmberg 


[PATCH -next] lightnvm: pblk: fix error handling of pblk_lines_init()

2018-09-25 Thread Wei Yongjun
In the "too many bad blocks" error handling case, we should release all
the allocated resources instead of returning directly; otherwise it will
cause a memory leak.

Fixes: 2deeefc02dff ("lightnvm: pblk: fail gracefully on line alloc. failure")
Signed-off-by: Wei Yongjun 
---
 drivers/lightnvm/pblk-init.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 537e98f..a26f4e6 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -1083,7 +1083,8 @@ static int pblk_lines_init(struct pblk *pblk)
 
if (!nr_free_chks) {
pblk_err(pblk, "too many bad blocks prevent for sane 
instance\n");
-   return -EINTR;
+   ret = -EINTR;
+   goto fail_free_lines;
}
 
pblk_set_provision(pblk, nr_free_chks);



Re: [PATCH] bcache: use REQ_PRIO to indicate bio for metadata

2018-09-25 Thread Coly Li



On 9/26/18 12:32 AM, Adam Manzanares wrote:


On 9/25/18 9:12 AM, Coly Li wrote:

In cached_dev_cache_miss() and check_should_bypass(), REQ_META is used
to check whether a bio is a metadata request. However, REQ_META is only
meant for blktrace; the correct flag is REQ_PRIO. REQ_PRIO means the
bio should take priority over other bios, and it is frequently used to
indicate metadata I/O in file system code.

This patch replaces REQ_META with the correct flag, REQ_PRIO.

CC Adam Manzanares because he explained to me what REQ_PRIO is for.

Sorry for the confusion, I was talking about using the bi_ioprio field
to pass ioprio information down to devices that support prioritized
commands. I haven't used the REQ_PRIO flag in the work that I have done.


Hi Adam,

Aha, I misunderstood you. After a 'git blame' I realized REQ_PRIO was
introduced by Christoph Hellwig:

    Add a new REQ_PRIO to let requests preempt others in the cfq I/O
    scheduler, and leave REQ_META purely for marking requests as metadata
    in blktrace.

From the log of commit 65299a3b788b ("block: separate priority boosting 
from REQ_META") it seems this patch still uses REQ_PRIO in the correct 
way. And thank you for clarifying this :-)


Coly Li


Signed-off-by: Coly Li 
Cc: Adam Manzanares 
---
   drivers/md/bcache/request.c | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 51be355a3309..13d3355a90c0 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -395,7 +395,7 @@ static bool check_should_bypass(struct cached_dev *dc, 
struct bio *bio)
 * unless the read-ahead request is for metadata (eg, for gfs2).
 */
if (bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) &&
-   !(bio->bi_opf & REQ_META))
+   !(bio->bi_opf & REQ_PRIO))
goto skip;
   
   	if (bio->bi_iter.bi_sector & (c->sb.block_size - 1) ||

@@ -877,7 +877,7 @@ static int cached_dev_cache_miss(struct btree *b, struct 
search *s,
}
   
   	if (!(bio->bi_opf & REQ_RAHEAD) &&

-   !(bio->bi_opf & REQ_META) &&
+   !(bio->bi_opf & REQ_PRIO) &&
s->iop.c->gc_stats.in_use < CUTOFF_CACHE_READA)
reada = min_t(sector_t, dc->readahead >> 9,
  get_capacity(bio->bi_disk) - bio_end_sector(bio));


Re: [PATCH v10 0/8] blk-mq: Implement runtime power management

2018-09-25 Thread Ming Lei
On Fri, Sep 21, 2018 at 01:31:14PM -0700, Bart Van Assche wrote:
> Hello Jens,
> 
> One of the pieces that is missing before blk-mq can be made the default is
> implementing runtime power management support for blk-mq.  This patch series
> not only implements runtime power management for blk-mq but also fixes a
> starvation issue in the power management code for the legacy block
> layer. Please consider this patch series for the upstream kernel.
> 
> Thanks,
> 
> Bart.
> 
> Changes compared to v9:
> - Left out the patches that document the functions that iterate over requests
>   and also the patch that introduces blk_mq_queue_rq_iter().
> - Simplified blk_pre_runtime_suspend(): left out the check whether no requests
>   are in progress.
> - Fixed the race between blk_queue_enter(), queue freezing and runtime power
>   management that Ming had identified.
> - Added a new patch that introduces percpu_ref_resurrect().
> 
> Changes compared to v8:
> - Fixed the race that was reported by Jianchao.
> - Fixed another spelling issue in a source code comment.
> 
> Changes compared to v7:
> - Addressed Jianchao's feedback about patch "Make blk_get_request() block for
>   non-PM requests while suspended".
> - Added two new patches - one that documents the functions that iterate over
>   requests and one that introduces a new function that iterates over all
>   requests associated with a queue.
> 
> Changes compared to v6:
> - Left out the patches that split RQF_PREEMPT in three flags.
> - Left out the patch that introduces the SCSI device state SDEV_SUSPENDED.
> - Left out the patch that introduces blk_pm_runtime_exit().
> - Restored the patch that changes the PREEMPT_ONLY flag into a counter.
> 
> Changes compared to v5:
> - Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain
>   validation.
> - Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain
>   validation.
> - Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain
>   validation, use that state for domain validation only and introduce a new
>   state for runtime suspend, namely SDEV_QUIESCE.
> - Reallow system suspend during SCSI domain validation.
> - Moved the runtime resume call from the request allocation code into
>   blk_queue_enter().
> - Instead of relying on q_usage_counter, iterate over the tag set to determine
>   whether or not any requests are in flight.
> 
> Changes compared to v4:
> - Dropped the patches "Give RQF_PREEMPT back its original meaning" and
>   "Serialize queue freezing and blk_pre_runtime_suspend()".
> - Replaced "percpu_ref_read()" with "percpu_is_in_use()".
> - Inserted pm_request_resume() calls in the block layer request allocation
>   code such that the context that submits a request no longer has to call
>   pm_runtime_get().
> 
> Changes compared to v3:
> - Avoid adverse interactions between system-wide suspend/resume and runtime
>   power management by changing the PREEMPT_ONLY flag into a counter.
> - Give RQF_PREEMPT back its original meaning, namely that it is only set for
>   ide_preempt requests.
> - Remove the flag BLK_MQ_REQ_PREEMPT.
> - Removed the pm_request_resume() call.
> 
> Changes compared to v2:
> - Fixed the build for CONFIG_BLOCK=n.
> - Added a patch that introduces percpu_ref_read() in the percpu-counter code.
> - Added a patch that makes it easier to detect missing pm_runtime_get*() 
> calls.
> - Addressed Jianchao's feedback including the comment about runtime overhead
>   of switching a per-cpu counter to atomic mode.
> 
> Changes compared to v1:
> - Moved the runtime power management code into a separate file.
> - Addressed Ming's feedback.
> 
> Bart Van Assche (8):
>   block: Move power management code into a new source file
>   block, scsi: Change the preempt-only flag into a counter
>   block: Split blk_pm_add_request() and blk_pm_put_request()
>   block: Schedule runtime resume earlier
>   percpu-refcount: Introduce percpu_ref_resurrect()
>   block: Allow unfreezing of a queue while requests are in progress
>   block: Make blk_get_request() block for non-PM requests while
> suspended
>   blk-mq: Enable support for runtime power management
> 
>  block/Kconfig   |   3 +
>  block/Makefile  |   1 +
>  block/blk-core.c| 270 
>  block/blk-mq-debugfs.c  |  10 +-
>  block/blk-mq.c  |   4 +-
>  block/blk-pm.c  | 216 +
>  block/blk-pm.h  |  69 
>  block/elevator.c|  22 +--
>  drivers/scsi/scsi_lib.c |  11 +-
>  drivers/scsi/scsi_pm.c  |   1 +
>  drivers/scsi/sd.c   |   1 +
>  drivers/scsi/sr.c   |   1 +
>  include/linux/blk-pm.h  |  24 +++
>  include/linux/blkdev.h  |  37 ++---
>  include/linux/percpu-refcount.h |   1 +
>  lib/percpu-refcount.c   |  28 +++-
>  16 files changed, 401 insertions(+), 298 deletio

Re: [PATCHv2] blk-mq: Allow blocking queue tag iter callbacks

2018-09-25 Thread Jens Axboe
On 9/25/18 10:36 AM, Keith Busch wrote:
> A recent commit runs tag iterator callbacks under the rcu read lock,
> but existing callbacks do not satisfy the non-blocking requirement.
> The commit intended to prevent an iterator from accessing a queue that's
> being modified. This patch fixes the original issue by taking a queue
> reference instead of reading it, which allows callbacks to make blocking
> calls.

Thanks Keith, queued up for 4.19.

-- 
Jens Axboe



Re: [PATCHv2] blk-mq: Allow blocking queue tag iter callbacks

2018-09-25 Thread jianchao.wang
Hi Keith

On 09/26/2018 12:36 AM, Keith Busch wrote:
> A recent commit runs tag iterator callbacks under the rcu read lock,
> but existing callbacks do not satisfy the non-blocking requirement.
> The commit intended to prevent an iterator from accessing a queue that's
> being modified. This patch fixes the original issue by taking a queue
> reference instead of reading it, which allows callbacks to make blocking
> calls.
> 
> Fixes: f5e4d6357 ("blk-mq: sync the update nr_hw_queues with 
> blk_mq_queue_tag_busy_iter")
> Cc: Jianchao Wang 
> Signed-off-by: Keith Busch 
> ---
> v1 -> v2:
> 
>   Leave the iterator interface as-is, making this change transparent
>   to the callers. This will increment the queue usage counter in the
>   timeout path twice, which is harmless.
> 

This is OK to me.
Acked-by: Jianchao Wang 

>   Changelog.
> 
>  block/blk-mq-tag.c | 13 -
>  1 file changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 94e1ed667b6e..41317c50a446 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -322,16 +322,11 @@ void blk_mq_queue_tag_busy_iter(struct request_queue 
> *q, busy_iter_fn *fn,
>  
>   /*
>* __blk_mq_update_nr_hw_queues will update the nr_hw_queues and
> -  * queue_hw_ctx after freeze the queue. So we could use q_usage_counter
> -  * to avoid race with it. __blk_mq_update_nr_hw_queues will users
> -  * synchronize_rcu to ensure all of the users go out of the critical
> -  * section below and see zeroed q_usage_counter.
> +  * queue_hw_ctx after freeze the queue, so we use q_usage_counter
> +  * to avoid race with it.
>*/
> - rcu_read_lock();
> - if (percpu_ref_is_zero(&q->q_usage_counter)) {
> - rcu_read_unlock();
> + if (!percpu_ref_tryget(&q->q_usage_counter))
>   return;
> - }
>  
>   queue_for_each_hw_ctx(q, hctx, i) {
>   struct blk_mq_tags *tags = hctx->tags;
> @@ -347,7 +342,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
> busy_iter_fn *fn,
>   bt_for_each(hctx, &tags->breserved_tags, fn, priv, 
> true);
>   bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
>   }
> - rcu_read_unlock();
> + blk_queue_exit(q);
>  }
>  
>  static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
> 


Re: [PATCH v10 2/8] block, scsi: Change the preempt-only flag into a counter

2018-09-25 Thread Martin K. Petersen


Bart,

> The RQF_PREEMPT flag is used for three purposes:
> - In the SCSI core, for making sure that power management requests
>   are executed even if a device is in the "quiesced" state.
> - For domain validation by SCSI drivers that use the parallel port.
> - In the IDE driver, for IDE preempt requests.

> Rename "preempt-only" into "pm-only" because the primary purpose of
> this mode is power management. Since the power management core may but
> does not have to resume a runtime suspended device before performing
> system-wide suspend and since a later patch will set "pm-only" mode as
> long as a block device is runtime suspended, make it possible to set
> "pm-only" mode from more than one context. Since with this change
> scsi_device_quiesce() is no longer idempotent, make that function
> return early if it is called for a quiesced queue.

The SCSI pieces look OK to me...

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: clean up physical merging helpers V2

2018-09-25 Thread Jens Axboe
On 9/25/18 2:09 PM, Christoph Hellwig wrote:
> On Mon, Sep 24, 2018 at 12:35:08PM -0600, Jens Axboe wrote:
>> Looks good to me - #1 should just go to the arm guys separately, there's
>> no point in me carrying this arch patch.
> 
> It would require a respin of all the Xen patches and is rather related.
> But if you absolutely don't want it I can drop it from the resend of
> those.

I don't feel that strongly about it, and it's dead block code. If it makes
your life easier, I can add it.

-- 
Jens Axboe



Re: clean up physical merging helpers V2

2018-09-25 Thread Christoph Hellwig
On Mon, Sep 24, 2018 at 12:35:08PM -0600, Jens Axboe wrote:
> Looks good to me - #1 should just go to the arm guys separately, there's
> no point in me carrying this arch patch.

It would require a respin of all the Xen patches and is rather related.
But if you absolutely don't want it I can drop it from the resend of
those.


Re: [PATCH 1/4] block/loop: Don't grab "struct file" for vfs_getattr() operation.

2018-09-25 Thread Omar Sandoval
On Wed, Sep 26, 2018 at 12:26:46AM +0900, Tetsuo Handa wrote:
> vfs_getattr() needs "struct path" rather than "struct file".
> Let's use path_get()/path_put() rather than get_file()/fput().

Reviewed-by: Omar Sandoval 

> Signed-off-by: Tetsuo Handa 
> Reviewed-by: Jan Kara 
> ---
>  drivers/block/loop.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index abad6d1..c2745e6 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1206,7 +1206,7 @@ static int loop_clr_fd(struct loop_device *lo)
>  static int
>  loop_get_status(struct loop_device *lo, struct loop_info64 *info)
>  {
> - struct file *file;
> + struct path path;
>   struct kstat stat;
>   int ret;
>  
> @@ -1231,16 +1231,16 @@ static int loop_clr_fd(struct loop_device *lo)
>   }
>  
>   /* Drop lo_ctl_mutex while we call into the filesystem. */
> - file = get_file(lo->lo_backing_file);
> + path = lo->lo_backing_file->f_path;
> + path_get(&path);
>   mutex_unlock(&lo->lo_ctl_mutex);
> - ret = vfs_getattr(&file->f_path, &stat, STATX_INO,
> -   AT_STATX_SYNC_AS_STAT);
> + ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
>   if (!ret) {
>   info->lo_device = huge_encode_dev(stat.dev);
>   info->lo_inode = stat.ino;
>   info->lo_rdevice = huge_encode_dev(stat.rdev);
>   }
> - fput(file);
> + path_put(&path);
>   return ret;
>  }
>  
> -- 
> 1.8.3.1
> 


[PATCHv2] blk-mq: Allow blocking queue tag iter callbacks

2018-09-25 Thread Keith Busch
A recent commit runs tag iterator callbacks under the rcu read lock,
but existing callbacks do not satisfy the non-blocking requirement.
The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference instead of reading it, which allows callbacks to make blocking
calls.

Fixes: f5e4d6357 ("blk-mq: sync the update nr_hw_queues with 
blk_mq_queue_tag_busy_iter")
Cc: Jianchao Wang 
Signed-off-by: Keith Busch 
---
v1 -> v2:

  Leave the iterator interface as-is, making this change transparent
  to the callers. This will increment the queue usage counter in the
  timeout path twice, which is harmless.

  Changelog.

 block/blk-mq-tag.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 94e1ed667b6e..41317c50a446 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -322,16 +322,11 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
 
/*
 * __blk_mq_update_nr_hw_queues will update the nr_hw_queues and
-* queue_hw_ctx after freeze the queue. So we could use q_usage_counter
-* to avoid race with it. __blk_mq_update_nr_hw_queues will users
-* synchronize_rcu to ensure all of the users go out of the critical
-* section below and see zeroed q_usage_counter.
+* queue_hw_ctx after freeze the queue, so we use q_usage_counter
+* to avoid race with it.
 */
-   rcu_read_lock();
-   if (percpu_ref_is_zero(&q->q_usage_counter)) {
-   rcu_read_unlock();
+   if (!percpu_ref_tryget(&q->q_usage_counter))
return;
-   }
 
queue_for_each_hw_ctx(q, hctx, i) {
struct blk_mq_tags *tags = hctx->tags;
@@ -347,7 +342,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
bt_for_each(hctx, &tags->breserved_tags, fn, priv, 
true);
bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
}
-   rcu_read_unlock();
+   blk_queue_exit(q);
 }
 
 static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
-- 
2.14.4



Re: [PATCH] bcache: use REQ_PRIO to indicate bio for metadata

2018-09-25 Thread Adam Manzanares


On 9/25/18 9:12 AM, Coly Li wrote:
> In cached_dev_cache_miss() and check_should_bypass(), REQ_META is used
> to check whether a bio is a metadata request. However, REQ_META is only
> meant for blktrace; the correct flag is REQ_PRIO. REQ_PRIO means the
> bio should take priority over other bios, and it is frequently used to
> indicate metadata I/O in file system code.
> 
> This patch replaces REQ_META with the correct flag, REQ_PRIO.
> 
> CC Adam Manzanares because he explained to me what REQ_PRIO is for.

Sorry for the confusion, I was talking about using the bi_ioprio field 
to pass ioprio information down to devices that support prioritized 
commands. I haven't used the REQ_PRIO flag in the work that I have done.

> 
> Signed-off-by: Coly Li 
> Cc: Adam Manzanares 
> ---
>   drivers/md/bcache/request.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 51be355a3309..13d3355a90c0 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -395,7 +395,7 @@ static bool check_should_bypass(struct cached_dev *dc, 
> struct bio *bio)
>* unless the read-ahead request is for metadata (eg, for gfs2).
>*/
>   if (bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) &&
> - !(bio->bi_opf & REQ_META))
> + !(bio->bi_opf & REQ_PRIO))
>   goto skip;
>   
>   if (bio->bi_iter.bi_sector & (c->sb.block_size - 1) ||
> @@ -877,7 +877,7 @@ static int cached_dev_cache_miss(struct btree *b, struct 
> search *s,
>   }
>   
>   if (!(bio->bi_opf & REQ_RAHEAD) &&
> - !(bio->bi_opf & REQ_META) &&
> + !(bio->bi_opf & REQ_PRIO) &&
>   s->iop.c->gc_stats.in_use < CUTOFF_CACHE_READA)
>   reada = min_t(sector_t, dc->readahead >> 9,
> get_capacity(bio->bi_disk) - bio_end_sector(bio));
> 

[PATCH] bcache: use REQ_PRIO to indicate bio for metadata

2018-09-25 Thread Coly Li
In cached_dev_cache_miss() and check_should_bypass(), REQ_META is used
to check whether a bio is a metadata request. However, REQ_META is only
meant for blktrace; the correct flag is REQ_PRIO. REQ_PRIO means the
bio should take priority over other bios, and it is frequently used to
indicate metadata I/O in file system code.

This patch replaces REQ_META with the correct flag, REQ_PRIO.

CC Adam Manzanares because he explained to me what REQ_PRIO is for.

Signed-off-by: Coly Li 
Cc: Adam Manzanares 
---
 drivers/md/bcache/request.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 51be355a3309..13d3355a90c0 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -395,7 +395,7 @@ static bool check_should_bypass(struct cached_dev *dc, 
struct bio *bio)
 * unless the read-ahead request is for metadata (eg, for gfs2).
 */
if (bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) &&
-   !(bio->bi_opf & REQ_META))
+   !(bio->bi_opf & REQ_PRIO))
goto skip;
 
if (bio->bi_iter.bi_sector & (c->sb.block_size - 1) ||
@@ -877,7 +877,7 @@ static int cached_dev_cache_miss(struct btree *b, struct 
search *s,
}
 
if (!(bio->bi_opf & REQ_RAHEAD) &&
-   !(bio->bi_opf & REQ_META) &&
+   !(bio->bi_opf & REQ_PRIO) &&
s->iop.c->gc_stats.in_use < CUTOFF_CACHE_READA)
reada = min_t(sector_t, dc->readahead >> 9,
  get_capacity(bio->bi_disk) - bio_end_sector(bio));
-- 
2.19.0



Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks

2018-09-25 Thread Keith Busch
On Tue, Sep 25, 2018 at 08:14:45AM -0700, Bart Van Assche wrote:
> On Tue, 2018-09-25 at 08:39 -0600, Keith Busch wrote:
> > On Tue, Sep 25, 2018 at 10:39:46AM +0800, jianchao.wang wrote:
> > > But the issue is the left part of blk_mq_timeout_work is moved out of 
> > > protection of q refcount.
> > 
> > I'm not sure what you mean by "left part". The only part that isn't
> > outside the reference with this patch is the part Bart pointed out.
> > 
> > This looks like it may be fixed by either moving the refcount back up a
> > level to all the callers of blk_mq_queue_tag_busy_iter, or add
> > cancel_work_sync(&q->timeout_work) to __blk_mq_update_nr_hw_queues after
> > the freeze.
> 
> Hi Keith,
> 
> How about applying the following (untested) patch on top of your patch?

Yep, this works. I can fold in a v2.
 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 019f9b169887..099e203b5213 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -851,6 +851,9 @@ static void blk_mq_timeout_work(struct work_struct *work)
>   if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
>   return;
>  
> + if (!percpu_ref_tryget(&q->q_usage_counter))
> + return;
> +
>   if (next != 0) {
>   mod_timer(&q->timeout, next);
>   } else {
> @@ -866,6 +869,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
>   blk_mq_tag_idle(hctx);
>   }
>   }
> + blk_queue_exit(q);
>  }
> 
> Thanks,
> 
> Bart.


[PATCH 4/4] block/loop: Fix circular locking dependency at blkdev_reread_part().

2018-09-25 Thread Tetsuo Handa
syzbot is reporting a circular locking dependency between bdev->bd_mutex
and lo->lo_ctl_mutex [1], which is caused by calling blkdev_reread_part()
with the lock held. We need to drop lo->lo_ctl_mutex in order to fix it.

This patch fixes it by combining loop_index_mutex and loop_ctl_mutex into
loop_mutex, and by releasing loop_mutex before calling blkdev_reread_part(),
fput(), or path_put(), and before leaving ioctl().

The rule is that the current thread calls lock_loop() before accessing
"struct loop_device", and no longer accesses "struct loop_device" after
unlock_loop() is called.

Since syzbot is reporting various bugs [2] where a race in the loop module
is suspected, let's check whether this patch affects these bugs too.

[1] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889
[2] 
https://syzkaller.appspot.com/bug?id=b3c7e1440aa8ece16bf557dbac427fdff1dad9d6

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 

---
 drivers/block/loop.c | 187 ---
 1 file changed, 101 insertions(+), 86 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4b05a27..04389bb 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -84,12 +84,50 @@
 #include 
 
 static DEFINE_IDR(loop_index_idr);
-static DEFINE_MUTEX(loop_index_mutex);
-static DEFINE_MUTEX(loop_ctl_mutex);
+static DEFINE_MUTEX(loop_mutex);
+static void *loop_mutex_owner; /* == __mutex_owner(&loop_mutex) */
 
 static int max_part;
 static int part_shift;
 
+/*
+ * lock_loop - Lock loop_mutex.
+ */
+static void lock_loop(void)
+{
+   mutex_lock(&loop_mutex);
+   loop_mutex_owner = current;
+}
+
+/*
+ * lock_loop_killable - Lock loop_mutex unless killed.
+ */
+static int lock_loop_killable(void)
+{
+   int err = mutex_lock_killable(&loop_mutex);
+
+   if (err)
+   return err;
+   loop_mutex_owner = current;
+   return 0;
+}
+
+/*
+ * unlock_loop - Unlock loop_mutex as needed.
+ *
+ * Explicitly call this function before calling fput() or blkdev_reread_part()
+ * in order to avoid circular lock dependency. After this function is called,
+ * current thread is no longer allowed to access "struct loop_device" memory,
+ * for another thread would access that memory as soon as loop_mutex is held.
+ */
+static void unlock_loop(void)
+{
+   if (loop_mutex_owner == current) {
+   loop_mutex_owner = NULL;
+   mutex_unlock(&loop_mutex);
+   }
+}
+
 static int transfer_xor(struct loop_device *lo, int cmd,
struct page *raw_page, unsigned raw_off,
struct page *loop_page, unsigned loop_off,
@@ -637,6 +675,7 @@ static void loop_reread_partitions(struct loop_device *lo,
const int count = atomic_read(&lo->lo_refcnt);
 
memcpy(filename, lo->lo_file_name, sizeof(filename));
+   unlock_loop();
 
/*
 * bd_mutex has been held already in release path, so don't
@@ -699,6 +738,7 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
struct file *file, *old_file;
int error;
 
+   lockdep_assert_held(&loop_mutex);
error = -ENXIO;
if (lo->lo_state != Lo_bound)
goto out;
@@ -737,10 +777,12 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
 
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
loop_reread_partitions(lo, bdev);
+   unlock_loop();
fput(old_file);
return 0;
 
  out_putf:
+   unlock_loop();
fput(file);
  out:
return error;
@@ -918,6 +960,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
int error;
loff_t  size;
 
+   lockdep_assert_held(&loop_mutex);
/* This is safe, since we have a reference from open(). */
__module_get(THIS_MODULE);
 
@@ -991,6 +1034,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
mode,
return 0;
 
  out_putf:
+   unlock_loop();
fput(file);
  out:
/* This is safe: open() is still holding a reference. */
@@ -1042,6 +1086,7 @@ static int loop_clr_fd(struct loop_device *lo)
struct block_device *bdev = lo->lo_device;
bool reread;
 
+   lockdep_assert_held(&loop_mutex);
if (lo->lo_state != Lo_bound)
return -ENXIO;
 
@@ -1057,7 +1102,6 @@ static int loop_clr_fd(struct loop_device *lo)
 */
if (atomic_read(&lo->lo_refcnt) > 1) {
lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&loop_ctl_mutex);
return 0;
}
 
@@ -,13 +1155,7 @@ static int loop_clr_fd(struct loop_device *lo)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
if (reread)
loop_reread_partitions(lo, bdev);
-   mutex_unlock(&loop_ctl_mutex);
-   /*
-* Need not hold loop_ctl_mutex to fput backing file.
-* 

[PATCH 3/4] block/loop: Reorganize loop_reread_partitions() callers.

2018-09-25 Thread Tetsuo Handa
We will drop loop_ctl_mutex before calling blkdev_reread_part() in
order to fix the circular locking dependency between bdev->bd_mutex and
loop_ctl_mutex. To do that we need to make sure that we won't touch
"struct loop_device" after releasing loop_ctl_mutex.

As a preparation step, this patch reorganizes the loop_reread_partitions()
callers. According to Ming Lei, calling loop_unprepare_queue() before
loop_reread_partitions() (as we did until 3.19) is fine. Therefore,
this patch will not cause user-visible changes.

Signed-off-by: Tetsuo Handa 
Cc: Ming Lei 
---
 drivers/block/loop.c | 29 +++--
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 920cbb1..4b05a27 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -632,6 +632,11 @@ static void loop_reread_partitions(struct loop_device *lo,
   struct block_device *bdev)
 {
int rc;
+   char filename[LO_NAME_SIZE];
+   const int num = lo->lo_number;
+   const int count = atomic_read(&lo->lo_refcnt);
+
+   memcpy(filename, lo->lo_file_name, sizeof(filename));
 
/*
 * bd_mutex has been held already in release path, so don't
@@ -641,13 +646,13 @@ static void loop_reread_partitions(struct loop_device *lo,
 * must be at least one and it can only become zero when the
 * current holder is released.
 */
-   if (!atomic_read(&lo->lo_refcnt))
+   if (!count)
rc = __blkdev_reread_part(bdev);
else
rc = blkdev_reread_part(bdev);
if (rc)
pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
-   __func__, lo->lo_number, lo->lo_file_name, rc);
+   __func__, num, filename, rc);
 }
 
 static inline int is_loop_device(struct file *file)
@@ -730,9 +735,9 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
loop_update_dio(lo);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   fput(old_file);
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
loop_reread_partitions(lo, bdev);
+   fput(old_file);
return 0;
 
  out_putf:
@@ -971,16 +976,18 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
mode,
set_blocksize(bdev, S_ISBLK(inode->i_mode) ?
  block_size(inode->i_bdev) : PAGE_SIZE);
 
+   /*
+* Grab the block_device to prevent its destruction after we
+* put /dev/loopXX inode. Later in loop_clr_fd() we bdput(bdev).
+*/
+   bdgrab(bdev);
+
lo->lo_state = Lo_bound;
if (part_shift)
lo->lo_flags |= LO_FLAGS_PARTSCAN;
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
loop_reread_partitions(lo, bdev);
 
-   /* Grab the block_device to prevent its destruction after we
-* put /dev/loopXX inode. Later in loop_clr_fd() we bdput(bdev).
-*/
-   bdgrab(bdev);
return 0;
 
  out_putf:
@@ -1033,6 +1040,7 @@ static int loop_clr_fd(struct loop_device *lo)
struct file *filp = lo->lo_backing_file;
gfp_t gfp = lo->old_gfp_mask;
struct block_device *bdev = lo->lo_device;
+   bool reread;
 
if (lo->lo_state != Lo_bound)
return -ENXIO;
@@ -1096,12 +1104,13 @@ static int loop_clr_fd(struct loop_device *lo)
module_put(THIS_MODULE);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
-   loop_reread_partitions(lo, bdev);
+   reread = (lo->lo_flags & LO_FLAGS_PARTSCAN) && bdev;
+   loop_unprepare_queue(lo);
lo->lo_flags = 0;
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
-   loop_unprepare_queue(lo);
+   if (reread)
+   loop_reread_partitions(lo, bdev);
mutex_unlock(&loop_ctl_mutex);
/*
 * Need not hold loop_ctl_mutex to fput backing file.
-- 
1.8.3.1



[PATCH 1/4] block/loop: Don't grab "struct file" for vfs_getattr() operation.

2018-09-25 Thread Tetsuo Handa
vfs_getattr() needs "struct path" rather than "struct file".
Let's use path_get()/path_put() rather than get_file()/fput().

Signed-off-by: Tetsuo Handa 
Reviewed-by: Jan Kara 
---
 drivers/block/loop.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index abad6d1..c2745e6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1206,7 +1206,7 @@ static int loop_clr_fd(struct loop_device *lo)
 static int
 loop_get_status(struct loop_device *lo, struct loop_info64 *info)
 {
-   struct file *file;
+   struct path path;
struct kstat stat;
int ret;
 
@@ -1231,16 +1231,16 @@ static int loop_clr_fd(struct loop_device *lo)
}
 
/* Drop lo_ctl_mutex while we call into the filesystem. */
-   file = get_file(lo->lo_backing_file);
+   path = lo->lo_backing_file->f_path;
+   path_get(&path);
mutex_unlock(&lo->lo_ctl_mutex);
-   ret = vfs_getattr(&file->f_path, &stat, STATX_INO,
- AT_STATX_SYNC_AS_STAT);
+   ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
info->lo_inode = stat.ino;
info->lo_rdevice = huge_encode_dev(stat.rdev);
}
-   fput(file);
+   path_put(&path);
return ret;
 }
 
-- 
1.8.3.1



[PATCH 2/4] block/loop: Use global lock for ioctl() operation.

2018-09-25 Thread Tetsuo Handa
syzbot is reporting a NULL pointer dereference [1] caused by a race
condition between ioctl(loop_fd, LOOP_CLR_FD, 0) and
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd), due to loop_validate_file()
traversing other loop devices without holding the corresponding
lo->lo_ctl_mutex locks.

Since ioctl() requests on loop devices are not frequent, we don't need
fine-grained locking. Let's use a global lock in order to allow safe
traversal in loop_validate_file().

Note that syzbot is also reporting a circular locking dependency between
bdev->bd_mutex and lo->lo_ctl_mutex [2], caused by calling
blkdev_reread_part() with the lock held. This patch does not address it.

[1] https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
Reviewed-by: Jan Kara 
---
 drivers/block/loop.c | 58 ++--
 drivers/block/loop.h |  1 -
 2 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c2745e6..920cbb1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -85,6 +85,7 @@
 
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_index_mutex);
+static DEFINE_MUTEX(loop_ctl_mutex);
 
 static int max_part;
 static int part_shift;
@@ -1048,7 +1049,7 @@ static int loop_clr_fd(struct loop_device *lo)
 */
if (atomic_read(&lo->lo_refcnt) > 1) {
lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return 0;
}
 
@@ -1101,12 +1102,12 @@ static int loop_clr_fd(struct loop_device *lo)
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
loop_unprepare_queue(lo);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
/*
-* Need not hold lo_ctl_mutex to fput backing file.
-* Calling fput holding lo_ctl_mutex triggers a circular
+* Need not hold loop_ctl_mutex to fput backing file.
+* Calling fput holding loop_ctl_mutex triggers a circular
 * lock dependency possibility warning as fput can take
-* bd_mutex which is usually taken before lo_ctl_mutex.
+* bd_mutex which is usually taken before loop_ctl_mutex.
 */
fput(filp);
return 0;
@@ -1211,7 +1212,7 @@ static int loop_clr_fd(struct loop_device *lo)
int ret;
 
if (lo->lo_state != Lo_bound) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -ENXIO;
}
 
@@ -1230,10 +1231,10 @@ static int loop_clr_fd(struct loop_device *lo)
   lo->lo_encrypt_key_size);
}
 
-   /* Drop lo_ctl_mutex while we call into the filesystem. */
+   /* Drop loop_ctl_mutex while we call into the filesystem. */
path = lo->lo_backing_file->f_path;
path_get(&path);
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
if (!ret) {
info->lo_device = huge_encode_dev(stat.dev);
@@ -1325,7 +1326,7 @@ static int loop_clr_fd(struct loop_device *lo)
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1343,7 +1344,7 @@ static int loop_clr_fd(struct loop_device *lo)
int err;
 
if (!arg) {
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&loop_ctl_mutex);
return -EINVAL;
}
err = loop_get_status(lo, &info64);
@@ -1401,7 +1402,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
struct loop_device *lo = bdev->bd_disk->private_data;
int err;
 
-   err = mutex_lock_killable_nested(&lo->lo_ctl_mutex, 1);
+   err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
if (err)
goto out_unlocked;
 
@@ -1413,7 +1414,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
err = loop_change_fd(lo, bdev, arg);
break;
case LOOP_CLR_FD:
-   /* loop_clr_fd would have unlocked lo_ctl_mutex on success */
+   /* loop_clr_fd would have unlocked loop_ctl_mutex on success */
err = loop_clr_fd(lo);
if (!err)
goto out_unlocked;
@@ -1426,7 +1427,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
break;
case LOOP_GET_STATUS:
err = loop_get_status_old(lo, (struct loop_info __user *) arg);
-   /* loop_get_status() unlocks lo_ctl_mutex */
+  

Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks

2018-09-25 Thread Bart Van Assche
On Tue, 2018-09-25 at 08:39 -0600, Keith Busch wrote:
> On Tue, Sep 25, 2018 at 10:39:46AM +0800, jianchao.wang wrote:
> > But the issue is the left part of blk_mq_timeout_work is moved out of 
> > protection of q refcount.
> 
> I'm not sure what you mean by "left part". The only part that's now
> outside the reference with this patch is the part Bart pointed out.
> 
> This looks like it may be fixed by either moving the refcount back up a
> level to all the callers of blk_mq_queue_tag_busy_iter, or adding
> cancel_work_sync(&q->timeout_work) to __blk_mq_update_nr_hw_queues after
> the freeze.

Hi Keith,

How about applying the following (untested) patch on top of your patch?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 019f9b169887..099e203b5213 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -851,6 +851,9 @@ static void blk_mq_timeout_work(struct work_struct *work)
if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
return;
 
+   if (!percpu_ref_tryget(&q->q_usage_counter))
+   return;
+
if (next != 0) {
mod_timer(&q->timeout, next);
} else {
@@ -866,6 +869,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
blk_mq_tag_idle(hctx);
}
}
+   blk_queue_exit(q);
 }

Thanks,

Bart.


Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks

2018-09-25 Thread Keith Busch
On Tue, Sep 25, 2018 at 10:39:46AM +0800, jianchao.wang wrote:
> Hi Bart
> 
> On 09/25/2018 10:20 AM, Bart Van Assche wrote:
> > On 9/24/18 7:11 PM, jianchao.wang wrote:
> >> Hi Keith
> >>
> >> On 09/25/2018 05:09 AM, Keith Busch wrote:
> >>> -    /* A deadlock might occur if a request is stuck requiring a
> >>> - * timeout at the same time a queue freeze is waiting
> >>> - * completion, since the timeout code would not be able to
> >>> - * acquire the queue reference here.
> >>> - *
> >>> - * That's why we don't use blk_queue_enter here; instead, we use
> >>> - * percpu_ref_tryget directly, because we need to be able to
> >>> - * obtain a reference even in the short window between the queue
> >>> - * starting to freeze, by dropping the first reference in
> >>> - * blk_freeze_queue_start, and the moment the last request is
> >>> - * consumed, marked by the instant q_usage_counter reaches
> >>> - * zero.
> >>> - */
> >>> -    if (!percpu_ref_tryget(&q->q_usage_counter))
> >>> +    if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
> >>>   return;
> >>
> >> We cannot discard the percpu_ref_tryget here.
> >>
> >> There following code in blk_mq_timeout_work still need it:
> >>
> >> if (next != 0) {
> >>     mod_timer(&q->timeout, next);
> >> } else {
> >>     queue_for_each_hw_ctx(q, hctx, i) {
> >>     /* the hctx may be unmapped, so check it here */
> >>     if (blk_mq_hw_queue_mapped(hctx))
> >>     blk_mq_tag_idle(hctx);
> >>     }
> >> }
> > 
> > Hi Jianchao,
> > 
> > Had you noticed that the percpu_ref_tryget() call has been moved into
> > blk_mq_queue_tag_busy_iter()?
> > 
> 
> Yes.
> 
> But the issue is the left part of blk_mq_timeout_work is moved out of 
> protection of q refcount.

I'm not sure what you mean by "left part". The only part that's now
outside the reference with this patch is the part Bart pointed out.

This looks like it may be fixed by either moving the refcount back up a
level to all the callers of blk_mq_queue_tag_busy_iter, or adding
cancel_work_sync(&q->timeout_work) to __blk_mq_update_nr_hw_queues after
the freeze.


Re: [PATCH] block/loop: Don't hold lock while rereading partition.

2018-09-25 Thread Jan Kara
On Tue 25-09-18 14:10:03, Tetsuo Handa wrote:
> syzbot is reporting a circular locking dependency between bdev->bd_mutex
> and lo->lo_ctl_mutex [1], caused by calling blkdev_reread_part() with the
> lock held. Don't hold loop_ctl_mutex while calling blkdev_reread_part().
> Also, move the bdgrab() call in loop_set_fd() to before
> loop_reread_partitions() in case loop_clr_fd() is called while
> blkdev_reread_part() from loop_set_fd() is in progress.
> 
> [1] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889
> 
> Signed-off-by: Tetsuo Handa 
> Reported-by: syzbot 
> 

Thank you for splitting out this patch. Some comments below.

> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 920cbb1..877cca8 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -632,7 +632,12 @@ static void loop_reread_partitions(struct loop_device *lo,
>  struct block_device *bdev)
>  {
>   int rc;
> + char filename[LO_NAME_SIZE];
> + const int num = lo->lo_number;
> + const int count = atomic_read(&lo->lo_refcnt);
>  
> + memcpy(filename, lo->lo_file_name, sizeof(filename));
> + mutex_unlock(&loop_ctl_mutex);
>   /*
>* bd_mutex has been held already in release path, so don't
>* acquire it if this function is called in such case.
> @@ -641,13 +646,14 @@ static void loop_reread_partitions(struct loop_device *lo,
>* must be at least one and it can only become zero when the
>* current holder is released.
>*/
> - if (!atomic_read(&lo->lo_refcnt))
> + if (!count)
>   rc = __blkdev_reread_part(bdev);
>   else
>   rc = blkdev_reread_part(bdev);
> + mutex_lock(&loop_ctl_mutex);
>   if (rc)
>   pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
> - __func__, lo->lo_number, lo->lo_file_name, rc);
> + __func__, num, filename, rc);
>  }

I still don't quite like this. It is non-trivial to argue that the
temporary dropping of loop_ctl_mutex in loop_reread_partitions() is OK for
all its callers. I'm really strongly in favor of unlocking the mutex in
the callers of loop_reread_partitions() and reorganizing the code there so
that loop_reread_partitions() is called as late as possible, making it
clear that dropping the mutex is fine (and usually we don't even have to
reacquire it). Plus, your patch does not seem to take care of the possible
races of loop_clr_fd() with LOOP_CTL_REMOVE; see my other mail for
details...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] block/loop: Use global lock for ioctl() operation.

2018-09-25 Thread Jan Kara
On Tue 25-09-18 13:21:01, Tetsuo Handa wrote:
> syzbot is reporting a NULL pointer dereference [1] caused by a race
> condition between ioctl(loop_fd, LOOP_CLR_FD, 0) and
> ioctl(other_loop_fd, LOOP_SET_FD, loop_fd), due to loop_validate_file()
> traversing other loop devices without holding the corresponding
> lo->lo_ctl_mutex locks.
> 
> Since ioctl() requests on loop devices are not frequent, we don't need
> fine-grained locking. Let's use a global lock in order to allow safe
> traversal in loop_validate_file().
> 
> Note that syzbot is also reporting a circular locking dependency between
> bdev->bd_mutex and lo->lo_ctl_mutex [2], caused by calling
> blkdev_reread_part() with the lock held. This patch does not address it.
> 
> [1] https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
> [2] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889
> 
> v2: Don't call mutex_init() upon loop_add() request.
> 
> Signed-off-by: Tetsuo Handa 
> Reported-by: syzbot 

Yeah, this is really simple! Thank you for the patch. You can add:

Reviewed-by: Jan Kara 

As a separate cleanup patch, you could drop loop_index_mutex and use
loop_ctl_mutex instead as there's now no reason to have two of them. But it
would not be completely trivial AFAICS.

Honza

> ---
>  drivers/block/loop.c | 58 ++--
>  drivers/block/loop.h |  1 -
>  2 files changed, 29 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index c2745e6..920cbb1 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -85,6 +85,7 @@
>  
>  static DEFINE_IDR(loop_index_idr);
>  static DEFINE_MUTEX(loop_index_mutex);
> +static DEFINE_MUTEX(loop_ctl_mutex);
>  
>  static int max_part;
>  static int part_shift;
> @@ -1048,7 +1049,7 @@ static int loop_clr_fd(struct loop_device *lo)
>*/
>   if (atomic_read(&lo->lo_refcnt) > 1) {
>   lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
> - mutex_unlock(&lo->lo_ctl_mutex);
> + mutex_unlock(&loop_ctl_mutex);
>   return 0;
>   }
>  
> @@ -1101,12 +1102,12 @@ static int loop_clr_fd(struct loop_device *lo)
>   if (!part_shift)
>   lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
>   loop_unprepare_queue(lo);
> - mutex_unlock(&lo->lo_ctl_mutex);
> + mutex_unlock(&loop_ctl_mutex);
>   /*
> -  * Need not hold lo_ctl_mutex to fput backing file.
> -  * Calling fput holding lo_ctl_mutex triggers a circular
> +  * Need not hold loop_ctl_mutex to fput backing file.
> +  * Calling fput holding loop_ctl_mutex triggers a circular
>* lock dependency possibility warning as fput can take
> -  * bd_mutex which is usually taken before lo_ctl_mutex.
> +  * bd_mutex which is usually taken before loop_ctl_mutex.
>*/
>   fput(filp);
>   return 0;
> @@ -1211,7 +1212,7 @@ static int loop_clr_fd(struct loop_device *lo)
>   int ret;
>  
>   if (lo->lo_state != Lo_bound) {
> - mutex_unlock(&lo->lo_ctl_mutex);
> + mutex_unlock(&loop_ctl_mutex);
>   return -ENXIO;
>   }
>  
> @@ -1230,10 +1231,10 @@ static int loop_clr_fd(struct loop_device *lo)
>  lo->lo_encrypt_key_size);
>   }
>  
> - /* Drop lo_ctl_mutex while we call into the filesystem. */
> + /* Drop loop_ctl_mutex while we call into the filesystem. */
>   path = lo->lo_backing_file->f_path;
>   path_get(&path);
> - mutex_unlock(&lo->lo_ctl_mutex);
> + mutex_unlock(&loop_ctl_mutex);
>   ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
>   if (!ret) {
>   info->lo_device = huge_encode_dev(stat.dev);
> @@ -1325,7 +1326,7 @@ static int loop_clr_fd(struct loop_device *lo)
>   int err;
>  
>   if (!arg) {
> - mutex_unlock(&lo->lo_ctl_mutex);
> + mutex_unlock(&loop_ctl_mutex);
>   return -EINVAL;
>   }
>   err = loop_get_status(lo, &info64);
> @@ -1343,7 +1344,7 @@ static int loop_clr_fd(struct loop_device *lo)
>   int err;
>  
>   if (!arg) {
> - mutex_unlock(&lo->lo_ctl_mutex);
> + mutex_unlock(&loop_ctl_mutex);
>   return -EINVAL;
>   }
>   err = loop_get_status(lo, &info64);
> @@ -1401,7 +1402,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>   struct loop_device *lo = bdev->bd_disk->private_data;
>   int err;
>  
> - err = mutex_lock_killable_nested(&lo->lo_ctl_mutex, 1);
> + err = mutex_lock_killable_nested(&loop_ctl_mutex, 1);
>   if (err)
>   goto out_unlocked;
>  
> @@ -1413,7 +1414,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>   err = loop_change_fd(lo, bdev, arg);
>   break;
>   case LOOP_CLR_FD:
>

Re: [PATCH] block/loop: Don't grab "struct file" for vfs_getattr() operation.

2018-09-25 Thread Jan Kara
On Tue 25-09-18 12:51:01, Tetsuo Handa wrote:
> vfs_getattr() needs "struct path" rather than "struct file".
> Let's use path_get()/path_put() rather than get_file()/fput().
> 
> Signed-off-by: Tetsuo Handa 

Thanks for splitting off the cleanup. The patch looks good to me. Feel free
to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  drivers/block/loop.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index abad6d1..c2745e6 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1206,7 +1206,7 @@ static int loop_clr_fd(struct loop_device *lo)
>  static int
>  loop_get_status(struct loop_device *lo, struct loop_info64 *info)
>  {
> - struct file *file;
> + struct path path;
>   struct kstat stat;
>   int ret;
>  
> @@ -1231,16 +1231,16 @@ static int loop_clr_fd(struct loop_device *lo)
>   }
>  
>   /* Drop lo_ctl_mutex while we call into the filesystem. */
> - file = get_file(lo->lo_backing_file);
> + path = lo->lo_backing_file->f_path;
> + path_get(&path);
>   mutex_unlock(&lo->lo_ctl_mutex);
> - ret = vfs_getattr(&file->f_path, &stat, STATX_INO,
> -   AT_STATX_SYNC_AS_STAT);
> + ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
>   if (!ret) {
>   info->lo_device = huge_encode_dev(stat.dev);
>   info->lo_inode = stat.ino;
>   info->lo_rdevice = huge_encode_dev(stat.rdev);
>   }
> - fput(file);
> + path_put(&path);
>   return ret;
>  }
>  
> -- 
> 1.8.3.1
> 
-- 
Jan Kara 
SUSE Labs, CR