[PATCH] blk-mq: I/O and timer unplugs are inverted in blktrace
trace_block_unplug() takes true for explicit unplugs and false for implicit unplugs. schedule() unplugs are implicit and should be reported as timer unplugs. While correct in the legacy code, this has been inverted in blk-mq since 4.11. Cc: sta...@vger.kernel.org Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers") Signed-off-by: Ilya Dryomov --- block/blk-mq.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 85a1c1a59c72..e3c39ea8e17b 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1628,7 +1628,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule) BUG_ON(!rq->q); if (rq->mq_ctx != this_ctx) { if (this_ctx) { - trace_block_unplug(this_q, depth, from_schedule); + trace_block_unplug(this_q, depth, !from_schedule); blk_mq_sched_insert_requests(this_q, this_ctx, _list, from_schedule); @@ -1648,7 +1648,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule) * on 'ctx_list'. Do those. */ if (this_ctx) { - trace_block_unplug(this_q, depth, from_schedule); + trace_block_unplug(this_q, depth, !from_schedule); blk_mq_sched_insert_requests(this_q, this_ctx, _list, from_schedule); } -- 2.14.4
Re: [PATCH] [v2] rbd: avoid Wreturn-type warnings
On Wed, Apr 4, 2018 at 2:53 PM, Arnd Bergmannwrote: > A new set of warnings appeared in next-20180403 in some configurations > when gcc cannot see that rbd_assert(0) leads to an unreachable code > path: > > drivers/block/rbd.c: In function 'rbd_img_is_write': > drivers/block/rbd.c:1397:1: error: control reaches end of non-void function > [-Werror=return-type] > drivers/block/rbd.c: In function '__rbd_obj_handle_request': > drivers/block/rbd.c:2499:1: error: control reaches end of non-void function > [-Werror=return-type] > drivers/block/rbd.c: In function 'rbd_obj_handle_write': > drivers/block/rbd.c:2471:1: error: control reaches end of non-void function > [-Werror=return-type] > > As the rbd_assert() here shows has no extra information beyond the verbose > BUG(), we can simply use BUG() directly in its place. This is reliably > detected as not returning on any architecture, since it doesn't depend > on the unlikely() comparison that confused gcc. > > Fixes: 3da691bf4366 ("rbd: new request handling code") > Signed-off-by: Arnd Bergmann > --- > drivers/block/rbd.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index 07dc5419bd63..5f7f4d4b78a8 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -1392,7 +1392,7 @@ static bool rbd_img_is_write(struct rbd_img_request > *img_req) > case OBJ_OP_DISCARD: > return true; > default: > - rbd_assert(0); > + BUG(); > } > } > > @@ -2466,7 +2466,7 @@ static bool rbd_obj_handle_write(struct rbd_obj_request > *obj_req) > } > return false; > default: > - rbd_assert(0); > + BUG(); > } > } > > @@ -2494,7 +2494,7 @@ static bool __rbd_obj_handle_request(struct > rbd_obj_request *obj_req) > } > return false; > default: > - rbd_assert(0); > + BUG(); > } > } > Applied. Thanks, Ilya
Re: [PATCH] rbd: add missing return statements
On Wed, Apr 4, 2018 at 11:49 AM, Arnd Bergmannwrote: > A new set of warnings appeared in next-20180403 in some configurations > when gcc cannot see that rbd_assert(0) leads to an unreachable code > path: > > drivers/block/rbd.c: In function 'rbd_img_is_write': > drivers/block/rbd.c:1397:1: error: control reaches end of non-void function > [-Werror=return-type] > drivers/block/rbd.c: In function '__rbd_obj_handle_request': > drivers/block/rbd.c:2499:1: error: control reaches end of non-void function > [-Werror=return-type] > drivers/block/rbd.c: In function 'rbd_obj_handle_write': > drivers/block/rbd.c:2471:1: error: control reaches end of non-void function > [-Werror=return-type] > > To work around this, we can add a return statement to each of these > cases. An alternative would be to remove the unlikely() annotation > in rbd_assert(), or to just use BUG()/BUG_ON() directly. This adds the > return statements, guessing what the most reasonable behavior > would be. Hi Arnd, I don't like these bogus return statements. Let's go with explicit BUG/BUG_ON() instead. Thanks, Ilya
Re: [PATCH] rbd: remove VLA usage
On Fri, Mar 30, 2018 at 9:17 PM, Gustavo A. R. Silvawrote: > In preparation to enabling -Wvla, remove the use of stack VLA. > > In this particular case, variable buf_size is replaced with a constant > expression that can be computed at preprocessing time. This avoids two > VLA warnings. Also, as a consequence of the mentioned change, some code > was slightly refactored. > > The use of stack Variable Length Arrays needs to be avoided, as they > can be a vector for stack exhaustion, which can be both a runtime bug > or a security flaw. Also, in general, as code evolves it is easy to > lose track of how big a VLA can get. Thus, we can end up having runtime > failures that are hard to debug. > > Also, fixed as part of the directive to remove all VLAs from > the kernel: https://lkml.org/lkml/2018/3/7/621 > > Signed-off-by: Gustavo A. R. Silva > --- > drivers/block/rbd.c | 15 +++ > 1 file changed, 7 insertions(+), 8 deletions(-) > > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index 1e03b04..5133122 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -3091,20 +3091,20 @@ static int __rbd_notify_op_lock(struct rbd_device > *rbd_dev, > { > struct ceph_osd_client *osdc = _dev->rbd_client->client->osdc; > struct rbd_client_id cid = rbd_get_cid(rbd_dev); > - int buf_size = 4 + 8 + 8 + CEPH_ENCODING_START_BLK_LEN; > - char buf[buf_size]; > + char buf[4 + 8 + 8 + CEPH_ENCODING_START_BLK_LEN]; > void *p = buf; > > dout("%s rbd_dev %p notify_op %d\n", __func__, rbd_dev, notify_op); > > /* encode *LockPayload NotifyMessage (op + ClientId) */ > - ceph_start_encoding(, 2, 1, buf_size - CEPH_ENCODING_START_BLK_LEN); > + ceph_start_encoding(, 2, 1, > + sizeof(buf) - CEPH_ENCODING_START_BLK_LEN); > ceph_encode_32(, notify_op); > ceph_encode_64(, cid.gid); > ceph_encode_64(, cid.handle); > > return ceph_osdc_notify(osdc, _dev->header_oid, > - _dev->header_oloc, buf, buf_size, > + _dev->header_oloc, buf, sizeof(buf), > RBD_NOTIFY_TIMEOUT, preply_pages, preply_len); > } > > @@ -3610,19 +3610,18 @@ static void __rbd_acknowledge_notify(struct > rbd_device *rbd_dev, > u64 notify_id, u64 cookie, s32 *result) > { > struct ceph_osd_client *osdc = _dev->rbd_client->client->osdc; > - int buf_size = 4 + CEPH_ENCODING_START_BLK_LEN; > - char buf[buf_size]; > + char buf[4 + CEPH_ENCODING_START_BLK_LEN]; > + int buf_size = 0; > int ret; > > if (result) { > void *p = buf; > > + buf_size = sizeof(buf); > /* encode ResponseMessage */ > ceph_start_encoding(, 1, 1, > buf_size - CEPH_ENCODING_START_BLK_LEN); > ceph_encode_32(, *result); > - } else { > - buf_size = 0; > } > > ret = ceph_osdc_notify_ack(osdc, _dev->header_oid, Already fixed: https://github.com/ceph/ceph-client/commit/2474e77e75215a3cad3b652d2f55ba5b123cb119 Thanks, Ilya
Re: [PATCH] rbd: fix spelling mistake: "reregisteration" -> "re-registration"
On Tue, Mar 20, 2018 at 11:39 AM, Colin Ian Kingwrote: > On 20/03/18 10:35, Dongsheng Yang wrote: >> Hi Colin, >> >> On 03/19/2018 09:33 PM, Colin King wrote: >>> From: Colin Ian King >>> >>> Trivial fix to spelling mistake in rdb_warn message text >>> >>> Signed-off-by: Colin Ian King >>> --- >>> drivers/block/rbd.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c >>> index 1e03b04819c8..7b97240ff15e 100644 >>> --- a/drivers/block/rbd.c >>> +++ b/drivers/block/rbd.c >>> @@ -3887,7 +3887,7 @@ static void rbd_reregister_watch(struct >>> work_struct *work) >>> ret = rbd_dev_refresh(rbd_dev); >>> if (ret) >>> -rbd_warn(rbd_dev, "reregisteration refresh failed: %d", ret); >>> +rbd_warn(rbd_dev, "re-registrationrefresh failed: %d", ret); >> >> Hmmm, I am not sure is that a spelling mistake, because the function >> name is rbd_reregister_watch(). > > OK, well reregisteration should be reregistration then. I went ahead and applied the change to "reregistration". Thanks, Ilya
Re: [PATCH v3 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()
On Thu, Jan 11, 2018 at 2:09 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > Hello, > > I was doing some cleanup work on rbd BLKROSET handler and discovered > that we ignore partition rw/ro setting (hd_struct->policy) for pretty > much everything but straight writes. > > David (CCed) has blktests patches standing by. > > (Another aspect of this is that we don't enforce open(2) mode. Tejun > took a stab at this a few years ago, but his patch had to be reverted: > > 75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()") > e51900f7d38c ("block: revert block_dev read-only check") > > It is a separate issue and refusing writes to read-only devices is > obviously more important, but perhaps it's time to revisit that as > well?) > > v2 -> v3: > - lookup part only once; combine read-only check with existing > should_fail_request check > > v1 -> v2: > - added unlikely() per Sagi's suggestion > > Thanks, > > Ilya > > > Ilya Dryomov (2): > block: fail op_is_write() requests to read-only partitions > block: add bdev_read_only() checks to common helpers > > block/blk-core.c | 56 > ++-- > block/blk-lib.c | 12 > 2 files changed, 50 insertions(+), 18 deletions(-) Jens, could you please pick these up for 4.16? Thanks, Ilya
[PATCH v3 2/2] block: add bdev_read_only() checks to common helpers
Similar to blkdev_write_iter(), return -EPERM if the partition is read-only. This covers ioctl(), fallocate() and most in-kernel users but isn't meant to be exhaustive -- everything else will be caught in generic_make_request_checks(), fail with -EIO and can be fixed later. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-lib.c | 12 1 file changed, 12 insertions(+) diff --git a/block/blk-lib.c b/block/blk-lib.c index 2bc544ce3d2e..a676084d4740 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -37,6 +37,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + if (flags & BLKDEV_DISCARD_SECURE) { if (!blk_queue_secure_erase(q)) return -EOPNOTSUPP; @@ -156,6 +159,9 @@ static int __blkdev_issue_write_same(struct block_device *bdev, sector_t sector, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; if ((sector | nr_sects) & bs_mask) return -EINVAL; @@ -233,6 +239,9 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + /* Ensure that max_write_zeroes_sectors doesn't overflow bi_size */ max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev); @@ -287,6 +296,9 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + while (nr_sects != 0) { bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), gfp_mask); -- 2.4.3
[PATCH v3 1/2] block: fail op_is_write() requests to read-only partitions
Regular block device writes go through blkdev_write_iter(), which does bdev_read_only(), while zeroout/discard/etc requests are never checked, both userspace- and kernel-triggered. Add a generic catch-all check to generic_make_request_checks() to actually enforce ioctl(BLKROSET) and set_disk_ro(), which is used by quite a few drivers for things like snapshots, read-only backing files/images, etc. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-core.c | 56 ++-- 1 file changed, 38 insertions(+), 18 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index f843ae4f858d..44a6dfe47ae1 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2062,6 +2062,21 @@ static inline bool should_fail_request(struct hd_struct *part, #endif /* CONFIG_FAIL_MAKE_REQUEST */ +static inline bool bio_check_ro(struct bio *bio, struct hd_struct *part) +{ + if (part->policy && op_is_write(bio_op(bio))) { + char b[BDEVNAME_SIZE]; + + printk(KERN_ERR + "generic_make_request: Trying to write " + "to read-only block-device %s (partno %d)\n", + bio_devname(bio, b), part->partno); + return true; + } + + return false; +} + /* * Remap block n of partition p to block n+start(p) of the disk. */ @@ -2070,27 +2085,28 @@ static inline int blk_partition_remap(struct bio *bio) struct hd_struct *p; int ret = 0; + rcu_read_lock(); + p = __disk_get_part(bio->bi_disk, bio->bi_partno); + if (unlikely(!p || should_fail_request(p, bio->bi_iter.bi_size) || +bio_check_ro(bio, p))) { + ret = -EIO; + goto out; + } + /* * Zone reset does not include bi_size so bio_sectors() is always 0. * Include a test for the reset op code and perform the remap if needed. */ - if (!bio->bi_partno || - (!bio_sectors(bio) && bio_op(bio) != REQ_OP_ZONE_RESET)) - return 0; + if (!bio_sectors(bio) && bio_op(bio) != REQ_OP_ZONE_RESET) + goto out; - rcu_read_lock(); - p = __disk_get_part(bio->bi_disk, bio->bi_partno); - if (likely(p && !should_fail_request(p, bio->bi_iter.bi_size))) { - bio->bi_iter.bi_sector += p->start_sect; - bio->bi_partno = 0; - trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p), - bio->bi_iter.bi_sector - p->start_sect); - } else { - printk("%s: fail for partition %d\n", __func__, bio->bi_partno); - ret = -EIO; - } - rcu_read_unlock(); + bio->bi_iter.bi_sector += p->start_sect; + bio->bi_partno = 0; + trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p), + bio->bi_iter.bi_sector - p->start_sect); +out: + rcu_read_unlock(); return ret; } @@ -2149,15 +2165,19 @@ generic_make_request_checks(struct bio *bio) * For a REQ_NOWAIT based request, return -EOPNOTSUPP * if queue is not a request based queue. */ - if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) goto not_supported; if (should_fail_request(>bi_disk->part0, bio->bi_iter.bi_size)) goto end_io; - if (blk_partition_remap(bio)) - goto end_io; + if (!bio->bi_partno) { + if (unlikely(bio_check_ro(bio, >bi_disk->part0))) + goto end_io; + } else { + if (blk_partition_remap(bio)) + goto end_io; + } if (bio_check_eod(bio, nr_sectors)) goto end_io; -- 2.4.3
[PATCH v3 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()
Hello, I was doing some cleanup work on rbd BLKROSET handler and discovered that we ignore partition rw/ro setting (hd_struct->policy) for pretty much everything but straight writes. David (CCed) has blktests patches standing by. (Another aspect of this is that we don't enforce open(2) mode. Tejun took a stab at this a few years ago, but his patch had to be reverted: 75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()") e51900f7d38c ("block: revert block_dev read-only check") It is a separate issue and refusing writes to read-only devices is obviously more important, but perhaps it's time to revisit that as well?) v2 -> v3: - lookup part only once; combine read-only check with existing should_fail_request check v1 -> v2: - added unlikely() per Sagi's suggestion Thanks, Ilya Ilya Dryomov (2): block: fail op_is_write() requests to read-only partitions block: add bdev_read_only() checks to common helpers block/blk-core.c | 56 ++-- block/blk-lib.c | 12 2 files changed, 50 insertions(+), 18 deletions(-) -- 2.4.3
Re: [PATCH v2 1/2] block: fail op_is_write() requests to read-only partitions
On Wed, Jan 10, 2018 at 5:34 PM, Jens Axboe <ax...@kernel.dk> wrote: > On 1/10/18 9:18 AM, Ilya Dryomov wrote: >> Regular block device writes go through blkdev_write_iter(), which does >> bdev_read_only(), while zeroout/discard/etc requests are never checked, >> both userspace- and kernel-triggered. Add a generic catch-all check to >> generic_make_request_checks() to actually enforce ioctl(BLKROSET) and >> set_disk_ro(), which is used by quite a few drivers for things like >> snapshots, read-only backing files/images, etc. >> >> Signed-off-by: Ilya Dryomov <idryo...@gmail.com> >> Reviewed-by: Sagi Grimberg <s...@grimberg.me> >> --- >> block/blk-core.c | 23 ++- >> 1 file changed, 22 insertions(+), 1 deletion(-) >> >> diff --git a/block/blk-core.c b/block/blk-core.c >> index f843ae4f858d..d132bec4a266 100644 >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -2123,6 +2123,20 @@ static inline int bio_check_eod(struct bio *bio, >> unsigned int nr_sectors) >> return 0; >> } >> >> +static inline bool bio_check_ro(struct bio *bio) >> +{ >> + struct hd_struct *p; >> + bool ret = false; >> + >> + rcu_read_lock(); >> + p = __disk_get_part(bio->bi_disk, bio->bi_partno); >> + if (!p || (p->policy && op_is_write(bio_op(bio >> + ret = true; >> + rcu_read_unlock(); >> + >> + return ret; >> +}> + >> static noinline_for_stack bool >> generic_make_request_checks(struct bio *bio) >> { >> @@ -2145,11 +2159,18 @@ generic_make_request_checks(struct bio *bio) >> goto end_io; >> } >> >> + if (unlikely(bio_check_ro(bio))) { >> + printk(KERN_ERR >> +"generic_make_request: Trying to write " >> + "to read-only block-device %s (partno %d)\n", >> + bio_devname(bio, b), bio->bi_partno); >> + goto end_io; >> + } > > It's yet another check that adds part lookup and rcu lock/unlock in that > path. Can we combine some of them? Make this part of the remap? This > overhead impacts every IO, let's not bloat it more than absolutely > necessary. Yes, combining with should_fail_request check in remap should be easy enough. I considered it, but opted for the less invasive patch. I'll re-spin. Thanks, Ilya
[PATCH v2 2/2] block: add bdev_read_only() checks to common helpers
Similar to blkdev_write_iter(), return -EPERM if the partition is read-only. This covers ioctl(), fallocate() and most in-kernel users but isn't meant to be exhaustive -- everything else will be caught in generic_make_request_checks(), fail with -EIO and can be fixed later. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-lib.c | 12 1 file changed, 12 insertions(+) diff --git a/block/blk-lib.c b/block/blk-lib.c index 2bc544ce3d2e..a676084d4740 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -37,6 +37,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + if (flags & BLKDEV_DISCARD_SECURE) { if (!blk_queue_secure_erase(q)) return -EOPNOTSUPP; @@ -156,6 +159,9 @@ static int __blkdev_issue_write_same(struct block_device *bdev, sector_t sector, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; if ((sector | nr_sects) & bs_mask) return -EINVAL; @@ -233,6 +239,9 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + /* Ensure that max_write_zeroes_sectors doesn't overflow bi_size */ max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev); @@ -287,6 +296,9 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + while (nr_sects != 0) { bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), gfp_mask); -- 2.4.3
[PATCH v2 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()
Hello, I was doing some cleanup work on rbd BLKROSET handler and discovered that we ignore partition rw/ro setting (hd_struct->policy) for pretty much everything but straight writes. David (CCed) has blktests patches standing by. (Another aspect of this is that we don't enforce open(2) mode. Tejun took a stab at this a few years ago, but his patch had to be reverted: 75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()") e51900f7d38c ("block: revert block_dev read-only check") It is a separate issue and refusing writes to read-only devices is obviously more important, but perhaps it's time to revisit that as well?) v1 -> v2: - added unlikely() per Sagi's suggestion Thanks, Ilya Ilya Dryomov (2): block: fail op_is_write() requests to read-only partitions block: add bdev_read_only() checks to common helpers block/blk-core.c | 23 ++- block/blk-lib.c | 12 2 files changed, 34 insertions(+), 1 deletion(-) -- 2.4.3
[PATCH v2 1/2] block: fail op_is_write() requests to read-only partitions
Regular block device writes go through blkdev_write_iter(), which does bdev_read_only(), while zeroout/discard/etc requests are never checked, both userspace- and kernel-triggered. Add a generic catch-all check to generic_make_request_checks() to actually enforce ioctl(BLKROSET) and set_disk_ro(), which is used by quite a few drivers for things like snapshots, read-only backing files/images, etc. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> Reviewed-by: Sagi Grimberg <s...@grimberg.me> --- block/blk-core.c | 23 ++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index f843ae4f858d..d132bec4a266 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2123,6 +2123,20 @@ static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors) return 0; } +static inline bool bio_check_ro(struct bio *bio) +{ + struct hd_struct *p; + bool ret = false; + + rcu_read_lock(); + p = __disk_get_part(bio->bi_disk, bio->bi_partno); + if (!p || (p->policy && op_is_write(bio_op(bio + ret = true; + rcu_read_unlock(); + + return ret; +} + static noinline_for_stack bool generic_make_request_checks(struct bio *bio) { @@ -2145,11 +2159,18 @@ generic_make_request_checks(struct bio *bio) goto end_io; } + if (unlikely(bio_check_ro(bio))) { + printk(KERN_ERR + "generic_make_request: Trying to write " + "to read-only block-device %s (partno %d)\n", + bio_devname(bio, b), bio->bi_partno); + goto end_io; + } + /* * For a REQ_NOWAIT based request, return -EOPNOTSUPP * if queue is not a request based queue. */ - if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) goto not_supported; -- 2.4.3
Re: [PATCH 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()
On Thu, Nov 16, 2017 at 10:02 AM, Ilya Dryomov <idryo...@gmail.com> wrote: > On Thu, Nov 9, 2017 at 7:44 PM, Ilya Dryomov <idryo...@gmail.com> wrote: >> Hello, >> >> I was doing some cleanup work on rbd BLKROSET handler and discovered >> that we ignore partition rw/ro setting (hd_struct->policy) for pretty >> much everything but straight writes. >> >> David (CCed) has blktests patches standing by. >> >> (Another aspect of this is that we don't enforce open(2) mode. Tejun >> took a stab at this a few years ago, but his patch had to be reverted: >> >> 75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()") >> e51900f7d38c ("block: revert block_dev read-only check") >> >> It is a separate issue and refusing writes to read-only devices is >> obviously more important, but perhaps it's time to revisit that as >> well?) >> >> Thanks, >> >> Ilya >> >> >> Ilya Dryomov (2): >> block: fail op_is_write() requests to read-only partitions >> block: add bdev_read_only() checks to common helpers >> >> block/blk-core.c | 23 ++- >> block/blk-lib.c | 12 >> 2 files changed, 34 insertions(+), 1 deletion(-) > > Ping... Christoph, Jens, could one of you please take a look? Ping? Thanks, Ilya
Re: [PATCH 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()
On Thu, Nov 9, 2017 at 7:44 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > Hello, > > I was doing some cleanup work on rbd BLKROSET handler and discovered > that we ignore partition rw/ro setting (hd_struct->policy) for pretty > much everything but straight writes. > > David (CCed) has blktests patches standing by. > > (Another aspect of this is that we don't enforce open(2) mode. Tejun > took a stab at this a few years ago, but his patch had to be reverted: > > 75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()") > e51900f7d38c ("block: revert block_dev read-only check") > > It is a separate issue and refusing writes to read-only devices is > obviously more important, but perhaps it's time to revisit that as > well?) > > Thanks, > > Ilya > > > Ilya Dryomov (2): > block: fail op_is_write() requests to read-only partitions > block: add bdev_read_only() checks to common helpers > > block/blk-core.c | 23 ++- > block/blk-lib.c | 12 > 2 files changed, 34 insertions(+), 1 deletion(-) Ping... Christoph, Jens, could one of you please take a look? Thanks, Ilya
[PATCH 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()
Hello, I was doing some cleanup work on rbd BLKROSET handler and discovered that we ignore partition rw/ro setting (hd_struct->policy) for pretty much everything but straight writes. David (CCed) has blktests patches standing by. (Another aspect of this is that we don't enforce open(2) mode. Tejun took a stab at this a few years ago, but his patch had to be reverted: 75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()") e51900f7d38c ("block: revert block_dev read-only check") It is a separate issue and refusing writes to read-only devices is obviously more important, but perhaps it's time to revisit that as well?) Thanks, Ilya Ilya Dryomov (2): block: fail op_is_write() requests to read-only partitions block: add bdev_read_only() checks to common helpers block/blk-core.c | 23 ++- block/blk-lib.c | 12 2 files changed, 34 insertions(+), 1 deletion(-) -- 2.4.3
[PATCH 2/2] block: add bdev_read_only() checks to common helpers
Similar to blkdev_write_iter(), return -EPERM if the partition is read-only. This covers ioctl(), fallocate() and most in-kernel users but isn't meant to be exhaustive -- everything else will be caught in generic_make_request_checks(), fail with -EIO and can be fixed later. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-lib.c | 12 1 file changed, 12 insertions(+) diff --git a/block/blk-lib.c b/block/blk-lib.c index f625fda5f095..64fe863ae43a 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -36,6 +36,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + if (flags & BLKDEV_DISCARD_SECURE) { if (!blk_queue_secure_erase(q)) return -EOPNOTSUPP; @@ -155,6 +158,9 @@ static int __blkdev_issue_write_same(struct block_device *bdev, sector_t sector, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; if ((sector | nr_sects) & bs_mask) return -EINVAL; @@ -232,6 +238,9 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + /* Ensure that max_write_zeroes_sectors doesn't overflow bi_size */ max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev); @@ -286,6 +295,9 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev, if (!q) return -ENXIO; + if (bdev_read_only(bdev)) + return -EPERM; + while (nr_sects != 0) { bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), gfp_mask); -- 2.4.3
[PATCH 1/2] block: fail op_is_write() requests to read-only partitions
Regular block device writes go through blkdev_write_iter(), which does bdev_read_only(), while zeroout/discard/etc requests are never checked, both userspace- and kernel-triggered. Add a generic catch-all check to generic_make_request_checks() to actually enforce ioctl(BLKROSET) and set_disk_ro(), which is used by quite a few drivers for things like snapshots, read-only backing files/images, etc. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-core.c | 23 ++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index b8d1aa2d1008..139ff47caf4a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2022,6 +2022,20 @@ static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors) return 0; } +static inline bool bio_check_ro(struct bio *bio) +{ + struct hd_struct *p; + int ret = false; + + rcu_read_lock(); + p = __disk_get_part(bio->bi_disk, bio->bi_partno); + if (!p || (p->policy && op_is_write(bio_op(bio + ret = true; + rcu_read_unlock(); + + return ret; +} + static noinline_for_stack bool generic_make_request_checks(struct bio *bio) { @@ -2044,11 +2058,18 @@ generic_make_request_checks(struct bio *bio) goto end_io; } + if (bio_check_ro(bio)) { + printk(KERN_ERR + "generic_make_request: Trying to write " + "to read-only block-device %s (partno %d)\n", + bio_devname(bio, b), bio->bi_partno); + goto end_io; + } + /* * For a REQ_NOWAIT based request, return -EOPNOTSUPP * if queue is not a request based queue. */ - if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) goto not_supported; -- 2.4.3
Re: [PATCH v3 0/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Mon, Oct 16, 2017 at 3:59 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > Hi Christoph, Martin, > > blkdev_issue_zeroout() now checks for any error. This required a minor > refactor, so I dropped the stable tag, Jens can add it back if needed. > > v2 -> v3: > - another code flow change in blkdev_issue_zeroout() suggested by > Christoph -- no functional changes > > v1 -> v2: > - changed code flow in blkdev_issue_zeroout() according to Christoph's > suggestion > - this required adding additional checks to blkdev_issue_zeroout() and > __blkdev_issue_zero_pages(), the latter is no longer void > > Previous version at > > https://marc.info/?l=linux-block=150712940922089=2 > > Thanks, > > Ilya > > > Ilya Dryomov (2): > block: factor out __blkdev_issue_zero_pages() > block: cope with WRITE ZEROES failing in blkdev_issue_zeroout() > > block/blk-lib.c | 108 > +--- > 1 file changed, 72 insertions(+), 36 deletions(-) Jens, can you pick these up? Technically this is a user-visible regression (Fixes tag in 2/2), but we are late in the cycle so I'm not sure what is the right thing to do here. Thanks, Ilya
[PATCH] block: move CAP_SYS_ADMIN check in blkdev_roset()
Check for CAP_SYS_ADMIN before calling into the driver, similar to blkdev_flushbuf(). This is safer and can spare a check in the driver. (Currently BLKROSET is overridden by md and rbd, rbd is missing the check. md has the check, but it covers a lot more than BLKROSET.) Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- Al, this appears to go back to your "[PATCH] block ioctl cleanup", history commit c6973580141c. 2002 was a long time ago, but still ;) Was there a reason you made BLKFLSBUF check for CAP_SYS_ADMIN before ->ioctl() and BLKROSET after? --- block/ioctl.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/block/ioctl.c b/block/ioctl.c index 0de02ee67eed..3f81bc50ac00 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -437,11 +437,12 @@ static int blkdev_roset(struct block_device *bdev, fmode_t mode, { int ret, n; + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg); if (!is_unrecognized_ioctl(ret)) return ret; - if (!capable(CAP_SYS_ADMIN)) - return -EACCES; if (get_user(n, (int __user *)arg)) return -EFAULT; set_device_ro(bdev, n); -- 2.4.3
Re: [PATCH v2 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Mon, Oct 16, 2017 at 1:44 PM, Christoph Hellwig <h...@infradead.org> wrote: > On Fri, Oct 06, 2017 at 02:31:20PM +0200, Ilya Dryomov wrote: >> This would unconditionally overwrite any WRITE ZEROS error. If we get >> e.g. -EIO, and manual zeroing is not allowed, I don't think we want to >> return -EOPNOTSUPP? >> >> Returning -EOPNOTSUPP to mean "can't zero using either method" doesn't >> make sense to me, because manual zeroing is always supported, just not >> always allowed. > > Then thow the throw bdev_write_zeroes_sectors check back in: > > if (ret && try_write_zeroes) { > if (!(flags & BLKDEV_ZERO_NOFALLBACK)) > try_write_zeroes = false; > goto retry; > } > if (!bdev_write_zeroes_sectors(bdev)) > ret = -EOPNOTSUPP; > } > > The important bit is that the structure in the current patch where the > bdev_write_zeroes_sectors check is on the same level as the method > selection is extremely confusing. I see. An updated version should be in your inbox. Thanks, Ilya
[PATCH v3 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to permit trying WRITE SAME on older SCSI devices, unless ->no_write_same is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO: $ fallocate -zn -l 1k /dev/sdg fallocate: fallocate failed: Remote I/O error $ fallocate -zn -l 1k /dev/sdg # OK $ fallocate -zn -l 1k /dev/sdg # OK The following calls succeed because sd_done() sets ->no_write_same in response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios. This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is specified. For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if sd_done() has just set ->no_write_same thus indicating lack of offload support. Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") Cc: Christoph Hellwig <h...@lst.de> Cc: "Martin K. Petersen" <martin.peter...@oracle.com> Cc: Hannes Reinecke <h...@suse.com> Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-lib.c | 45 +++-- 1 file changed, 35 insertions(+), 10 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 9d2ab8bba52a..23411ec7f3cc 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -321,12 +321,6 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev, * Zero-fill a block range, either using hardware offload or by explicitly * writing zeroes to the device. * - * Note that this function may fail with -EOPNOTSUPP if the driver signals - * zeroing offload support, but the device fails to process the command (for - * some devices there is no non-destructive way to verify whether this - * operation is actually supported). In this case the caller should call - * retry the call to blkdev_issue_zeroout() and the fallback path will be used. - * * If a device is using logical block provisioning, the underlying space will * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. * @@ -370,18 +364,49 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout); int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, unsigned flags) { - int ret; - struct bio *bio = NULL; + int ret = 0; + sector_t bs_mask; + struct bio *bio; struct blk_plug plug; + bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev); + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; + if ((sector | nr_sects) & bs_mask) + return -EINVAL; + +retry: + bio = NULL; blk_start_plug(); - ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, - , flags); + if (try_write_zeroes) { + ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, + gfp_mask, , flags); + } else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { + ret = __blkdev_issue_zero_pages(bdev, sector, nr_sects, + gfp_mask, ); + } else { + /* No zeroing offload support */ + ret = -EOPNOTSUPP; + } if (ret == 0 && bio) { ret = submit_bio_wait(bio); bio_put(bio); } blk_finish_plug(); + if (ret && try_write_zeroes) { + if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { + try_write_zeroes = false; + goto retry; + } + if (!bdev_write_zeroes_sectors(bdev)) { + /* +* Zeroing offload support was indicated, but the +* device reported ILLEGAL REQUEST (for some devices +* there is no non-destructive way to verify whether +* WRITE ZEROES is actually supported). +*/ + ret = -EOPNOTSUPP; + } + } return ret; } -- 2.4.3
[PATCH v3 1/2] block: factor out __blkdev_issue_zero_pages()
blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-lib.c | 63 + 1 file changed, 37 insertions(+), 26 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 62240f8832ca..9d2ab8bba52a 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -274,6 +274,40 @@ static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects) return min(pages, (sector_t)BIO_MAX_PAGES); } +static int __blkdev_issue_zero_pages(struct block_device *bdev, + sector_t sector, sector_t nr_sects, gfp_t gfp_mask, + struct bio **biop) +{ + struct request_queue *q = bdev_get_queue(bdev); + struct bio *bio = *biop; + int bi_size = 0; + unsigned int sz; + + if (!q) + return -ENXIO; + + while (nr_sects != 0) { + bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), + gfp_mask); + bio->bi_iter.bi_sector = sector; + bio_set_dev(bio, bdev); + bio_set_op_attrs(bio, REQ_OP_WRITE, 0); + + while (nr_sects != 0) { + sz = min((sector_t) PAGE_SIZE, nr_sects << 9); + bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); + nr_sects -= bi_size >> 9; + sector += bi_size >> 9; + if (bi_size < sz) + break; + } + cond_resched(); + } + + *biop = bio; + return 0; +} + /** * __blkdev_issue_zeroout - generate number of zero filed write bios * @bdev: blockdev to issue @@ -304,9 +338,6 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, unsigned flags) { int ret; - int bi_size = 0; - struct bio *bio = *biop; - unsigned int sz; sector_t bs_mask; bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; @@ -316,30 +347,10 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask, biop, flags); if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK)) - goto out; - - ret = 0; - while (nr_sects != 0) { - bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), - gfp_mask); - bio->bi_iter.bi_sector = sector; - bio_set_dev(bio, bdev); - bio_set_op_attrs(bio, REQ_OP_WRITE, 0); - - while (nr_sects != 0) { - sz = min((sector_t) PAGE_SIZE, nr_sects << 9); - bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); - nr_sects -= bi_size >> 9; - sector += bi_size >> 9; - if (bi_size < sz) - break; - } - cond_resched(); - } + return ret; - *biop = bio; -out: - return ret; + return __blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask, +biop); } EXPORT_SYMBOL(__blkdev_issue_zeroout); -- 2.4.3
[PATCH v3 0/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
Hi Christoph, Martin, blkdev_issue_zeroout() now checks for any error. This required a minor refactor, so I dropped the stable tag, Jens can add it back if needed. v2 -> v3: - another code flow change in blkdev_issue_zeroout() suggested by Christoph -- no functional changes v1 -> v2: - changed code flow in blkdev_issue_zeroout() according to Christoph's suggestion - this required adding additional checks to blkdev_issue_zeroout() and __blkdev_issue_zero_pages(), the latter is no longer void Previous version at https://marc.info/?l=linux-block=150712940922089=2 Thanks, Ilya Ilya Dryomov (2): block: factor out __blkdev_issue_zero_pages() block: cope with WRITE ZEROES failing in blkdev_issue_zeroout() block/blk-lib.c | 108 +--- 1 file changed, 72 insertions(+), 36 deletions(-) -- 2.4.3
Re: [PATCH v2 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Fri, Oct 6, 2017 at 2:31 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > On Fri, Oct 6, 2017 at 2:05 PM, Christoph Hellwig <h...@infradead.org> wrote: >> On Thu, Oct 05, 2017 at 09:32:33PM +0200, Ilya Dryomov wrote: >>> This is to avoid returning -EREMOTEIO in the following case: device >>> doesn't support WRITE SAME but scsi_disk::max_ws_blocks != 0, zeroout >>> is called with BLKDEV_ZERO_NOFALLBACK. Enter blkdev_issue_zeroout(), >>> bdev_write_zeroes_sectors() != 0, so we issue WRITE ZEROES. The >>> request fails with ILLEGAL REQUEST, sd_done() sets ->no_write_same and >>> updates queue_limits, ILLEGAL REQUEST is translated into -EREMOTEIO, >>> which is returned from submit_bio_wait(). Manual zeroing is not >>> allowed, so we must return an error, but it shouldn't be -EREMOTEIO if >>> queue_limits just got updated because of ILLEGAL REQUEST. Without this >>> conditional, we'd get >> >> Hmm. I think we'd better off to just do the before the retry loop: >> >> if (ret && try_write_zeroes) { >> if (!(flags & BLKDEV_ZERO_NOFALLBACK)) >> try_write_zeroes = false; >> goto retry; >> } >> ret = -EOPNOTSUPP; >> } > > This would unconditionally overwrite any WRITE ZEROS error. If we get > e.g. -EIO, and manual zeroing is not allowed, I don't think we want to > return -EOPNOTSUPP? > > Returning -EOPNOTSUPP to mean "can't zero using either method" doesn't > make sense to me, because manual zeroing is always supported, just not > always allowed. Ping... Thanks, Ilya
Re: false positive lockdep splat with loop device
On Tue, Oct 10, 2017 at 11:16 AM, Ilya Dryomov <idryo...@gmail.com> wrote: > On Thu, Oct 5, 2017 at 6:33 PM, Christoph Hellwig <h...@infradead.org> wrote: >> Does the patch below fix the warning for you? >> >> -- >> From 28aae7104425433d39e6142adcd5b88dc5b0ad5f Mon Sep 17 00:00:00 2001 >> From: Christoph Hellwig <h...@lst.de> >> Date: Thu, 5 Oct 2017 18:31:02 +0200 >> Subject: block: use DECLARE_COMPLETION_ONSTACK in submit_bio_wait >> >> This way we get our lockdep annotations right in case multiple layers >> in the stack use submit_bio_wait. >> >> It also happens to simplify the code by getting rid of the submit_bio_ret >> structure. >> >> Signed-off-by: Christoph Hellwig <h...@lst.de> >> --- >> block/bio.c | 19 +-- >> 1 file changed, 5 insertions(+), 14 deletions(-) >> >> diff --git a/block/bio.c b/block/bio.c >> index 8338304ea256..4e18e959fc0a 100644 >> --- a/block/bio.c >> +++ b/block/bio.c >> @@ -917,17 +917,9 @@ int bio_iov_iter_get_pages(struct bio *bio, struct >> iov_iter *iter) >> } >> EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages); >> >> -struct submit_bio_ret { >> - struct completion event; >> - int error; >> -}; >> - >> static void submit_bio_wait_endio(struct bio *bio) >> { >> - struct submit_bio_ret *ret = bio->bi_private; >> - >> - ret->error = blk_status_to_errno(bio->bi_status); >> - complete(>event); >> + complete(bio->bi_private); >> } >> >> /** >> @@ -943,16 +935,15 @@ static void submit_bio_wait_endio(struct bio *bio) >> */ >> int submit_bio_wait(struct bio *bio) >> { >> - struct submit_bio_ret ret; >> + DECLARE_COMPLETION_ONSTACK(done); >> >> - init_completion(); >> - bio->bi_private = >> + bio->bi_private = >> bio->bi_end_io = submit_bio_wait_endio; >> bio->bi_opf |= REQ_SYNC; >> submit_bio(bio); >> - wait_for_completion_io(); >> + wait_for_completion_io(); >> >> - return ret.error; >> + return blk_status_to_errno(bio->bi_status); >> } >> EXPORT_SYMBOL(submit_bio_wait); > > No, it doesn't -- the splat is a little bit more complicated, but > fundamentally the same thing. Easily triggered with generic/361 too, BTW. Thanks, Ilya
Re: [PATCH v2 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Fri, Oct 6, 2017 at 2:05 PM, Christoph Hellwig <h...@infradead.org> wrote: > On Thu, Oct 05, 2017 at 09:32:33PM +0200, Ilya Dryomov wrote: >> This is to avoid returning -EREMOTEIO in the following case: device >> doesn't support WRITE SAME but scsi_disk::max_ws_blocks != 0, zeroout >> is called with BLKDEV_ZERO_NOFALLBACK. Enter blkdev_issue_zeroout(), >> bdev_write_zeroes_sectors() != 0, so we issue WRITE ZEROES. The >> request fails with ILLEGAL REQUEST, sd_done() sets ->no_write_same and >> updates queue_limits, ILLEGAL REQUEST is translated into -EREMOTEIO, >> which is returned from submit_bio_wait(). Manual zeroing is not >> allowed, so we must return an error, but it shouldn't be -EREMOTEIO if >> queue_limits just got updated because of ILLEGAL REQUEST. Without this >> conditional, we'd get > > Hmm. I think we'd better off to just do the before the retry loop: > > if (ret && try_write_zeroes) { > if (!(flags & BLKDEV_ZERO_NOFALLBACK)) > try_write_zeroes = false; > goto retry; > } > ret = -EOPNOTSUPP; > } This would unconditionally overwrite any WRITE ZEROS error. If we get e.g. -EIO, and manual zeroing is not allowed, I don't think we want to return -EOPNOTSUPP? Returning -EOPNOTSUPP to mean "can't zero using either method" doesn't make sense to me, because manual zeroing is always supported, just not always allowed. Thanks, Ilya
Re: [PATCH v2 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Thu, Oct 5, 2017 at 7:13 PM, Christoph Hellwig <h...@infradead.org> wrote: > On Wed, Oct 04, 2017 at 05:03:16PM +0200, Ilya Dryomov wrote: >> sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to >> permit trying WRITE SAME on older SCSI devices, unless ->no_write_same >> is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE >> SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO: >> >> $ fallocate -zn -l 1k /dev/sdg >> fallocate: fallocate failed: Remote I/O error >> $ fallocate -zn -l 1k /dev/sdg # OK >> $ fallocate -zn -l 1k /dev/sdg # OK > > Can we wire this up for blktests somehow? This is covered by Darrick's generic/351, part of fstests blockdev group. > >> >> The following calls succeed because sd_done() sets ->no_write_same in >> response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing >> __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios. >> >> This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing >> and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is >> specified. For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if >> sd_done() has just set ->no_write_same thus indicating lack of offload >> support. >> >> Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") >> Cc: Christoph Hellwig <h...@lst.de> >> Cc: "Martin K. Petersen" <martin.peter...@oracle.com> >> Cc: Hannes Reinecke <h...@suse.com> >> Signed-off-by: Ilya Dryomov <idryo...@gmail.com> >> --- >> block/blk-lib.c | 41 +++-- >> 1 file changed, 31 insertions(+), 10 deletions(-) >> >> diff --git a/block/blk-lib.c b/block/blk-lib.c >> index 9d2ab8bba52a..17494275673e 100644 >> --- a/block/blk-lib.c >> +++ b/block/blk-lib.c >> @@ -321,12 +321,6 @@ static int __blkdev_issue_zero_pages(struct >> block_device *bdev, >> * Zero-fill a block range, either using hardware offload or by explicitly >> * writing zeroes to the device. >> * >> - * Note that this function may fail with -EOPNOTSUPP if the driver signals >> - * zeroing offload support, but the device fails to process the command >> (for >> - * some devices there is no non-destructive way to verify whether this >> - * operation is actually supported). In this case the caller should call >> - * retry the call to blkdev_issue_zeroout() and the fallback path will be >> used. >> - * >> * If a device is using logical block provisioning, the underlying space >> will >> * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. >> * >> @@ -370,18 +364,45 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout); >> int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, >> sector_t nr_sects, gfp_t gfp_mask, unsigned flags) >> { >> - int ret; >> - struct bio *bio = NULL; >> + int ret = 0; >> + sector_t bs_mask; >> + struct bio *bio; >> struct blk_plug plug; >> + bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev); >> + >> + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; >> + if ((sector | nr_sects) & bs_mask) >> + return -EINVAL; >> >> +retry: >> + bio = NULL; >> blk_start_plug(); >> - ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, >> - , flags); >> + if (try_write_zeroes) { >> + ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, >> + gfp_mask, , flags); >> + } else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { >> + ret = __blkdev_issue_zero_pages(bdev, sector, nr_sects, >> + gfp_mask, ); >> + } else if (!bdev_write_zeroes_sectors(bdev)) { >> + /* >> + * Manual zeroout is not allowed and either: >> + * - no zeroing offload support >> + * - zeroing offload support was indicated, but the device >> + * reported ILLEGAL REQUEST (for some devices there is no >> + * non-destructive way to verify whether WRITE ZEROES is >> + * actually supported) >> + */ >> + ret = -EOPNOTSUPP; > > I don't understand the conditional above this error return - if > we can't zero using either method we should always return an error. This is to avoid return
[PATCH v2 1/2] block: factor out __blkdev_issue_zero_pages()
blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-lib.c | 63 + 1 file changed, 37 insertions(+), 26 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 62240f8832ca..9d2ab8bba52a 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -274,6 +274,40 @@ static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects) return min(pages, (sector_t)BIO_MAX_PAGES); } +static int __blkdev_issue_zero_pages(struct block_device *bdev, + sector_t sector, sector_t nr_sects, gfp_t gfp_mask, + struct bio **biop) +{ + struct request_queue *q = bdev_get_queue(bdev); + struct bio *bio = *biop; + int bi_size = 0; + unsigned int sz; + + if (!q) + return -ENXIO; + + while (nr_sects != 0) { + bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), + gfp_mask); + bio->bi_iter.bi_sector = sector; + bio_set_dev(bio, bdev); + bio_set_op_attrs(bio, REQ_OP_WRITE, 0); + + while (nr_sects != 0) { + sz = min((sector_t) PAGE_SIZE, nr_sects << 9); + bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); + nr_sects -= bi_size >> 9; + sector += bi_size >> 9; + if (bi_size < sz) + break; + } + cond_resched(); + } + + *biop = bio; + return 0; +} + /** * __blkdev_issue_zeroout - generate number of zero filed write bios * @bdev: blockdev to issue @@ -304,9 +338,6 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, unsigned flags) { int ret; - int bi_size = 0; - struct bio *bio = *biop; - unsigned int sz; sector_t bs_mask; bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; @@ -316,30 +347,10 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask, biop, flags); if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK)) - goto out; - - ret = 0; - while (nr_sects != 0) { - bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), - gfp_mask); - bio->bi_iter.bi_sector = sector; - bio_set_dev(bio, bdev); - bio_set_op_attrs(bio, REQ_OP_WRITE, 0); - - while (nr_sects != 0) { - sz = min((sector_t) PAGE_SIZE, nr_sects << 9); - bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); - nr_sects -= bi_size >> 9; - sector += bi_size >> 9; - if (bi_size < sz) - break; - } - cond_resched(); - } + return ret; - *biop = bio; -out: - return ret; + return __blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask, +biop); } EXPORT_SYMBOL(__blkdev_issue_zeroout); -- 2.4.3
[PATCH v2 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to permit trying WRITE SAME on older SCSI devices, unless ->no_write_same is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO: $ fallocate -zn -l 1k /dev/sdg fallocate: fallocate failed: Remote I/O error $ fallocate -zn -l 1k /dev/sdg # OK $ fallocate -zn -l 1k /dev/sdg # OK The following calls succeed because sd_done() sets ->no_write_same in response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios. This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is specified. For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if sd_done() has just set ->no_write_same thus indicating lack of offload support. Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") Cc: Christoph Hellwig <h...@lst.de> Cc: "Martin K. Petersen" <martin.peter...@oracle.com> Cc: Hannes Reinecke <h...@suse.com> Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-lib.c | 41 +++-- 1 file changed, 31 insertions(+), 10 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 9d2ab8bba52a..17494275673e 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -321,12 +321,6 @@ static int __blkdev_issue_zero_pages(struct block_device *bdev, * Zero-fill a block range, either using hardware offload or by explicitly * writing zeroes to the device. * - * Note that this function may fail with -EOPNOTSUPP if the driver signals - * zeroing offload support, but the device fails to process the command (for - * some devices there is no non-destructive way to verify whether this - * operation is actually supported). In this case the caller should call - * retry the call to blkdev_issue_zeroout() and the fallback path will be used. - * * If a device is using logical block provisioning, the underlying space will * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. * @@ -370,18 +364,45 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout); int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, unsigned flags) { - int ret; - struct bio *bio = NULL; + int ret = 0; + sector_t bs_mask; + struct bio *bio; struct blk_plug plug; + bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev); + + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; + if ((sector | nr_sects) & bs_mask) + return -EINVAL; +retry: + bio = NULL; blk_start_plug(); - ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, - , flags); + if (try_write_zeroes) { + ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, + gfp_mask, , flags); + } else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) { + ret = __blkdev_issue_zero_pages(bdev, sector, nr_sects, + gfp_mask, ); + } else if (!bdev_write_zeroes_sectors(bdev)) { + /* +* Manual zeroout is not allowed and either: +* - no zeroing offload support +* - zeroing offload support was indicated, but the device +* reported ILLEGAL REQUEST (for some devices there is no +* non-destructive way to verify whether WRITE ZEROES is +* actually supported) +*/ + ret = -EOPNOTSUPP; + } if (ret == 0 && bio) { ret = submit_bio_wait(bio); bio_put(bio); } blk_finish_plug(); + if (ret && try_write_zeroes) { + try_write_zeroes = false; + goto retry; + } return ret; } -- 2.4.3
[PATCH v2 0/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
Hi Christoph, Martin, blkdev_issue_zeroout() now checks for any error. This required a minor refactor, so I dropped the stable tag, Jens can add it back if needed. v1 -> v2: - changed code flow in blkdev_issue_zeroout() according to Christoph's suggestion - this required adding additional checks to blkdev_issue_zeroout() and __blkdev_issue_zero_pages(), the latter is no longer void Previous patch(es) and discussion at https://marc.info/?l=linux-block=150471953327942=2 https://marc.info/?l=linux-block=150601399031909=2 Thanks, Ilya Ilya Dryomov (2): block: factor out __blkdev_issue_zero_pages() block: cope with WRITE ZEROES failing in blkdev_issue_zeroout() block/blk-lib.c | 104 1 file changed, 68 insertions(+), 36 deletions(-) -- 2.4.3
Re: [PATCH 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Tue, Oct 3, 2017 at 10:04 AM, Christoph Hellwig <h...@infradead.org> wrote: > On Thu, Sep 21, 2017 at 07:12:52PM +0200, Ilya Dryomov wrote: >> sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to >> permit trying WRITE SAME on older SCSI devices, unless ->no_write_same >> is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE >> SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO: >> >> $ fallocate -zn -l 1k /dev/sdg >> fallocate: fallocate failed: Remote I/O error >> $ fallocate -zn -l 1k /dev/sdg # OK >> $ fallocate -zn -l 1k /dev/sdg # OK >> >> The following calls succeed because sd_done() sets ->no_write_same in >> response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing >> __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios. >> >> This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing >> and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is >> specified. For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if >> sd_done() has just set ->no_write_same thus indicating lack of offload >> support. >> >> Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") >> Cc: Christoph Hellwig <h...@lst.de> >> Cc: "Martin K. Petersen" <martin.peter...@oracle.com> >> Cc: Hannes Reinecke <h...@suse.com> >> Signed-off-by: Ilya Dryomov <idryo...@gmail.com> >> --- >> block/blk-lib.c | 27 +-- >> 1 file changed, 21 insertions(+), 6 deletions(-) >> >> diff --git a/block/blk-lib.c b/block/blk-lib.c >> index 6b97feb71065..1cb402beb983 100644 >> --- a/block/blk-lib.c >> +++ b/block/blk-lib.c >> @@ -316,12 +316,6 @@ static void __blkdev_issue_zero_pages(struct >> block_device *bdev, >> * Zero-fill a block range, either using hardware offload or by explicitly >> * writing zeroes to the device. >> * >> - * Note that this function may fail with -EOPNOTSUPP if the driver signals >> - * zeroing offload support, but the device fails to process the command >> (for >> - * some devices there is no non-destructive way to verify whether this >> - * operation is actually supported). In this case the caller should call >> - * retry the call to blkdev_issue_zeroout() and the fallback path will be >> used. >> - * >> * If a device is using logical block provisioning, the underlying space >> will >> * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. >> * >> @@ -374,6 +368,27 @@ int blkdev_issue_zeroout(struct block_device *bdev, >> sector_t sector, >> , flags); >> if (ret == 0 && bio) { >> ret = submit_bio_wait(bio); >> + /* >> + * Fall back to a manual zeroout on any error, if allowed. >> + * >> + * Particularly, WRITE ZEROES may fail with -EREMOTEIO if the >> + * driver signals zeroing offload support, but the device >> + * fails to process the command (for some devices there is no >> + * non-destructive way to verify whether this operation is >> + * actually supported). >> + */ >> + if (ret && bio_op(bio) == REQ_OP_WRITE_ZEROES) { > > No need for the additional levels of indentation here. Also I > really do not like the logic, we shouldn't have to duplicate much > of the logic multiple times. > > I'd more go for something like (sketched in mail): > > bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev); > > retry: > bio = NULL; > blk_start_plug(); > if (try_write_zeroes) > ret = __blkdev_issue_write_zeroes(...) > else > ret = __blkdev_issue_zero_pages(...) > if (ret == 0 && bio) { > ret = submit_bio_wait(bio); > bio_put(bio); > } > blk_finish_plug(); > if (ret && try_write_zeroes) { > try_write_zeroes = false; > goto retry; > } Yeah, I didn't like the code flow either but we are going to duplicate some of it either way. In particular, !bdev_write_zeroes_sectors() -> ret = -EOPNOTSUPP part is still needed to avoid propagating -EREMOTEIO in BLKDEV_ZERO_NOFALLBACK case: if (try_write_zeroes) ret = __blkdev_issue_write_zeroes(...); else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) ret = __blkdev_issue_zero_pages(...); else if (!bdev_write_zeroes_sectors(bdev)) ret = -EOPNOTSUPP; bs_mask check from __blkdev_issue_zeroout() too. I'll post v2 in a few. Thanks, Ilya
Re: [PATCH 0/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
On Thu, Sep 21, 2017 at 7:12 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > Hi Christoph, Martin, > > blkdev_issue_zeroout() now checks for any error. This required a minor > refactor, so I dropped the stable tag, Jens can add it back if needed. > > Previous patch and discussion at > > https://marc.info/?l=linux-block=150471953327942=2 > > Thanks, > > Ilya > > > Ilya Dryomov (2): > block: factor out __blkdev_issue_zero_pages() > block: cope with WRITE ZEROES failing in blkdev_issue_zeroout() > > block/blk-lib.c | 85 > +++-- > 1 file changed, 53 insertions(+), 32 deletions(-) Ping... Christoph, could you please take a look? Thanks, Ilya
[PATCH 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to permit trying WRITE SAME on older SCSI devices, unless ->no_write_same is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO: $ fallocate -zn -l 1k /dev/sdg fallocate: fallocate failed: Remote I/O error $ fallocate -zn -l 1k /dev/sdg # OK $ fallocate -zn -l 1k /dev/sdg # OK The following calls succeed because sd_done() sets ->no_write_same in response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios. This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is specified. For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if sd_done() has just set ->no_write_same thus indicating lack of offload support. Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") Cc: Christoph Hellwig <h...@lst.de> Cc: "Martin K. Petersen" <martin.peter...@oracle.com> Cc: Hannes Reinecke <h...@suse.com> Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-lib.c | 27 +-- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 6b97feb71065..1cb402beb983 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -316,12 +316,6 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev, * Zero-fill a block range, either using hardware offload or by explicitly * writing zeroes to the device. * - * Note that this function may fail with -EOPNOTSUPP if the driver signals - * zeroing offload support, but the device fails to process the command (for - * some devices there is no non-destructive way to verify whether this - * operation is actually supported). In this case the caller should call - * retry the call to blkdev_issue_zeroout() and the fallback path will be used. - * * If a device is using logical block provisioning, the underlying space will * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. * @@ -374,6 +368,27 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, , flags); if (ret == 0 && bio) { ret = submit_bio_wait(bio); + /* +* Fall back to a manual zeroout on any error, if allowed. +* +* Particularly, WRITE ZEROES may fail with -EREMOTEIO if the +* driver signals zeroing offload support, but the device +* fails to process the command (for some devices there is no +* non-destructive way to verify whether this operation is +* actually supported). +*/ + if (ret && bio_op(bio) == REQ_OP_WRITE_ZEROES) { + if (flags & BLKDEV_ZERO_NOFALLBACK) { + if (!bdev_write_zeroes_sectors(bdev)) + ret = -EOPNOTSUPP; + } else { + bio_put(bio); + bio = NULL; + __blkdev_issue_zero_pages(bdev, sector, + nr_sects, gfp_mask, ); + ret = submit_bio_wait(bio); + } + } bio_put(bio); } blk_finish_plug(); -- 2.4.3
[PATCH 0/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()
Hi Christoph, Martin, blkdev_issue_zeroout() now checks for any error. This required a minor refactor, so I dropped the stable tag, Jens can add it back if needed. Previous patch and discussion at https://marc.info/?l=linux-block=150471953327942=2 Thanks, Ilya Ilya Dryomov (2): block: factor out __blkdev_issue_zero_pages() block: cope with WRITE ZEROES failing in blkdev_issue_zeroout() block/blk-lib.c | 85 +++-- 1 file changed, 53 insertions(+), 32 deletions(-) -- 2.4.3
[PATCH 1/2] block: factor out __blkdev_issue_zero_pages()
blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-lib.c | 58 +++-- 1 file changed, 32 insertions(+), 26 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 62240f8832ca..6b97feb71065 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -274,6 +274,35 @@ static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects) return min(pages, (sector_t)BIO_MAX_PAGES); } +static void __blkdev_issue_zero_pages(struct block_device *bdev, + sector_t sector, sector_t nr_sects, gfp_t gfp_mask, + struct bio **biop) +{ + struct bio *bio = *biop; + int bi_size = 0; + unsigned int sz; + + while (nr_sects != 0) { + bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), + gfp_mask); + bio->bi_iter.bi_sector = sector; + bio_set_dev(bio, bdev); + bio_set_op_attrs(bio, REQ_OP_WRITE, 0); + + while (nr_sects != 0) { + sz = min((sector_t) PAGE_SIZE, nr_sects << 9); + bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); + nr_sects -= bi_size >> 9; + sector += bi_size >> 9; + if (bi_size < sz) + break; + } + cond_resched(); + } + + *biop = bio; +} + /** * __blkdev_issue_zeroout - generate number of zero filed write bios * @bdev: blockdev to issue @@ -304,9 +333,6 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, unsigned flags) { int ret; - int bi_size = 0; - struct bio *bio = *biop; - unsigned int sz; sector_t bs_mask; bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; @@ -316,30 +342,10 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask, biop, flags); if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK)) - goto out; - - ret = 0; - while (nr_sects != 0) { - bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects), - gfp_mask); - bio->bi_iter.bi_sector = sector; - bio_set_dev(bio, bdev); - bio_set_op_attrs(bio, REQ_OP_WRITE, 0); - - while (nr_sects != 0) { - sz = min((sector_t) PAGE_SIZE, nr_sects << 9); - bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0); - nr_sects -= bi_size >> 9; - sector += bi_size >> 9; - if (bi_size < sz) - break; - } - cond_resched(); - } + return ret; - *biop = bio; -out: - return ret; + __blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask, biop); + return 0; } EXPORT_SYMBOL(__blkdev_issue_zeroout); -- 2.4.3
Re: [PATCH] block: cope with WRITE SAME failing in blkdev_issue_zeroout()
On Tue, Sep 19, 2017 at 10:32 PM, Christoph Hellwig <h...@lst.de> wrote: > On Wed, Sep 06, 2017 at 07:38:10PM +0200, Ilya Dryomov wrote: >> sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to >> permit trying WRITE SAME on older SCSI devices, unless ->no_write_same >> is set. This means blkdev_issue_zeroout() must cope with WRITE SAME >> failing with BLK_STS_TARGET/-EREMOTEIO and explicitly write zeroes, >> unless BLKDEV_ZERO_NOFALLBACK is specified. >> >> Commit 71027e97d796 ("block: stop using discards for zeroing") added >> the following to __blkdev_issue_zeroout() comment: >> >> "Note that this function may fail with -EOPNOTSUPP if the driver signals >> zeroing offload support, but the device fails to process the command (for >> some devices there is no non-destructive way to verify whether this >> operation is actually supported). In this case the caller should call >> retry the call to blkdev_issue_zeroout() and the fallback path will be >> used." >> >> But __blkdev_issue_zeroout() doesn't fail in this case: if WRITE SAME >> support is indicated, a REQ_OP_WRITE_ZEROES bio is built and returned >> to blkdev_issue_zeroout(). -EREMOTEIO is then propagated up: >> >> $ fallocate -zn -l 1k /dev/sdg >> fallocate: fallocate failed: Remote I/O error >> $ fallocate -zn -l 1k /dev/sdg # OK >> >> (The second fallocate(1) succeeds because sd_done() sets ->no_write_same >> in response to a sense that would become BLK_STS_TARGET.) >> >> Retry __blkdev_issue_zeroout() if the I/O fails with -EREMOTEIO. This >> is roughly what we did until 4.12, sans BLKDEV_ZERO_NOFALLBACK knob. > > I'm really not sure we should check for -EREMOTEIO specifically, but > Martin who is more familiar with the SCSI code might be able to > correct me, I'd feel safer about checking for any error which is > what the old code did. Yeah, I pondered that. The old code did check for any error, but it wouldn't attempt another WRITE SAME after a failure. To do that here, we'd need to refactor __blkdev_issue_zeroout(); as it is, if we retry __blkdev_issue_zeroout() on a random error, it would happily go the REQ_OP_WRITE_ZEROES way again. Martin, what do you think? Thanks, Ilya
Re: [PATCH] block: cope with WRITE SAME failing in blkdev_issue_zeroout()
On Wed, Sep 13, 2017 at 11:50 AM, Ilya Dryomov <idryo...@gmail.com> wrote: > On Wed, Sep 6, 2017 at 7:38 PM, Ilya Dryomov <idryo...@gmail.com> wrote: >> sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to >> permit trying WRITE SAME on older SCSI devices, unless ->no_write_same >> is set. This means blkdev_issue_zeroout() must cope with WRITE SAME >> failing with BLK_STS_TARGET/-EREMOTEIO and explicitly write zeroes, >> unless BLKDEV_ZERO_NOFALLBACK is specified. >> >> Commit 71027e97d796 ("block: stop using discards for zeroing") added >> the following to __blkdev_issue_zeroout() comment: >> >> "Note that this function may fail with -EOPNOTSUPP if the driver signals >> zeroing offload support, but the device fails to process the command (for >> some devices there is no non-destructive way to verify whether this >> operation is actually supported). In this case the caller should call >> retry the call to blkdev_issue_zeroout() and the fallback path will be >> used." >> >> But __blkdev_issue_zeroout() doesn't fail in this case: if WRITE SAME >> support is indicated, a REQ_OP_WRITE_ZEROES bio is built and returned >> to blkdev_issue_zeroout(). -EREMOTEIO is then propagated up: >> >> $ fallocate -zn -l 1k /dev/sdg >> fallocate: fallocate failed: Remote I/O error >> $ fallocate -zn -l 1k /dev/sdg # OK >> >> (The second fallocate(1) succeeds because sd_done() sets ->no_write_same >> in response to a sense that would become BLK_STS_TARGET.) >> >> Retry __blkdev_issue_zeroout() if the I/O fails with -EREMOTEIO. This >> is roughly what we did until 4.12, sans BLKDEV_ZERO_NOFALLBACK knob. >> >> Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") >> Cc: Christoph Hellwig <h...@lst.de> >> Cc: "Martin K. Petersen" <martin.peter...@oracle.com> >> Cc: Hannes Reinecke <h...@suse.com> >> Cc: sta...@vger.kernel.org # 4.12+ >> Signed-off-by: Ilya Dryomov <idryo...@gmail.com> >> --- >> block/blk-lib.c | 49 + >> 1 file changed, 33 insertions(+), 16 deletions(-) >> >> diff --git a/block/blk-lib.c b/block/blk-lib.c >> index 3fe0aec90597..876b5478c1a2 100644 >> --- a/block/blk-lib.c >> +++ b/block/blk-lib.c >> @@ -287,12 +287,6 @@ static unsigned int >> __blkdev_sectors_to_bio_pages(sector_t nr_sects) >> * Zero-fill a block range, either using hardware offload or by explicitly >> * writing zeroes to the device. >> * >> - * Note that this function may fail with -EOPNOTSUPP if the driver signals >> - * zeroing offload support, but the device fails to process the command >> (for >> - * some devices there is no non-destructive way to verify whether this >> - * operation is actually supported). In this case the caller should call >> - * retry the call to blkdev_issue_zeroout() and the fallback path will be >> used. >> - * >> * If a device is using logical block provisioning, the underlying space >> will >> * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. >> * >> @@ -343,6 +337,34 @@ int __blkdev_issue_zeroout(struct block_device *bdev, >> sector_t sector, >> } >> EXPORT_SYMBOL(__blkdev_issue_zeroout); >> >> +/* >> + * This function may fail with -EREMOTEIO if the driver signals zeroing >> + * offload support, but the device fails to process the command (for >> + * some devices there is no non-destructive way to verify whether this >> + * operation is actually supported). In this case the caller should >> + * retry and the fallback path in __blkdev_issue_zeroout() will be used >> + * unless %flags contains BLKDEV_ZERO_NOFALLBACK. >> + */ >> +static int __blkdev_issue_zeroout_wait(struct block_device *bdev, >> + sector_t sector, sector_t nr_sects, >> + gfp_t gfp_mask, unsigned flags) >> +{ >> + int ret; >> + struct bio *bio = NULL; >> + struct blk_plug plug; >> + >> + blk_start_plug(); >> + ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, >> + , flags); >> + if (ret == 0 && bio) { >> + ret = submit_bio_wait(bio); >> + bio_put(bio); >> + } >> + blk_finish_plug(); >> + >> + return ret; >> +} >> + >> /** >> * blkdev_issue_zeroout -
Re: [PATCH] block: cope with WRITE SAME failing in blkdev_issue_zeroout()
On Wed, Sep 6, 2017 at 7:38 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to > permit trying WRITE SAME on older SCSI devices, unless ->no_write_same > is set. This means blkdev_issue_zeroout() must cope with WRITE SAME > failing with BLK_STS_TARGET/-EREMOTEIO and explicitly write zeroes, > unless BLKDEV_ZERO_NOFALLBACK is specified. > > Commit 71027e97d796 ("block: stop using discards for zeroing") added > the following to __blkdev_issue_zeroout() comment: > > "Note that this function may fail with -EOPNOTSUPP if the driver signals > zeroing offload support, but the device fails to process the command (for > some devices there is no non-destructive way to verify whether this > operation is actually supported). In this case the caller should call > retry the call to blkdev_issue_zeroout() and the fallback path will be > used." > > But __blkdev_issue_zeroout() doesn't fail in this case: if WRITE SAME > support is indicated, a REQ_OP_WRITE_ZEROES bio is built and returned > to blkdev_issue_zeroout(). -EREMOTEIO is then propagated up: > > $ fallocate -zn -l 1k /dev/sdg > fallocate: fallocate failed: Remote I/O error > $ fallocate -zn -l 1k /dev/sdg # OK > > (The second fallocate(1) succeeds because sd_done() sets ->no_write_same > in response to a sense that would become BLK_STS_TARGET.) > > Retry __blkdev_issue_zeroout() if the I/O fails with -EREMOTEIO. This > is roughly what we did until 4.12, sans BLKDEV_ZERO_NOFALLBACK knob. > > Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") > Cc: Christoph Hellwig <h...@lst.de> > Cc: "Martin K. Petersen" <martin.peter...@oracle.com> > Cc: Hannes Reinecke <h...@suse.com> > Cc: sta...@vger.kernel.org # 4.12+ > Signed-off-by: Ilya Dryomov <idryo...@gmail.com> > --- > block/blk-lib.c | 49 + > 1 file changed, 33 insertions(+), 16 deletions(-) > > diff --git a/block/blk-lib.c b/block/blk-lib.c > index 3fe0aec90597..876b5478c1a2 100644 > --- a/block/blk-lib.c > +++ b/block/blk-lib.c > @@ -287,12 +287,6 @@ static unsigned int > __blkdev_sectors_to_bio_pages(sector_t nr_sects) > * Zero-fill a block range, either using hardware offload or by explicitly > * writing zeroes to the device. > * > - * Note that this function may fail with -EOPNOTSUPP if the driver signals > - * zeroing offload support, but the device fails to process the command (for > - * some devices there is no non-destructive way to verify whether this > - * operation is actually supported). In this case the caller should call > - * retry the call to blkdev_issue_zeroout() and the fallback path will be > used. > - * > * If a device is using logical block provisioning, the underlying space > will > * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. > * > @@ -343,6 +337,34 @@ int __blkdev_issue_zeroout(struct block_device *bdev, > sector_t sector, > } > EXPORT_SYMBOL(__blkdev_issue_zeroout); > > +/* > + * This function may fail with -EREMOTEIO if the driver signals zeroing > + * offload support, but the device fails to process the command (for > + * some devices there is no non-destructive way to verify whether this > + * operation is actually supported). In this case the caller should > + * retry and the fallback path in __blkdev_issue_zeroout() will be used > + * unless %flags contains BLKDEV_ZERO_NOFALLBACK. > + */ > +static int __blkdev_issue_zeroout_wait(struct block_device *bdev, > + sector_t sector, sector_t nr_sects, > + gfp_t gfp_mask, unsigned flags) > +{ > + int ret; > + struct bio *bio = NULL; > + struct blk_plug plug; > + > + blk_start_plug(); > + ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, > + , flags); > + if (ret == 0 && bio) { > + ret = submit_bio_wait(bio); > + bio_put(bio); > + } > + blk_finish_plug(); > + > + return ret; > +} > + > /** > * blkdev_issue_zeroout - zero-fill a block range > * @bdev: blockdev to write > @@ -360,17 +382,12 @@ int blkdev_issue_zeroout(struct block_device *bdev, > sector_t sector, > sector_t nr_sects, gfp_t gfp_mask, unsigned flags) > { > int ret; > - struct bio *bio = NULL; > - struct blk_plug plug; > > - blk_start_plug(); > - ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, > -
Re: [PATCH V2 0/3] block/loop: handle discard/zeroout error
Hi Shaohua, You wrote: > BTW: blkdev_issue_zeroout retries if we immediately find the device doesn't > support zeroout, but it doesn't retry if submit_bio_wait returns -EOPNOTSUPP. > Is this correct behavior? I sent a patch for that yesterday, see "[PATCH] block: cope with WRITE SAME failing in blkdev_issue_zeroout()" on linux-block. It checks for -EREMOTEIO, because that's what I hit, but I wonder if it should check for -EOPNOTSUPP from submit_bio_wait() as well. Can you think of a case where, given bdev_write_zeroes_sectors() != 0, submit_bio_wait() would fail with -EOPNOTSUPP and we would want to do explicit zeroing? Please excuse broken threading. Thanks, Ilya
[PATCH] block: cope with WRITE SAME failing in blkdev_issue_zeroout()
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to permit trying WRITE SAME on older SCSI devices, unless ->no_write_same is set. This means blkdev_issue_zeroout() must cope with WRITE SAME failing with BLK_STS_TARGET/-EREMOTEIO and explicitly write zeroes, unless BLKDEV_ZERO_NOFALLBACK is specified. Commit 71027e97d796 ("block: stop using discards for zeroing") added the following to __blkdev_issue_zeroout() comment: "Note that this function may fail with -EOPNOTSUPP if the driver signals zeroing offload support, but the device fails to process the command (for some devices there is no non-destructive way to verify whether this operation is actually supported). In this case the caller should call retry the call to blkdev_issue_zeroout() and the fallback path will be used." But __blkdev_issue_zeroout() doesn't fail in this case: if WRITE SAME support is indicated, a REQ_OP_WRITE_ZEROES bio is built and returned to blkdev_issue_zeroout(). -EREMOTEIO is then propagated up: $ fallocate -zn -l 1k /dev/sdg fallocate: fallocate failed: Remote I/O error $ fallocate -zn -l 1k /dev/sdg # OK (The second fallocate(1) succeeds because sd_done() sets ->no_write_same in response to a sense that would become BLK_STS_TARGET.) Retry __blkdev_issue_zeroout() if the I/O fails with -EREMOTEIO. This is roughly what we did until 4.12, sans BLKDEV_ZERO_NOFALLBACK knob. Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing") Cc: Christoph Hellwig <h...@lst.de> Cc: "Martin K. Petersen" <martin.peter...@oracle.com> Cc: Hannes Reinecke <h...@suse.com> Cc: sta...@vger.kernel.org # 4.12+ Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-lib.c | 49 + 1 file changed, 33 insertions(+), 16 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 3fe0aec90597..876b5478c1a2 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -287,12 +287,6 @@ static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects) * Zero-fill a block range, either using hardware offload or by explicitly * writing zeroes to the device. * - * Note that this function may fail with -EOPNOTSUPP if the driver signals - * zeroing offload support, but the device fails to process the command (for - * some devices there is no non-destructive way to verify whether this - * operation is actually supported). In this case the caller should call - * retry the call to blkdev_issue_zeroout() and the fallback path will be used. - * * If a device is using logical block provisioning, the underlying space will * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. * @@ -343,6 +337,34 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, } EXPORT_SYMBOL(__blkdev_issue_zeroout); +/* + * This function may fail with -EREMOTEIO if the driver signals zeroing + * offload support, but the device fails to process the command (for + * some devices there is no non-destructive way to verify whether this + * operation is actually supported). In this case the caller should + * retry and the fallback path in __blkdev_issue_zeroout() will be used + * unless %flags contains BLKDEV_ZERO_NOFALLBACK. + */ +static int __blkdev_issue_zeroout_wait(struct block_device *bdev, + sector_t sector, sector_t nr_sects, + gfp_t gfp_mask, unsigned flags) +{ + int ret; + struct bio *bio = NULL; + struct blk_plug plug; + + blk_start_plug(); + ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, + , flags); + if (ret == 0 && bio) { + ret = submit_bio_wait(bio); + bio_put(bio); + } + blk_finish_plug(); + + return ret; +} + /** * blkdev_issue_zeroout - zero-fill a block range * @bdev: blockdev to write @@ -360,17 +382,12 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, unsigned flags) { int ret; - struct bio *bio = NULL; - struct blk_plug plug; - blk_start_plug(); - ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, - , flags); - if (ret == 0 && bio) { - ret = submit_bio_wait(bio); - bio_put(bio); - } - blk_finish_plug(); + ret = __blkdev_issue_zeroout_wait(bdev, sector, nr_sects, + gfp_mask, flags); + if (ret == -EREMOTEIO) + ret = __blkdev_issue_zeroout_wait(bdev, sector, nr_sects, + gfp_mask, flags); return ret; } -- 2.4.3
Re: [PATCH v2 2/2] nbd: don't leak nbd_config
On Tue, May 23, 2017 at 5:49 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > nbd_config is allocated in nbd_alloc_config(), but never freed. > > Fixes: 5ea8d10802ec ("nbd: separate out the config information") > Signed-off-by: Ilya Dryomov <idryo...@gmail.com> > --- > drivers/block/nbd.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c > index e725d8d5ab0b..f3f191ba8ca4 100644 > --- a/drivers/block/nbd.c > +++ b/drivers/block/nbd.c > @@ -1021,6 +1021,7 @@ static void nbd_config_put(struct nbd_device *nbd) > } > kfree(config->socks); > } > + kfree(nbd->config); > nbd->config = NULL; > > nbd->tag_set.timeout = 0; Hi Josef, Ping -- I think these should go into 4.12-rc, Jens is probably waiting for your OK. Thanks, Ilya
[PATCH v2 1/2] nbd: nbd_reset() call in nbd_dev_add() is redundant
There is nothing to clear -- nbd_device has just been allocated. Fold nbd_reset() into its other caller, nbd_config_put(). Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- drivers/block/nbd.c | 14 -- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 9a7bb2c29447..e725d8d5ab0b 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -937,14 +937,6 @@ static int nbd_reconnect_socket(struct nbd_device *nbd, unsigned long arg) return -ENOSPC; } -/* Reset all properties of an NBD device */ -static void nbd_reset(struct nbd_device *nbd) -{ - nbd->config = NULL; - nbd->tag_set.timeout = 0; - queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->queue); -} - static void nbd_bdev_reset(struct block_device *bdev) { if (bdev->bd_openers > 1) @@ -1029,7 +1021,10 @@ static void nbd_config_put(struct nbd_device *nbd) } kfree(config->socks); } - nbd_reset(nbd); + nbd->config = NULL; + + nbd->tag_set.timeout = 0; + queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->queue); mutex_unlock(>config_lock); nbd_put(nbd); @@ -1483,7 +1478,6 @@ static int nbd_dev_add(int index) disk->fops = _fops; disk->private_data = nbd; sprintf(disk->disk_name, "nbd%d", index); - nbd_reset(nbd); add_disk(disk); nbd_total_devices++; return index; -- 2.4.3
[PATCH v2 2/2] nbd: don't leak nbd_config
nbd_config is allocated in nbd_alloc_config(), but never freed. Fixes: 5ea8d10802ec ("nbd: separate out the config information") Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- drivers/block/nbd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index e725d8d5ab0b..f3f191ba8ca4 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -1021,6 +1021,7 @@ static void nbd_config_put(struct nbd_device *nbd) } kfree(config->socks); } + kfree(nbd->config); nbd->config = NULL; nbd->tag_set.timeout = 0; -- 2.4.3
Re: [PATCH] nbd: don't leak nbd_config
On Tue, May 23, 2017 at 4:16 PM, Jens Axboe <ax...@fb.com> wrote: > On 05/23/2017 08:14 AM, Josef Bacik wrote: >> On Tue, May 23, 2017 at 12:38:31PM +0200, Ilya Dryomov wrote: >>> nbd_config is allocated in nbd_alloc_config(), but never freed. >>> >>> Fixes: 5ea8d10802ec ("nbd: separate out the config information") >>> Signed-off-by: Ilya Dryomov <idryo...@gmail.com> >>> --- >>> drivers/block/nbd.c | 1 + >>> 1 file changed, 1 insertion(+) >>> >> >> Oops, thanks Ilya >> >> Reviewed-by: Josef Bacik <jba...@fb.com> > > Since config is referenced, why isn't this done in nbd_config_put() > instead? Josef's 5ea8d10802ec added that nbd->config = NULL to nbd_reset(), so I followed his lead. It could be done nbd_config_put() -- nbd_reset() call in nbd_dev_add() looks like a no-op to me. Josef? Thanks, Ilya
[PATCH] nbd: don't leak nbd_config
nbd_config is allocated in nbd_alloc_config(), but never freed. Fixes: 5ea8d10802ec ("nbd: separate out the config information") Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- drivers/block/nbd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 9a7bb2c29447..882fb9efbab1 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -940,6 +940,7 @@ static int nbd_reconnect_socket(struct nbd_device *nbd, unsigned long arg) /* Reset all properties of an NBD device */ static void nbd_reset(struct nbd_device *nbd) { + kfree(nbd->config); nbd->config = NULL; nbd->tag_set.timeout = 0; queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->queue); -- 2.4.3
[PATCH] block: get rid of blk_integrity_revalidate()
Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") introduced blk_integrity_revalidate(), which seems to assume ownership of the stable pages flag and unilaterally clears it if no blk_integrity profile is registered: if (bi->profile) disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; else disk->queue->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES; It's called from revalidate_disk() and rescan_partitions(), making it impossible to enable stable pages for drivers that support partitions and don't use blk_integrity: while the call in revalidate_disk() can be trivially worked around (see zram, which doesn't support partitions and hence gets away with zram_revalidate_disk()), rescan_partitions() can be triggered from userspace at any time. This breaks rbd, where the ceph messenger is responsible for generating/verifying CRCs. Since blk_integrity_{un,}register() "must" be used for (un)registering the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES setting there. This way drivers that call blk_integrity_register() and use integrity infrastructure won't interfere with drivers that don't but still want stable pages. Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") Cc: "Martin K. Petersen" <martin.peter...@oracle.com> Cc: Dan Williams <dan.j.willi...@intel.com> Cc: Christoph Hellwig <h...@lst.de> Cc: Mike Snitzer <snit...@redhat.com> Cc: sta...@vger.kernel.org # 4.4+, needs backporting Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- I don't have any integrity-capable disks or nvme separate-metadata cards, so I couldn't test this as well as I wanted to. 25520d55cdb6 went in with some nvme work, but Martin recalls that it may have had something to do with integrity-capable dm arrays losing the integrity capability at runtime. Previous discussion at http://www.spinics.net/lists/ceph-devel/msg35413.html --- block/blk-integrity.c | 19 ++- block/partition-generic.c | 1 - fs/block_dev.c| 1 - include/linux/genhd.h | 2 -- 4 files changed, 2 insertions(+), 21 deletions(-) diff --git a/block/blk-integrity.c b/block/blk-integrity.c index 9f0ff5ba4f84..35c5af1ea068 100644 --- a/block/blk-integrity.c +++ b/block/blk-integrity.c @@ -417,7 +417,7 @@ void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template bi->tuple_size = template->tuple_size; bi->tag_size = template->tag_size; - blk_integrity_revalidate(disk); + disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; } EXPORT_SYMBOL(blk_integrity_register); @@ -430,26 +430,11 @@ EXPORT_SYMBOL(blk_integrity_register); */ void blk_integrity_unregister(struct gendisk *disk) { - blk_integrity_revalidate(disk); + disk->queue->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES; memset(>queue->integrity, 0, sizeof(struct blk_integrity)); } EXPORT_SYMBOL(blk_integrity_unregister); -void blk_integrity_revalidate(struct gendisk *disk) -{ - struct blk_integrity *bi = >queue->integrity; - - if (!(disk->flags & GENHD_FL_UP)) - return; - - if (bi->profile) - disk->queue->backing_dev_info->capabilities |= - BDI_CAP_STABLE_WRITES; - else - disk->queue->backing_dev_info->capabilities &= - ~BDI_CAP_STABLE_WRITES; -} - void blk_integrity_add(struct gendisk *disk) { if (kobject_init_and_add(>integrity_kobj, _ktype, diff --git a/block/partition-generic.c b/block/partition-generic.c index 7afb9907821f..0171a2faad68 100644 --- a/block/partition-generic.c +++ b/block/partition-generic.c @@ -497,7 +497,6 @@ int rescan_partitions(struct gendisk *disk, struct block_device *bdev) if (disk->fops->revalidate_disk) disk->fops->revalidate_disk(disk); - blk_integrity_revalidate(disk); check_disk_size_change(disk, bdev); bdev->bd_invalidated = 0; if (!get_capacity(disk) || !(state = check_partition(disk, bdev))) diff --git a/fs/block_dev.c b/fs/block_dev.c index 2eca00ec4370..56039dfbc674 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1451,7 +1451,6 @@ int revalidate_disk(struct gendisk *disk) if (disk->fops->revalidate_disk) ret = disk->fops->revalidate_disk(disk); - blk_integrity_revalidate(disk); bdev = bdget_disk(disk, 0); if (!bdev) return ret; diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 76f39754e7b0..76d6a1cd4153 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -722,11 +722,9 @@ static inline void par
Re: blk_integrity_revalidate() clears BDI_CAP_STABLE_WRITES
On Thu, Mar 2, 2017 at 4:24 AM, Martin K. Petersen <martin.peter...@oracle.com> wrote: >>>>>> "Ilya" == Ilya Dryomov <idryo...@gmail.com> writes: > > Ilya, > > Ilya> Given the above, I'm not sure what the baseline is -- > Ilya> blk_integrity code isn't invoked for data-only lbafs. Could nvme > Ilya> folks please look at this? rbd regression caused by integrity > Ilya> revalidate change goes back to 4.4 and I'd really like to get it > Ilya> fixed. The proposed patch is attached to my earlier reply on > Ilya> linux-block. > > I've had a couple of fires to attend to this week. I'll try and give > your patch a spin on my FC setup tomorrow or Friday. Hi Martin, Ping... Did you get a chance to try it? Thanks, Ilya
Re: blk_integrity_revalidate() clears BDI_CAP_STABLE_WRITES
On Fri, Feb 24, 2017 at 12:49 AM, Martin K. Petersen <martin.peter...@oracle.com> wrote: >>>>>> "Ilya" == Ilya Dryomov <idryo...@gmail.com> writes: > > Ilya, > > Ilya> Well, blk_integrity_revalidate() doesn't clear the profile, it > Ilya> just clears the stable pages flag. Whoever calls > Ilya> blk_integrity_unregister() to clear the profile can also clear the > Ilya> stable pages flag -- why not let blk_integrity_unregister() clear > Ilya> the flag like I suggested? > > That's what it used to do. > > blk_integrity_revalidate() was obviously introduced to overcome some > problem. Unfortunately, I can't recall what that was and Google isn't > being particularly helpful. I suspect it was either in the NVDIMM or > NVMe camps since that's where the churn was. Adding more people from 25520d55cdb6 -- does anyone remember the story behind blk_integrity_revalidate()? > > I don't have a problem with your patch as long as we're sure there are > no regressions. I would carry the gendisk check over, though. I spent quite some time trying to do basic tests on nvme, since it uses both nop_profile and real integrity profiles. Unfortunately I don't have access to an nvme card with separate-metadata capability and upstream qemu driver doesn't support any metadata capabilities at all, so I turned to https://github.com/OpenChannelSSD/qemu-nvme. It lets you configure mc and multiple lbafs, but commit ff646a5841f743fdd9cfef138ac6be1f0ba4fbfb Author: Keith Busch <keith.bu...@intel.com> Thu Oct 23 20:02:44 2014 As part of this, I am removing all meta-data support. It's not well supported in any immediatly available operating system, so the code is not well tested. In the end I got it into a semi-working state by checking out an old version and dropping some sanity checks. I formatted the namespace in a number of ways -- bdi/stable_pages_required and integrity/format behaved as expected ("none", "nop", "T10-DIF-TYPE1-CRC"). Given the above, I'm not sure what the baseline is -- blk_integrity code isn't invoked for data-only lbafs. Could nvme folks please look at this? rbd regression caused by integrity revalidate change goes back to 4.4 and I'd really like to get it fixed. The proposed patch is attached to my earlier reply on linux-block. Thanks, Ilya
Re: blk_integrity_revalidate() clears BDI_CAP_STABLE_WRITES
On Wed, Feb 22, 2017 at 5:41 AM, Martin K. Petersen <martin.peter...@oracle.com> wrote: >>>>>> "Ilya" == Ilya Dryomov <idryo...@gmail.com> writes: > > Ilya, > > Ilya> could you please explain blk_integrity_revalidate() and > Ilya> its GENHD_FL_UP check in particular? We have the queue, > Ilya> bi->profile can't be NULL after blk_integrity_register(), and > Ilya> since the latter "must" be used for registering the profile with > Ilya> the block layer, wouldn't the following be sufficient for > Ilya> blk_integrity users? > > IIrc, the FL_UP check fixed a registration problem in the nvme driver. Did it have something to do with a NULL disk->queue? > > The rationale behind revalidate was that we need to handle devices which > lose the integrity capability at runtime (i.e. a integrity-enabled DM > device is extended with a non-cable drive forcing the feature to be > turned off). The clearing of the integrity profile is more important in > that case than zapping the stable pages flag. But that was the original > reason for not just ORing BDI_CAP_STABLE_WRITES. > > I don't have a huge problem with keeping stable pages on if a device > suddenly stops being integrity capable. However, I'd like to understand > your use case a bit better. Well, blk_integrity_revalidate() doesn't clear the profile, it just clears the stable pages flag. Whoever calls blk_integrity_unregister() to clear the profile can also clear the stable pages flag -- why not let blk_integrity_unregister() clear the flag like I suggested? > > Ilya> The alternative seems to be to set up a bogus > Ilya> blk_integrity_profile (nop_profile won't do -- this one would have > Ilya> to be truly bogus w/ NULL *_fn) under BLK_DEV_INTEGRITY ifdefs and > Ilya> hope that nothing breaks. > > Can you point me to the relevant code on your end? This code doesn't exist yet -- it's just something I thought I'd have to do if my patch or something along those lines isn't accepted. There is the nop_profile with stub *_fn callbacks, which is used by nvme and nvdimm to make bio_integrity_enabled() return true, so that per-interval buffers are allocated in bio_integrity_prep(). "I want some space for per-interval metadata, but no integrity checksums" is use case there. In the rbd case, we don't want that. We just want to set the stable pages flag and make sure it's not reset by blk_integrity code. So, if blk_integrity_revalidate() isn't fixed, rbd will need to set up a bogus profile with NULL *_fn callbacks to make bio_integrity_enabled() return false. This is obviously fragile, because blk_get_integrity() will no longer return NULL... We are setting BDI_CAP_STABLE_WRITES in rbd_init_disk(), see commit bae818ee1577 ("rbd: require stable pages if message data CRCs are enabled"). The CRCs are generated in write_partial_message_data() and verified in read_partial_msg_data(). I'm attaching the patch I had in mind. Thanks, Ilya From 24a0cf8bc437a65b3986e9ab25cf2f05e6bbd5d0 Mon Sep 17 00:00:00 2001 From: Ilya Dryomov <idryo...@gmail.com> Date: Mon, 20 Feb 2017 18:50:50 +0100 Subject: [PATCH] block: get rid of blk_integrity_revalidate() Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") introduced blk_integrity_revalidate(), which clears the stable pages bit if no blk_integrity profile is registered. It's called from revalidate_disk() and rescan_partitions(), making it impossible to set BDI_CAP_STABLE_WRITES for drivers that support partitions and don't use blk_integrity. One such driver is rbd, where the ceph messenger is responsible for generating/verifying CRCs. Since blk_integrity_{un,}register() "must" be used for (un)registering the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES setting there. This way drivers that call blk_integrity_register() and use integrity infrastructure won't interfere with drivers that don't but still want stable pages. Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- block/blk-integrity.c | 19 ++- block/partition-generic.c | 1 - fs/block_dev.c| 1 - include/linux/genhd.h | 2 -- 4 files changed, 2 insertions(+), 21 deletions(-) diff --git a/block/blk-integrity.c b/block/blk-integrity.c index d69c5c79f98e..319f2e4f4a8b 100644 --- a/block/blk-integrity.c +++ b/block/blk-integrity.c @@ -417,7 +417,7 @@ void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template bi->tuple_size = template->tuple_size; bi->tag_size = template->tag_size; - blk_integrity_revalidate(disk); + disk->queue->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES; } EXPORT_SYMBOL(blk_integrity_register); @@ -