Re: [PATCH] io_uring: Try to merge io requests only for regular files

2021-03-19 Thread Dmitry Monakhov
- stable@



19.03.2021, 08:29, "Dmitry Monakhov" :
> Otherwise we may end up blocking on a pipe or socket.
>
> Fixes: 6d5d5ac ("io_uring: extend async work merge")
> Testcase: 
> https://github.com/dmonakhov/liburing/commit/16d171b6ef9d68e6db66650a83d98c5c721d01f6
> Signed-off-by: Dmitry Monakhov 
> ---
>  fs/io_uring.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 478df7e..848657c 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2183,6 +2183,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, 
> struct io_kiocb *req,
>  static struct async_list *io_async_list_from_req(struct io_ring_ctx *ctx,
>   struct io_kiocb *req)
>  {
> + if (!(req->flags & REQ_F_ISREG))
> + return NULL;
> +
IMHO it is reasonable to completely disable the io_should_merge logic, because
even with this fix it is still affected by latency spikes like the following:

->submit_read: req1[slow_hdd, sector=z]
->submit_read: req2[nvme, sector=X]
-> wait(req2)  -> fast

->submit_read: req3[nvme, sector=Y] 
--> wait(req3)  ->slow  
  if X and Y belong to the same page, the merge logic will make req3 wait for req1
to complete



From c24cda9aa902775e4e23575b758f009e9fae4640 Mon Sep 17 00:00:00 2001
From: Dmitry Monakhov 
Date: Fri, 19 Mar 2021 08:48:18 +0300
Subject: [PATCH] io_uring: completely disable io_should_merge logic

The io_should_merge logic is weird and prone to latency spikes caused by slow neighbors.

Signed-off-by: Dmitry Monakhov 

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 478df7e..a4fb3a7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1288,7 +1288,7 @@ static ssize_t io_import_iovec(struct io_ring_ctx *ctx, int rw,
 
 static inline bool io_should_merge(struct async_list *al, struct kiocb *kiocb)
 {
-	if (al->file == kiocb->ki_filp) {
+	if (al->file == kiocb->ki_filp && 0) {
 		off_t start, end;
 
 		/*
-- 
2.7.4



[PATCH] io_uring: Try to merge io requests only for regular files

2021-03-18 Thread Dmitry Monakhov
Otherwise we may end up blocking on a pipe or socket.

Fixes: 6d5d5ac ("io_uring: extend async work merge")
Testcase: 
https://github.com/dmonakhov/liburing/commit/16d171b6ef9d68e6db66650a83d98c5c721d01f6
Signed-off-by: Dmitry Monakhov 
---
 fs/io_uring.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 478df7e..848657c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2183,6 +2183,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, 
struct io_kiocb *req,
 static struct async_list *io_async_list_from_req(struct io_ring_ctx *ctx,
 struct io_kiocb *req)
 {
+   if (!(req->flags & REQ_F_ISREG))
+   return NULL;
+
switch (req->submit.opcode) {
case IORING_OP_READV:
case IORING_OP_READ_FIXED:
-- 
2.7.4



Re: [PATCH 1/1] nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a

2021-03-11 Thread Dmitry Monakhov



10.03.2021, 16:41, "Christoph Hellwig" :
> On Wed, Mar 10, 2021 at 02:21:56PM +0100, Christoph Hellwig wrote:
>>  Can you try this patch instead?
>>
>>  http://lists.infradead.org/pipermail/linux-nvme/2021-February/023183.html
>
> Actually, please try the patch below instead, it looks like our existing
> logic messes up the units:
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index e68a8c4ac5a6ea..1867fdf2205bd7 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1963,30 +1963,18 @@ static void nvme_config_discard(struct gendisk *disk, 
> struct nvme_ns *ns)
>  blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
>  }
>
> -static void nvme_config_write_zeroes(struct gendisk *disk, struct nvme_ns 
> *ns)
> +/*
> + * Even though NVMe spec explicitly states that MDTS is not applicable to the
> + * write-zeroes, we are cautious and limit the size to the controllers
> + * max_hw_sectors value, which is based on the MDTS field and possibly other
> + * limiting factors.
> + */
> +static void nvme_config_write_zeroes(struct request_queue *q,
> + struct nvme_ctrl *ctrl)
>  {
> - u64 max_blocks;
> -
> - if (!(ns->ctrl->oncs & NVME_CTRL_ONCS_WRITE_ZEROES) ||
> - (ns->ctrl->quirks & NVME_QUIRK_DISABLE_WRITE_ZEROES))
> - return;
> - /*
> - * Even though NVMe spec explicitly states that MDTS is not
> - * applicable to the write-zeroes:- "The restriction does not apply to
> - * commands that do not transfer data between the host and the
> - * controller (e.g., Write Uncorrectable ro Write Zeroes command).".
> - * In order to be more cautious use controller's max_hw_sectors value
> - * to configure the maximum sectors for the write-zeroes which is
> - * configured based on the controller's MDTS field in the
> - * nvme_init_identify() if available.
> - */
> - if (ns->ctrl->max_hw_sectors == UINT_MAX)
> - max_blocks = (u64)USHRT_MAX + 1;
> - else
> - max_blocks = ns->ctrl->max_hw_sectors + 1;
> -
> - blk_queue_max_write_zeroes_sectors(disk->queue,
> - nvme_lba_to_sect(ns, max_blocks));
> + if ((ctrl->oncs & NVME_CTRL_ONCS_WRITE_ZEROES) &&
> + !(ctrl->quirks & NVME_QUIRK_DISABLE_WRITE_ZEROES))
> + blk_queue_max_write_zeroes_sectors(q, ctrl->max_hw_sectors);
>  }
>
>  static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids)
> @@ -2158,7 +2146,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
>  set_capacity_and_notify(disk, capacity);
>
>  nvme_config_discard(disk, ns);
> - nvme_config_write_zeroes(disk, ns);
> + nvme_config_write_zeroes(disk->queue, ns->ctrl);
>
>  set_disk_ro(disk, (id->nsattr & NVME_NS_ATTR_RO) ||
>  test_bit(NVME_NS_FORCE_RO, >flags));
In order to exclude a possible issue with incorrect request sizes, I've run a test
which does write_zeroes via the fio falloc engine with randtrim, which actually does
fallocate(PUNCH_HOLE|KEEP_SIZE) and converts to blkdev_issue_zeroout().
note: fio should be patched, see: https://github.com/axboe/fio/pull/1203

fio --name t --ioengine=falloc --rw=randtrim --bs=512 --size=100M 
--filename=/dev/nvme0n1 --numjobs=16
After a couple of minutes it got stuck, and then timeouts occurred.
cat /sys/kernel/debug/block/nvme0n1/hctx*/busy  

 
cd27b755 {.op=WRITE_ZEROES, .cmd_flags=SYNC, 
.rq_flags=DONTPREP|IO_STAT|STATS, .state=in_flight, .tag=205, .internal_tag=-1}
9d3f2b8f {.op=WRITE_ZEROES, .cmd_flags=SYNC, 
.rq_flags=DONTPREP|IO_STAT|STATS, .state=in_flight, .tag=244, .internal_tag=-1}
eb4166fe {.op=WRITE_ZEROES, .cmd_flags=SYNC, 
.rq_flags=DONTPREP|IO_STAT|STATS, .state=in_flight, .tag=709, .internal_tag=-1}
49b49c60 {.op=WRITE_ZEROES, .cmd_flags=SYNC, 
.rq_flags=DONTPREP|IO_STAT|STATS, .state=in_flight, .tag=433, .internal_tag=-1}
18b93c40 {.op=WRITE_ZEROES, .cmd_flags=SYNC, 
.rq_flags=DONTPREP|IO_STAT|STATS, .state=in_flight, .tag=5, .internal_tag=-1}
ac15ef73 {.op=WRITE_ZEROES, .cmd_flags=SYNC, 
.rq_flags=DONTPREP|IO_STAT|STATS, .state=in_flight, .tag=268, .internal_tag=-1}

So this is definitely a hardware issue, and write_zeroes should be disabled for
this particular model.



[RESEND][PATCH] nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a

2021-03-10 Thread Dmitry Monakhov
This adds a quirk for the Samsung PM1725a drive which fixes timeouts and
I/O errors caused by the controller not properly handling the Write
Zeroes command. dmesg log:

nvme nvme0: I/O 528 QID 10 timeout, aborting
nvme nvme0: I/O 529 QID 10 timeout, aborting
nvme nvme0: I/O 530 QID 10 timeout, aborting
nvme nvme0: I/O 531 QID 10 timeout, aborting
nvme nvme0: I/O 532 QID 10 timeout, aborting
nvme nvme0: I/O 533 QID 10 timeout, aborting
nvme nvme0: I/O 534 QID 10 timeout, aborting
nvme nvme0: I/O 535 QID 10 timeout, aborting
nvme nvme0: Abort status: 0x0
nvme nvme0: Abort status: 0x0
nvme nvme0: Abort status: 0x0
nvme nvme0: Abort status: 0x0
nvme nvme0: Abort status: 0x0
nvme nvme0: Abort status: 0x0
nvme nvme0: Abort status: 0x0
nvme nvme0: Abort status: 0x0
nvme nvme0: I/O 528 QID 10 timeout, reset controller
nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
nvme nvme0: Device not ready; aborting reset, CSTS=0x3
nvme nvme0: Device not ready; aborting reset, CSTS=0x3
nvme nvme0: Removing after probe failure status: -19
nvme0n1: detected capacity change from 6251233968 to 0
blk_update_request: I/O error, dev nvme0n1, sector 32776 op 0x1:(WRITE) flags 
0x3000 phys_seg 6 prio class 0
blk_update_request: I/O error, dev nvme0n1, sector 113319936 op 
0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
Buffer I/O error on dev nvme0n1p2, logical block 1, lost async page write
blk_update_request: I/O error, dev nvme0n1, sector 113319680 op 
0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
Buffer I/O error on dev nvme0n1p2, logical block 2, lost async page write
blk_update_request: I/O error, dev nvme0n1, sector 113319424 op 
0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
Buffer I/O error on dev nvme0n1p2, logical block 3, lost async page write
blk_update_request: I/O error, dev nvme0n1, sector 113319168 op 
0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
Buffer I/O error on dev nvme0n1p2, logical block 4, lost async page write
blk_update_request: I/O error, dev nvme0n1, sector 113318912 op 
0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
Buffer I/O error on dev nvme0n1p2, logical block 5, lost async page write
blk_update_request: I/O error, dev nvme0n1, sector 113318656 op 
0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
Buffer I/O error on dev nvme0n1p2, logical block 6, lost async page write
blk_update_request: I/O error, dev nvme0n1, sector 113318400 op 
0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
blk_update_request: I/O error, dev nvme0n1, sector 113318144 op 
0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
blk_update_request: I/O error, dev nvme0n1, sector 113317888 op 
0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0

Signed-off-by: Dmitry Monakhov 
---
 drivers/nvme/host/pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17ab332..7249ae7 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3246,6 +3246,7 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
{ PCI_DEVICE(0x144d, 0xa822),   /* Samsung PM1725a */
.driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY |
+   NVME_QUIRK_DISABLE_WRITE_ZEROES|
NVME_QUIRK_IGNORE_DEV_SUBNQN, },
{ PCI_DEVICE(0x1987, 0x5016),   /* Phison E16 */
.driver_data = NVME_QUIRK_IGNORE_DEV_SUBNQN, },
-- 
2.7.4



Re: [PATCH] bfq: fix blkio cgroup leakage

2020-08-11 Thread Dmitry Monakhov


Paolo Valente  writes:

>> On 9 Jul 2020, at 10:19, Dmitry Monakhov wrote:
>> 
>> Paolo Valente  writes:
>> 
>>>> On 8 Jul 2020, at 19:48, Dmitry Monakhov wrote:
>>>> 
>>>> Paolo Valente  writes:
>>>> 
>>>>> Hi,
>>>>> sorry for the delay.  The commit you propose to drop fixes the issues
>>>>> reported in [1].
>>>>> 
>>>>> Such a commit does introduce the leak that you report (thank you for
>>>>> spotting it).  Yet, according to the threads mentioned in [1],
>>>>> dropping that commit would take us back to those issues.
>>>>> 
>>>>> Maybe the solution is to fix the unbalance that you spotted?
>>>> I'm not quite sure that I understand which bug was addressed by
>>>> commit db37a34c563b.
>>>> AFAIU both bugs mentioned in the original patchset were fixed by:
>>>> 478de3380 ("block, bfq: deschedule empty bfq_queues not referred by any
>>>> process")
>>>> f718b0932 ("block, bfq: do not plug I/O for bfq_queues with no proc refs")
>>>> 
>>>> So I review commit db37a34c563b as an independent one.
>>>> It introduces an extra reference for bfq_groups via bfqg_and_blkg_get(),
>>>> but do we actually need it here?
>>>> 
>>>> #IF CONFIG_BFQ_GROUP_IOSCHED is enabled:
>>>> bfqd->root_group is held by bfqd from bfq_init_queue()
>>>> other bfq_queue objects are owned by the corresponding blkcg from
>>>> bfq_pd_alloc()
>>>> So bfq_queue can not disappear under us.
>>>> 
>>> 
>>> You are right, but incomplete.  No extra ref is needed for an entity
>>> that represents a bfq_queue.  And this consideration misled me before
>>> I realized that that commit was needed.  The problem is that an entity
>>> may also represent a group of entities.  In that case no reference is
>>> taken through any bfq_queue.  The commit you want to remove takes this
>>> missing reference.
>> Sorry, it looks like I mistyped the sentence above; I meant to say bfq_group.
>> So here is my statement corrected:
>> #IF CONFIG_BFQ_GROUP_IOSCHED is enabled:
>> bfqd->root_group is held by bfqd from bfq_init_queue()
>> other *bfq_group* objects are owned by the corresponding blkcg, with a
>> reference taken in bfq_pd_alloc()
>> So a *bfq_group* can not disappear under us.
>> 
>> So no extra reference is required for an entity representing a bfq_group.
>> The commit is not required.
>
> No, the entity may remain alive and on some tree after bfq_pd_offline has 
> been invoked.
But the bfq_group's entity is still held by its child's entity from here:
 -> bfq_init_entity()
->bfqg_and_blkg_get(bfqg);
->entity->parent = bfqg->my_entity

 -> bfq_put_queue(bfqq)
FINAL_PUT
->bfqg_and_blkg_put(bfqq_group(bfqq))
->kmem_cache_free(bfq_pool, bfqq);

so the group can not just disappear while the tree is in service. Please correct me
if I'm wrong.
BTW, I've sent a new version with an updated description here [1]

Footnotes:
[1] 
https://lore.kernel.org/linux-block/20200811064340.31284-1-dmtrmonak...@yandex-team.ru
>
> Paolo
>
>>> 
>>> Paolo
>>> 
>>>> #IF CONFIG_BFQ_GROUP_IOSCHED is disabled:
>>>> we have only one bfqd->root_group object, which is allocated from
>>>> bfq_create_group_hierarchy(),
>>>> and bfqg_and_blkg_get()/bfqg_and_blkg_put() are noops
>>>> 
>>>> To summarize: in both cases an extra reference is not required, so I
>>>> continue to insist that we should revert commit db37a34c563b because it
>>>> tries to solve a non-existent issue, but introduces a real one.
>>>> 
>>>> Please correct me if I'm wrong.
>>>>> 
>>>>> I'll check it ASAP, unless you do it before me.
>>>>> 
>>>>> Thanks,
>>>>> Paolo
>>>>> 
>>>>> [1] https://lkml.org/lkml/2020/1/31/94
>>>>> 
>>>>>> On 2 Jul 2020, at 12:57, Dmitry Monakhov wrote:
>>>>>> 
>>>>>> commit db37a34c563b ("block, bfq: get a ref to a group when adding it to 
>>>>>> a service tree")
>>>>>> introduces a leak of bfq_group and blkcg_gq objects because of a get/put
>>>>>> imbalance. See the trace below:
>>>>>> -> blkg_alloc
>>>>>> -> bfq_pq_alloc
&g

[PATCH] bfq: fix blkio cgroup leakage v4

2020-08-11 Thread Dmitry Monakhov
Changes from v1:
- update commit description with proper ref-accounting justification

commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a 
service tree")
introduces a leak of bfq_group and blkcg_gq objects because of a get/put
imbalance.
In fact the whole idea of the original commit is wrong, because the bfq_group
entity can not disappear under us: it is referenced by its child bfq_queue's
entities from here:
 -> bfq_init_entity()
->bfqg_and_blkg_get(bfqg);
->entity->parent = bfqg->my_entity

 -> bfq_put_queue(bfqq)
FINAL_PUT
->bfqg_and_blkg_put(bfqq_group(bfqq))
->kmem_cache_free(bfq_pool, bfqq);

So the parent entity can not disappear while a child entity is in the tree,
and child entities already have proper protection.
This patch reverts commit db37a34c563b ("block, bfq: get a ref to a group when
adding it to a service tree").


bfq_group leak trace caused by bad commit:
-> blkg_alloc
   -> bfq_pq_alloc
 -> bfqg_get (+1)
->bfq_activate_bfqq
  ->bfq_activate_requeue_entity
-> __bfq_activate_entity
   ->bfq_get_entity
 ->bfqg_and_blkg_get (+1)  < : Note1
->bfq_del_bfqq_busy
  ->bfq_deactivate_entity+0x53/0xc0 [bfq]
->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
  -> bfq_forget_entity(is_in_service = true)
 entity->on_st_or_in_serv = false   <=== :Note2
 if (is_in_service)
 return;  ==> do not touch reference
-> blkcg_css_offline
 -> blkcg_destroy_blkgs
  -> blkg_destroy
   -> bfq_pd_offline
-> __bfq_deactivate_entity
 if (!entity->on_st_or_in_serv) /* true, because (Note2)
return false;
 -> bfq_pd_free
-> bfqg_put() (-1, but bfqg->ref == 2 because of (Note2))
So bfq_group and blkcg_gq will leak forever; see the test-case below.


##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<max_iters;i++))
do
mkdir /sys/fs/cgroup/blkio/a
echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
rmdir /sys/fs/cgroup/blkio/a
grep blkio /proc/cgroups
done
##TESTCASE_END:

Signed-off-by: Dmitry Monakhov 
---
 block/bfq-cgroup.c  |  2 +-
 block/bfq-iosched.h |  1 -
 block/bfq-wf2q.c| 12 ++--
 3 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 68882b9..b791e20 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -332,7 +332,7 @@ static void bfqg_put(struct bfq_group *bfqg)
kfree(bfqg);
 }
 
-void bfqg_and_blkg_get(struct bfq_group *bfqg)
+static void bfqg_and_blkg_get(struct bfq_group *bfqg)
 {
/* see comments in bfq_bic_update_cgroup for why refcounting bfqg */
bfqg_get(bfqg);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index cd224aa..7038952 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -986,7 +986,6 @@ struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
 struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg);
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node);
-void bfqg_and_blkg_get(struct bfq_group *bfqg);
 void bfqg_and_blkg_put(struct bfq_group *bfqg);
 
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index eb0e2a6..26776bd 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -533,9 +533,7 @@ static void bfq_get_entity(struct bfq_entity *entity)
bfqq->ref++;
bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
 bfqq, bfqq->ref);
-   } else
-   bfqg_and_blkg_get(container_of(entity, struct bfq_group,
-  entity));
+   }
 }
 
 /**
@@ -649,14 +647,8 @@ static void bfq_forget_entity(struct bfq_service_tree *st,
 
entity->on_st_or_in_serv = false;
st->wsum -= entity->weight;
-   if (is_in_service)
-   return;
-
-   if (bfqq)
+   if (bfqq && !is_in_service)
bfq_put_queue(bfqq);
-   else
-   bfqg_and_blkg_put(container_of(entity, struct bfq_group,
-  entity));
 }
 
 /**
-- 
2.7.4



[PATCH 1/2] lib/test_lockup.c: add measure_alloc_pages_wait option

2020-08-10 Thread Dmitry Monakhov
measure_alloc_pages_wait=Y  measure maximum page allocation wait time

Signed-off-by: Dmitry Monakhov 
---
 lib/test_lockup.c | 27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/lib/test_lockup.c b/lib/test_lockup.c
index 0f81252..867b2f4 100644
--- a/lib/test_lockup.c
+++ b/lib/test_lockup.c
@@ -77,6 +77,10 @@ static bool call_cond_resched;
 module_param(call_cond_resched, bool, 0600);
 MODULE_PARM_DESC(call_cond_resched, "call cond_resched() between iterations");
 
+static bool measure_alloc_pages_wait;
+module_param(measure_alloc_pages_wait, bool, 0400);
+MODULE_PARM_DESC(measure_alloc_pages_wait, "measure page allocation wait 
time");
+
 static bool measure_lock_wait;
 module_param(measure_lock_wait, bool, 0400);
 MODULE_PARM_DESC(measure_lock_wait, "measure lock wait time");
@@ -162,6 +166,7 @@ MODULE_PARM_DESC(lock_sb_umount, "lock file -> sb -> 
s_umount");
 static atomic_t alloc_pages_failed = ATOMIC_INIT(0);
 
 static atomic64_t max_lock_wait = ATOMIC64_INIT(0);
+static atomic64_t max_alloc_pages_wait = ATOMIC64_INIT(0);
 
 static struct task_struct *main_task;
 static int master_cpu;
@@ -305,6 +310,10 @@ static void test_alloc_pages(struct list_head *pages)
 {
struct page *page;
unsigned int i;
+   u64 wait_start;
+
+   if (measure_alloc_pages_wait)
+   wait_start = local_clock();
 
for (i = 0; i < alloc_pages_nr; i++) {
page = alloc_pages(alloc_pages_gfp, alloc_pages_order);
@@ -314,6 +323,17 @@ static void test_alloc_pages(struct list_head *pages)
}
list_add(&page->lru, pages);
}
+   if (measure_alloc_pages_wait) {
+   s64 cur_wait = local_clock() - wait_start;
+   s64 max_wait = atomic64_read(&max_alloc_pages_wait);
+
+   do {
+   if (cur_wait < max_wait)
+   break;
+   max_wait = atomic64_cmpxchg(&max_alloc_pages_wait,
+   max_wait, cur_wait);
+   } while (max_wait != cur_wait);
+   }
 }
 
 static void test_free_pages(struct list_head *pages)
@@ -578,10 +598,13 @@ static int __init test_lockup_init(void)
pr_notice("Maximum lock wait: %lld ns\n",
  atomic64_read(&max_lock_wait));
 
-   if (alloc_pages_nr)
+   if (alloc_pages_nr) {
pr_notice("Page allocation failed %u times\n",
  atomic_read(&alloc_pages_failed));
-
+   if (measure_alloc_pages_wait)
+   pr_notice("Maximum pages allocation wait: %lld ns\n",
+ atomic64_read(&max_alloc_pages_wait));
+   }
pr_notice("FINISH in %llu ns\n", local_clock() - test_start);
 
if (test_file)
-- 
2.7.4



[PATCH 2/2] lib/test_lockup.c: allow cond_resched inside iteration

2020-08-10 Thread Dmitry Monakhov
- New options:
  cond_resched_inside=Y  call cond_resched inside iteration under lock
  measure_lockup=Y   measure maximum lockup time

- Rename option:
  call_cond_resched=Y -> cond_resched_outside=Y: call cond_resched()
 between iterations.

This allows us to simulate a situation where a process calls cond_resched()
with locks held.

The example demonstrates a priority inversion issue with an eBPF program, where
a low-priority task is scheduled out while holding cgroup_mutex for long
periods of time, which blocks other high-priority tasks.

CGROUP_MUTEX=$(gawk '$3 == "cgroup_mutex" {print "0x"$1}' /proc/kallsyms)
# Emulate ebpf-applications load which can hung inside cgroup_bpf_attach()
nice -20 modprobe lib/test_lockup.ko \
  time_nsecs=1000 cooldown_nsecs=10 iterations=10 \
  lock_mutex_ptr=$CGROUP_MUTEX reacquire_locks=Y\
  cond_resched_inside=Y measure_lockup=Y &

stress-ng -c $(nproc) --timeout 10s&
time mkdir /sys/fs/cgroup/blkio/a
time rmdir /sys/fs/cgroup/blkio/a

Signed-off-by: Dmitry Monakhov 
---
 lib/test_lockup.c | 45 +
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/lib/test_lockup.c b/lib/test_lockup.c
index 867b2f4..5666c00 100644
--- a/lib/test_lockup.c
+++ b/lib/test_lockup.c
@@ -73,9 +73,13 @@ static bool touch_hardlockup;
 module_param(touch_hardlockup, bool, 0600);
 MODULE_PARM_DESC(touch_hardlockup, "touch hard-lockup watchdog between 
iterations");
 
-static bool call_cond_resched;
-module_param(call_cond_resched, bool, 0600);
-MODULE_PARM_DESC(call_cond_resched, "call cond_resched() between iterations");
+static bool cond_resched_inside;
+module_param(cond_resched_inside, bool, 0600);
+MODULE_PARM_DESC(cond_resched_inside, "call cond_resched() during iteration");
+
+static bool cond_resched_outside;
+module_param(cond_resched_outside, bool, 0600);
+MODULE_PARM_DESC(cond_resched_outside, "call cond_resched() between 
iterations");
 
 static bool measure_alloc_pages_wait;
 module_param(measure_alloc_pages_wait, bool, 0400);
@@ -85,6 +89,10 @@ static bool measure_lock_wait;
 module_param(measure_lock_wait, bool, 0400);
 MODULE_PARM_DESC(measure_lock_wait, "measure lock wait time");
 
+static bool measure_lockup;
+module_param(measure_lockup, bool, 0400);
+MODULE_PARM_DESC(measure_lockup, "measure maximum lockup time");
+
 static unsigned long lock_wait_threshold = ULONG_MAX;
 module_param(lock_wait_threshold, ulong, 0400);
 MODULE_PARM_DESC(lock_wait_threshold, "print lock wait time longer than this 
in nanoseconds, default off");
@@ -167,6 +175,7 @@ static atomic_t alloc_pages_failed = ATOMIC_INIT(0);
 
 static atomic64_t max_lock_wait = ATOMIC64_INIT(0);
 static atomic64_t max_alloc_pages_wait = ATOMIC64_INIT(0);
+static atomic64_t max_lockup_time = ATOMIC64_INIT(0);
 
 static struct task_struct *main_task;
 static int master_cpu;
@@ -369,6 +378,7 @@ static void test_wait(unsigned int secs, unsigned int nsecs)
 static void test_lockup(bool master)
 {
u64 lockup_start = local_clock();
+   u64 iter_start;
unsigned int iter = 0;
LIST_HEAD(pages);
 
@@ -379,12 +389,18 @@ static void test_lockup(bool master)
test_alloc_pages(&pages);
 
while (iter++ < iterations && !signal_pending(main_task)) {
+   s64 cur_lockup, max_lockup;
+
+   iter_start = local_clock();
 
if (iowait)
current->in_iowait = 1;
 
test_wait(time_secs, time_nsecs);
 
+   if (cond_resched_inside)
+   cond_resched();
+
if (iowait)
current->in_iowait = 0;
 
@@ -400,7 +416,15 @@ static void test_lockup(bool master)
if (touch_hardlockup)
touch_nmi_watchdog();
 
-   if (call_cond_resched)
+   cur_lockup  = local_clock() - iter_start;
+   max_lockup = atomic64_read(&max_lockup_time);
+   do {
+   if (cur_lockup < max_lockup)
+   break;
+   max_lockup = atomic64_cmpxchg(&max_lockup_time, 
max_lockup, cur_lockup);
+   } while (max_lockup != cur_lockup);
+
+   if (cond_resched_outside)
cond_resched();
 
test_wait(cooldown_secs, cooldown_nsecs);
@@ -515,8 +539,8 @@ static int __init test_lockup_init(void)
return -EINVAL;
 #endif
 
-   if ((wait_state != TASK_RUNNING ||
-(call_cond_resched && !reacquire_locks) ||
+   if ((wait_state != TASK_RUNNING || cond_resched_inside ||
+(cond_resched_outside && !reacquire_locks) ||
 (alloc_pages_nr && gfpflags_allow_blocking(alloc_pages_gfp))) &&
(test_dis

[PATCH] lib/test_lockup.c: add parameters for cond_resched inside loop

2020-08-09 Thread Dmitry Monakhov
call_cond_resched_before=Y  call cond_resched with resource before wait
call_cond_resched_after=Y   call cond_resched with resource after wait
measure_cond_resched=Y  measure maximum cond_resched time inside loop

This simulates a situation where a process calls cond_resched() with a lock held.

The example demonstrates a priority inversion issue with an eBPF program, where
a low-priority task is scheduled out while holding cgroup_mutex for long
periods of time, which blocks other high-priority programs.

CGROUP_MUTEX=$(gawk '$3 == "cgroup_mutex" {print "0x"$1}' /proc/kallsyms)
# Emulate ebpf-application load which can hung inside cgroup_bpf_attach()
nice -20 modprobe lib/test_lockup.ko \
  time_nsecs=1000 cooldown_nsecs=10 iterations=10 \
  lock_mutex_ptr=$CGROUP_MUTEX \
  measure_lock_wait=Y call_cond_resched_after=Y &

stress-ng -c $(nproc) --timeout 10s&
time mkdir /sys/fs/cgroup/blkio/a

Signed-off-by: Dmitry Monakhov 
---
 lib/test_lockup.c | 44 +++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/lib/test_lockup.c b/lib/test_lockup.c
index 0f81252..3e05d6e 100644
--- a/lib/test_lockup.c
+++ b/lib/test_lockup.c
@@ -77,6 +77,18 @@ static bool call_cond_resched;
 module_param(call_cond_resched, bool, 0600);
 MODULE_PARM_DESC(call_cond_resched, "call cond_resched() between iterations");
 
+static bool call_cond_resched_before;
+module_param(call_cond_resched_before, bool, 0600);
+MODULE_PARM_DESC(call_cond_resched_before, "call cond_resched() before wait");
+
+static bool call_cond_resched_after;
+module_param(call_cond_resched_after, bool, 0600);
+MODULE_PARM_DESC(call_cond_resched_after, "call cond_resched() after wait");
+
+static bool measure_cond_resched;
+module_param(measure_cond_resched, bool, 0400);
+MODULE_PARM_DESC(measure_cond_resched, "measure cond_resched time");
+
 static bool measure_lock_wait;
 module_param(measure_lock_wait, bool, 0400);
 MODULE_PARM_DESC(measure_lock_wait, "measure lock wait time");
@@ -162,6 +174,7 @@ MODULE_PARM_DESC(lock_sb_umount, "lock file -> sb -> 
s_umount");
 static atomic_t alloc_pages_failed = ATOMIC_INIT(0);
 
 static atomic64_t max_lock_wait = ATOMIC64_INIT(0);
+static atomic64_t max_cond_resched = ATOMIC64_INIT(0);
 
 static struct task_struct *main_task;
 static int master_cpu;
@@ -346,6 +359,22 @@ static void test_wait(unsigned int secs, unsigned int 
nsecs)
}
 }
 
+static void test_cond_resched(void)
+{
+   s64 cur, old_max;
+   s64 start = local_clock();
+
+   cond_resched();
+
+   cur  = local_clock() - start;
+   old_max = atomic64_read(&max_cond_resched);
+   do {
+   if (cur < old_max)
+   break;
+   old_max = atomic64_cmpxchg(&max_cond_resched, old_max, cur);
+   } while (old_max != cur);
+}
+
 static void test_lockup(bool master)
 {
u64 lockup_start = local_clock();
@@ -363,8 +392,14 @@ static void test_lockup(bool master)
if (iowait)
current->in_iowait = 1;
 
+   if (call_cond_resched_before)
+   test_cond_resched();
+
test_wait(time_secs, time_nsecs);
 
+   if (call_cond_resched_after)
+   test_cond_resched();
+
if (iowait)
current->in_iowait = 0;
 
@@ -497,6 +532,7 @@ static int __init test_lockup_init(void)
 
if ((wait_state != TASK_RUNNING ||
 (call_cond_resched && !reacquire_locks) ||
+call_cond_resched_before || call_cond_resched_after ||
 (alloc_pages_nr && gfpflags_allow_blocking(alloc_pages_gfp))) &&
(test_disable_irq || disable_softirq || disable_preempt ||
 lock_rcu || lock_spinlock_ptr || lock_rwlock_ptr)) {
@@ -532,7 +568,7 @@ static int __init test_lockup_init(void)
if (test_lock_sb_umount && test_inode)
lock_rwsem_ptr = (unsigned long)&test_inode->i_sb->s_umount;
 
-   pr_notice("START pid=%d time=%u +%u ns cooldown=%u +%u ns iterations=%u 
state=%s %s%s%s%s%s%s%s%s%s%s%s\n",
+   pr_notice("START pid=%d time=%u +%u ns cooldown=%u +%u ns iterations=%u 
state=%s %s%s%s%s%s%s%s%s%s%s%s%s%s\n",
  main_task->pid, time_secs, time_nsecs,
  cooldown_secs, cooldown_nsecs, iterations, state,
  all_cpus ? "all_cpus " : "",
@@ -545,6 +581,8 @@ static int __init test_lockup_init(void)
  touch_softlockup ? "touch_softlockup " : "",
  touch_hardlockup ? "touch_hardlockup " : "",
  call_cond_resched ? "call_cond_resched " : "",
+ call_cond_resched_before ? "call_cond_resched_before " : "

[PATCH] block: bfq fix blkio cgroup leakage v3

2020-07-27 Thread Dmitry Monakhov
commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a 
service tree")
introduces a leak of bfq_group and blkcg_gq objects because of a get/put
imbalance. See the trace below:
-> blkg_alloc
   -> bfq_pq_alloc
 -> bfqg_get (+1)
->bfq_activate_bfqq
  ->bfq_activate_requeue_entity
-> __bfq_activate_entity
   ->bfq_get_entity
 ->bfqg_and_blkg_get (+1)  < : Note1
->bfq_del_bfqq_busy
  ->bfq_deactivate_entity+0x53/0xc0 [bfq]
->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
  -> bfq_forget_entity(is_in_service = true)
 entity->on_st_or_in_serv = false   <=== :Note2
 if (is_in_service)
 return;  ==> do not touch reference
-> blkcg_css_offline
 -> blkcg_destroy_blkgs
  -> blkg_destroy
   -> bfq_pd_offline
-> __bfq_deactivate_entity
 if (!entity->on_st_or_in_serv) /* true, because (Note2)
return false;
 -> bfq_pd_free
-> bfqg_put() (-1, but bfqg->ref == 2 because of (Note2))
So bfq_group and blkcg_gq will leak forever; see the test-case below.

We should drop the group reference once it is finally removed from service
inside __bfq_bfqd_reset_in_service(), as we do with queue entities.

##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<max_iters;i++))
do
mkdir /sys/fs/cgroup/blkio/a
echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
rmdir /sys/fs/cgroup/blkio/a
grep blkio /proc/cgroups
done
##TESTCASE_END:

changes since v2:
 - use safe iteration macro to prevent freed object dereference.

Signed-off-by: Dmitry Monakhov 
---
 block/bfq-wf2q.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 8113138..13b417a 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -635,14 +635,10 @@ static void bfq_idle_insert(struct bfq_service_tree *st,
  * @entity: the entity being removed.
  * @is_in_service: true if entity is currently the in-service entity.
  *
- * Forget everything about @entity. In addition, if entity represents
- * a queue, and the latter is not in service, then release the service
- * reference to the queue (the one taken through bfq_get_entity). In
- * fact, in this case, there is really no more service reference to
- * the queue, as the latter is also outside any service tree. If,
- * instead, the queue is in service, then __bfq_bfqd_reset_in_service
- * will take care of putting the reference when the queue finally
- * stops being served.
+ * Forget everything about @entity. If entity is not in service, then release
+ * the service reference to the entity (the one taken through  bfq_get_entity).
+ * If the entity is in service, then __bfq_bfqd_reset_in_service will take care
+ * of putting the reference when the entity finally stops being served.
  */
 static void bfq_forget_entity(struct bfq_service_tree *st,
  struct bfq_entity *entity,
@@ -1615,6 +1611,7 @@ bool __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
struct bfq_queue *in_serv_bfqq = bfqd->in_service_queue;
struct bfq_entity *in_serv_entity = &in_serv_bfqq->entity;
struct bfq_entity *entity = in_serv_entity;
+   struct bfq_entity *parent = NULL;
 
bfq_clear_bfqq_wait_request(in_serv_bfqq);
hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
@@ -1626,9 +1623,16 @@ bool __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
 * execute the final step: reset in_service_entity along the
 * path from entity to the root.
 */
-   for_each_entity(entity)
+   for_each_entity_safe(entity, parent) {
entity->sched_data->in_service_entity = NULL;
-
+   /*
+* Release bfq_groups reference if it was not released in
+* bfq_forget_entity, which was taken in bfq_get_entity.
+*/
+   if (!bfq_entity_to_bfqq(entity) && !entity->on_st_or_in_serv)
+   bfqg_and_blkg_put(container_of(entity, struct bfq_group,
+  entity));
+   }
/*
 * in_serv_entity is no longer in service, so, if it is in no
 * service tree either, then release the service reference to
-- 
2.7.4



[PATCH] block: bfq fix blkio cgroup leakage v2

2020-07-20 Thread Dmitry Monakhov
commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a
service tree") introduces a leak of bfq_group and blkcg_gq objects because
of a get/put imbalance. See the trace below:
-> blkg_alloc
   -> bfq_pq_alloc
      -> bfqg_get (+1)
-> bfq_activate_bfqq
   -> bfq_activate_requeue_entity
      -> __bfq_activate_entity
         -> bfq_get_entity
            -> bfqg_and_blkg_get (+1)  <==== :Note1
-> bfq_del_bfqq_busy
   -> bfq_deactivate_entity+0x53/0xc0 [bfq]
      -> __bfq_deactivate_entity+0x1b8/0x210 [bfq]
         -> bfq_forget_entity(is_in_service = true)
            entity->on_st_or_in_serv = false   <=== :Note2
            if (is_in_service)
                return;  ==> do not touch reference
-> blkcg_css_offline
   -> blkcg_destroy_blkgs
      -> blkg_destroy
         -> bfq_pd_offline
            -> __bfq_deactivate_entity
               if (!entity->on_st_or_in_serv) /* true, because of Note2 */
                   return false;
-> bfq_pd_free
   -> bfqg_put() (-1, but bfqg->ref == 2 because of Note2)
So bfq_group and blkcg_gq will leak forever; see the test case below.

We should drop the group reference once it is finally removed from service
inside __bfq_bfqd_reset_in_service(), as we do with queue entities.

##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<$max_iters;i++))
do
    mkdir -p /sys/fs/cgroup/blkio/a
    echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
    dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
    echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
    rmdir /sys/fs/cgroup/blkio/a
    grep blkio /proc/cgroups
done
##TESTCASE_END:

Signed-off-by: Dmitry Monakhov 
---
 block/bfq-wf2q.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 8113138..93b236c 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -635,14 +635,10 @@ static void bfq_idle_insert(struct bfq_service_tree *st,
  * @entity: the entity being removed.
  * @is_in_service: true if entity is currently the in-service entity.
  *
- * Forget everything about @entity. In addition, if entity represents
- * a queue, and the latter is not in service, then release the service
- * reference to the queue (the one taken through bfq_get_entity). In
- * fact, in this case, there is really no more service reference to
- * the queue, as the latter is also outside any service tree. If,
- * instead, the queue is in service, then __bfq_bfqd_reset_in_service
- * will take care of putting the reference when the queue finally
- * stops being served.
+ * Forget everything about @entity. If entity is not in service, then release
+ * the service reference to the entity (the one taken through  bfq_get_entity).
+ * If the entity is in service, then __bfq_bfqd_reset_in_service will take care
+ * of putting the reference when the entity finally stops being served.
  */
 static void bfq_forget_entity(struct bfq_service_tree *st,
  struct bfq_entity *entity,
@@ -1626,9 +1622,16 @@ bool __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
 * execute the final step: reset in_service_entity along the
 * path from entity to the root.
 */
-   for_each_entity(entity)
+   for_each_entity(entity) {
entity->sched_data->in_service_entity = NULL;
-
+   /*
+* Release bfq_groups reference if it was not released in
+* bfq_forget_entity, which was taken in bfq_get_entity.
+*/
+   if (!bfq_entity_to_bfqq(entity) && !entity->on_st_or_in_serv)
+   bfqg_and_blkg_put(container_of(entity, struct bfq_group,
+  entity));
+   }
/*
 * in_serv_entity is no longer in service, so, if it is in no
 * service tree either, then release the service reference to
-- 
2.7.4



Re: [PATCH] bfq: fix blkio cgroup leakage

2020-07-20 Thread Dmitry Monakhov
Paolo Valente  writes:

>> Il giorno 9 lug 2020, alle ore 10:19, Dmitry Monakhov  
>> ha scritto:
>> 
>> Paolo Valente  writes:
>> 
>>>> Il giorno 8 lug 2020, alle ore 19:48, Dmitry Monakhov 
>>>>  ha scritto:
>>>> 
>>>> Paolo Valente  writes:
>>>> 
>>>>> Hi,
>>>>> sorry for the delay.  The commit you propose to drop fix the issues
>>>>> reported in [1].
>>>>> 
>>>>> Such a commit does introduce the leak that you report (thank you for
>>>>> spotting it).  Yet, according to the threads mentioned in [1],
>>>>> dropping that commit would take us back to those issues.
>>>>> 
>>>>> Maybe the solution is to fix the unbalance that you spotted?
>>>> I'm not quite shure that do I understand which bug was addressed for 
>>>> commit db37a34c563b.
>>>> AFAIU both bugs mentioned in original patchset was fixed by:
>>>> 478de3380 ("block, bfq: deschedule empty bfq_queues not referred by any 
>>>> proces")
>>>> f718b0932 ( block, bfq: do not plug I/O for bfq_queues with no proc refs)"
>>>> 
>>>> So I review commit db37a34c563b as independent one.
>>>> It introduces extra reference for bfq_groups via bfqg_and_blkg_get(),
>>>> but do we actually need it here?
>>>> 
>>>> #IF CONFIG_BFQ_GROUP_IOSCHED is enabled:
>>>> bfqd->root_group is holded by bfqd from bfq_init_queue()
>>>> other bfq_queue objects are owned by corresponding blkcg from 
>>>> bfq_pd_alloc()
>>>> So bfq_queue can not disappear under us.
>>>> 
>>> 
>>> You are right, but incomplete.  No extra ref is needed for an entity
>>> that represents a bfq_queue.  And this consideration mistook me before
>>> I realized that that commit was needed.  The problem is that an entity
>>> may also represent a group of entities.  In that case no reference is
>>> taken through any bfq_queue.  The commit you want to remove takes this
>>> missing reference.
>> Sorry, It looks like I've mistyped sentance above, I ment to say bfq_group.
>> So here is my statement corrected:
>> #IF CONFIG_BFQ_GROUP_IOSCHED is enabled:
>> bfqd->root_group is holded by bfqd from bfq_init_queue()
>> other *bfq_group* objects are owned by corresponding blkcg, reference get 
>> from bfq_pd_alloc()
>> So *bfq_group* can not disappear under us.
>> 
>> So no extra reference is required for entity represents bfq_group. Commit is 
>> not required.
>
> No, the entity may remain alive and on some tree after bfq_pd_offline has 
> been invoked.
OK, you are right: we should drop the group reference inside
__bfq_bfqd_reset_in_service(), as we do for queue entities. Please see the
updated patch version.
From d708067cfa570f80b43c5716adf81d2a29b3d523 Mon Sep 17 00:00:00 2001
From: Dmitry Monakhov 
Date: Mon, 20 Jul 2020 15:10:34 +0300
Subject: [PATCH] block: bfq fix blkio cgroup leakage v2

commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
introduces a leak of bfq_group and blkcg_gq objects because of a get/put
imbalance. See the trace below:
-> blkg_alloc
   -> bfq_pq_alloc
      -> bfqg_get (+1)
-> bfq_activate_bfqq
   -> bfq_activate_requeue_entity
      -> __bfq_activate_entity
         -> bfq_get_entity
            -> bfqg_and_blkg_get (+1)  <==== :Note1
-> bfq_del_bfqq_busy
   -> bfq_deactivate_entity+0x53/0xc0 [bfq]
      -> __bfq_deactivate_entity+0x1b8/0x210 [bfq]
         -> bfq_forget_entity(is_in_service = true)
            entity->on_st_or_in_serv = false   <=== :Note2
            if (is_in_service)
                return;  ==> do not touch reference
-> blkcg_css_offline
   -> blkcg_destroy_blkgs
      -> blkg_destroy
         -> bfq_pd_offline
            -> __bfq_deactivate_entity
               if (!entity->on_st_or_in_serv) /* true, because of Note2 */
                   return false;
-> bfq_pd_free
   -> bfqg_put() (-1, but bfqg->ref == 2 because of Note2)
So bfq_group and blkcg_gq will leak forever; see the test case below.

We should drop the group reference once it is finally removed from service
inside __bfq_bfqd_reset_in_service(), as we do with queue entities.

##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<$max_iters;i++))
do
    mkdir -p /sys/fs/cgroup/blkio/a
    echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> 

Re: [PATCH] bfq: fix blkio cgroup leakage

2020-07-09 Thread Dmitry Monakhov
Paolo Valente  writes:

>> Il giorno 8 lug 2020, alle ore 19:48, Dmitry Monakhov  
>> ha scritto:
>> 
>> Paolo Valente  writes:
>> 
>>> Hi,
>>> sorry for the delay.  The commit you propose to drop fix the issues
>>> reported in [1].
>>> 
>>> Such a commit does introduce the leak that you report (thank you for
>>> spotting it).  Yet, according to the threads mentioned in [1],
>>> dropping that commit would take us back to those issues.
>>> 
>>> Maybe the solution is to fix the unbalance that you spotted?
>> I'm not quite shure that do I understand which bug was addressed for commit 
>> db37a34c563b.
>> AFAIU both bugs mentioned in original patchset was fixed by:
>> 478de3380 ("block, bfq: deschedule empty bfq_queues not referred by any 
>> proces")
>> f718b0932 ( block, bfq: do not plug I/O for bfq_queues with no proc refs)"
>> 
>> So I review commit db37a34c563b as independent one.
>> It introduces extra reference for bfq_groups via bfqg_and_blkg_get(),
>> but do we actually need it here?
>> 
>> #IF CONFIG_BFQ_GROUP_IOSCHED is enabled:
>> bfqd->root_group is holded by bfqd from bfq_init_queue()
>> other bfq_queue objects are owned by corresponding blkcg from bfq_pd_alloc()
>> So bfq_queue can not disappear under us.
>> 
>
> You are right, but incomplete.  No extra ref is needed for an entity
> that represents a bfq_queue.  And this consideration mistook me before
> I realized that that commit was needed.  The problem is that an entity
> may also represent a group of entities.  In that case no reference is
> taken through any bfq_queue.  The commit you want to remove takes this
> missing reference.
Sorry, it looks like I mistyped the sentence above; I meant to say bfq_group.
So here is my statement, corrected:
 #IF CONFIG_BFQ_GROUP_IOSCHED is enabled:
 bfqd->root_group is held by bfqd from bfq_init_queue();
 other *bfq_group* objects are owned by the corresponding blkcg, with the
 reference taken in bfq_pd_alloc().
 So a *bfq_group* cannot disappear under us.

So no extra reference is required for an entity that represents a bfq_group.
The commit is not required.
>
> Paolo
>
>> #IF CONFIG_BFQ_GROUP_IOSCHED is disabled:
>> we have only one  bfqd->root_group object which allocated from 
>> bfq_create_group_hierarch()
>> and bfqg_and_blkg_get() bfqg_and_blkg_put() are noop
>> 
>> Resume: in both cases extra reference is not required, so I continue to
>> insist that we should revert  commit db37a34c563b because it tries to
>> solve a non existing issue, but introduce the real one.
>> 
>> Please correct me if I'm wrong.
>>> 
>>> I'll check it ASAP, unless you do it before me.
>>> 
>>> Thanks,
>>> Paolo
>>> 
>>> [1] https://lkml.org/lkml/2020/1/31/94
>>> 
>>>> Il giorno 2 lug 2020, alle ore 12:57, Dmitry Monakhov 
>>>>  ha scritto:
>>>> 
>>>> commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a 
>>>> service tree")
>>>> introduce leak forbfq_group and blkcg_gq objects because of get/put
>>>> imbalance. See trace balow:
>>>> -> blkg_alloc
>>>>  -> bfq_pq_alloc
>>>>-> bfqg_get (+1)
>>>> ->bfq_activate_bfqq
>>>> ->bfq_activate_requeue_entity
>>>>   -> __bfq_activate_entity
>>>>  ->bfq_get_entity
>> ->> ->bfqg_and_blkg_get (+1)  < : Note1
>>>> ->bfq_del_bfqq_busy
>>>> ->bfq_deactivate_entity+0x53/0xc0 [bfq]
>>>>   ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
>>>> -> bfq_forget_entity(is_in_service = true)
>>>> entity->on_st_or_in_serv = false   <=== :Note2
>>>> if (is_in_service)
>>>> return;  ==> do not touch reference
>>>> -> blkcg_css_offline
>>>> -> blkcg_destroy_blkgs
>>>> -> blkg_destroy
>>>>  -> bfq_pd_offline
>>>>   -> __bfq_deactivate_entity
>>>>if (!entity->on_st_or_in_serv) /* true, because (Note2)
>>>>return false;
>>>> -> bfq_pd_free
>>>>   -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
>>>> So bfq_group and blkcg_gq  will leak forever, see test-case below.
>>>> If fact bfq_group objects reference counting are quite different
>>>> from bfq_queue. bfq_groups object are referenced by blkcg_gq via
>>>> blkg_policy_data pointer, so  neither nor blkg_get() neither bf

Re: [PATCH] bfq: fix blkio cgroup leakage

2020-07-08 Thread Dmitry Monakhov
Paolo Valente  writes:

> Hi,
> sorry for the delay.  The commit you propose to drop fix the issues
> reported in [1].
>
> Such a commit does introduce the leak that you report (thank you for
> spotting it).  Yet, according to the threads mentioned in [1],
> dropping that commit would take us back to those issues.
>
> Maybe the solution is to fix the unbalance that you spotted?
I'm not quite sure that I understand which bug was addressed by commit
db37a34c563b. AFAIU both bugs mentioned in the original patch set were fixed by:
478de3380 ("block, bfq: deschedule empty bfq_queues not referred by any process")
f718b0932 ("block, bfq: do not plug I/O for bfq_queues with no proc refs")

So I treat commit db37a34c563b as an independent one.
It introduces an extra reference for bfq_groups via bfqg_and_blkg_get(),
but do we actually need it here?

#IF CONFIG_BFQ_GROUP_IOSCHED is enabled:
 bfqd->root_group is held by bfqd from bfq_init_queue();
 other bfq_queue objects are owned by the corresponding blkcg from bfq_pd_alloc().
 So a bfq_queue cannot disappear under us.

#IF CONFIG_BFQ_GROUP_IOSCHED is disabled:
 we have only one bfqd->root_group object, which is allocated from
 bfq_create_group_hierarchy(), and bfqg_and_blkg_get()/bfqg_and_blkg_put()
 are no-ops.

To sum up: in both cases the extra reference is not required, so I continue
to insist that we should revert commit db37a34c563b, because it tries to
solve a non-existing issue but introduces a real one.

Please correct me if I'm wrong.
>
> I'll check it ASAP, unless you do it before me.
>
> Thanks,
> Paolo
>
> [1] https://lkml.org/lkml/2020/1/31/94
>
>> Il giorno 2 lug 2020, alle ore 12:57, Dmitry Monakhov  
>> ha scritto:
>> 
>> commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a 
>> service tree")
>> introduce leak forbfq_group and blkcg_gq objects because of get/put
>> imbalance. See trace balow:
>> -> blkg_alloc
>>   -> bfq_pq_alloc
>> -> bfqg_get (+1)
>> ->bfq_activate_bfqq
>>  ->bfq_activate_requeue_entity
>>-> __bfq_activate_entity
>>   ->bfq_get_entity
->> ->bfqg_and_blkg_get (+1)  < : Note1
>> ->bfq_del_bfqq_busy
>>  ->bfq_deactivate_entity+0x53/0xc0 [bfq]
>>->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
>>  -> bfq_forget_entity(is_in_service = true)
>>   entity->on_st_or_in_serv = false   <=== :Note2
>>   if (is_in_service)
>>   return;  ==> do not touch reference
>> -> blkcg_css_offline
>> -> blkcg_destroy_blkgs
>>  -> blkg_destroy
>>   -> bfq_pd_offline
>>-> __bfq_deactivate_entity
>> if (!entity->on_st_or_in_serv) /* true, because (Note2)
>>  return false;
>> -> bfq_pd_free
>>-> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
>> So bfq_group and blkcg_gq  will leak forever, see test-case below.
>> If fact bfq_group objects reference counting are quite different
>> from bfq_queue. bfq_groups object are referenced by blkcg_gq via
>> blkg_policy_data pointer, so  neither nor blkg_get() neither bfqg_get
>> required here.
>> 
>> 
>> This patch drop commit db37a34c563b ("block, bfq: get a ref to a group when 
>> adding it to a service tree")
>> and add corresponding comment.
>> 
>> ##TESTCASE_BEGIN:
>> #!/bin/bash
>> 
>> max_iters=${1:-100}
>> #prep cgroup mounts
>> mount -t tmpfs cgroup_root /sys/fs/cgroup
>> mkdir /sys/fs/cgroup/blkio
>> mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
>> 
>> # Prepare blkdev
>> grep blkio /proc/cgroups
>> truncate -s 1M img
>> losetup /dev/loop0 img
>> echo bfq > /sys/block/loop0/queue/scheduler
>> 
>> grep blkio /proc/cgroups
>> for ((i=0;i<$max_iters;i++))
>> do
>>mkdir -p /sys/fs/cgroup/blkio/a
>>echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
>>dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
>>echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
>>rmdir /sys/fs/cgroup/blkio/a
>>grep blkio /proc/cgroups
>> done
>> ##TESTCASE_END:
>> 
>> Signed-off-by: Dmitry Monakhov 
>> ---
>> block/bfq-cgroup.c  |  2 +-
>> block/bfq-iosched.h |  1 -
>> block/bfq-wf2q.c| 15 +--
>> 3 files changed, 6 insertions(+), 12 deletions(-)
>> 
>> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
>> index 68882b9..b791e20 100644
>> --- a/block/bfq-cgroup.c
>> +++ b/block/bfq-cgroup.c
>> @@ -332,7 +332,7 @@ static void bfqg_put(struct bfq_group *bfqg)
>>

Re: [PATCH] bfq: fix blkio cgroup leakage

2020-07-08 Thread Dmitry Monakhov
Dmitry Monakhov  writes:
Ping. Do you have any objections to this patch?

> commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a 
> service tree")
> introduce leak forbfq_group and blkcg_gq objects because of get/put
> imbalance. See trace balow:
> -> blkg_alloc
>-> bfq_pq_alloc
>  -> bfqg_get (+1)
> ->bfq_activate_bfqq
>   ->bfq_activate_requeue_entity
> -> __bfq_activate_entity
>->bfq_get_entity
>  ->bfqg_and_blkg_get (+1)  < : Note1
> ->bfq_del_bfqq_busy
>   ->bfq_deactivate_entity+0x53/0xc0 [bfq]
> ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
>   -> bfq_forget_entity(is_in_service = true)
>entity->on_st_or_in_serv = false   <=== :Note2
>if (is_in_service)
>return;  ==> do not touch reference
> -> blkcg_css_offline
>  -> blkcg_destroy_blkgs
>   -> blkg_destroy
>-> bfq_pd_offline
> -> __bfq_deactivate_entity
>  if (!entity->on_st_or_in_serv) /* true, because (Note2)
>   return false;
>  -> bfq_pd_free
> -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
> So bfq_group and blkcg_gq  will leak forever, see test-case below.
> If fact bfq_group objects reference counting are quite different
> from bfq_queue. bfq_groups object are referenced by blkcg_gq via
> blkg_policy_data pointer, so  neither nor blkg_get() neither bfqg_get
> required here.
>
>
> This patch drop commit db37a34c563b ("block, bfq: get a ref to a group when 
> adding it to a service tree")
> and add corresponding comment.
>
> ##TESTCASE_BEGIN:
> #!/bin/bash
>
> max_iters=${1:-100}
> #prep cgroup mounts
> mount -t tmpfs cgroup_root /sys/fs/cgroup
> mkdir /sys/fs/cgroup/blkio
> mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
>
> # Prepare blkdev
> grep blkio /proc/cgroups
> truncate -s 1M img
> losetup /dev/loop0 img
> echo bfq > /sys/block/loop0/queue/scheduler
>
> grep blkio /proc/cgroups
> for ((i=0;i<$max_iters;i++))
> do
> mkdir -p /sys/fs/cgroup/blkio/a
> echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
> dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
> echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
> rmdir /sys/fs/cgroup/blkio/a
> grep blkio /proc/cgroups
> done
> ##TESTCASE_END:
>
> Signed-off-by: Dmitry Monakhov 
> ---
>  block/bfq-cgroup.c  |  2 +-
>  block/bfq-iosched.h |  1 -
>  block/bfq-wf2q.c| 15 +--
>  3 files changed, 6 insertions(+), 12 deletions(-)
>
> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
> index 68882b9..b791e20 100644
> --- a/block/bfq-cgroup.c
> +++ b/block/bfq-cgroup.c
> @@ -332,7 +332,7 @@ static void bfqg_put(struct bfq_group *bfqg)
>   kfree(bfqg);
>  }
>  
> -void bfqg_and_blkg_get(struct bfq_group *bfqg)
> +static void bfqg_and_blkg_get(struct bfq_group *bfqg)
>  {
>   /* see comments in bfq_bic_update_cgroup for why refcounting bfqg */
>   bfqg_get(bfqg);
> diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
> index cd224aa..7038952 100644
> --- a/block/bfq-iosched.h
> +++ b/block/bfq-iosched.h
> @@ -986,7 +986,6 @@ struct bfq_group *bfq_find_set_group(struct bfq_data 
> *bfqd,
>  struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg);
>  struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
>  struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int 
> node);
> -void bfqg_and_blkg_get(struct bfq_group *bfqg);
>  void bfqg_and_blkg_put(struct bfq_group *bfqg);
>  
>  #ifdef CONFIG_BFQ_GROUP_IOSCHED
> diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
> index 34ad095..6a363bb 100644
> --- a/block/bfq-wf2q.c
> +++ b/block/bfq-wf2q.c
> @@ -529,13 +529,14 @@ static void bfq_get_entity(struct bfq_entity *entity)
>  {
>   struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
>  
> + /* Grab reference only for bfq_queue's objects, bfq_group ones
> +  * are owned by blkcg_gq
> +  */
>   if (bfqq) {
>   bfqq->ref++;
>   bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
>bfqq, bfqq->ref);
> - } else
> - bfqg_and_blkg_get(container_of(entity, struct bfq_group,
> -entity));
> + }
>  }
>  
>  /**
> @@ -649,14 +650,8 @@ static void bfq_forget_entity(struct bfq_service_tree 
> *st,
>  
>   entity->on_st_or_in_serv = false;
>   st->wsum -= entity->weight;
> - if (is_in_service)
> - return;
> -
> - if (bfqq)
> + if (bfqq && !is_in_service)
>   bfq_put_queue(bfqq);
> - else
> - bfqg_and_blkg_put(container_of(entity, struct bfq_group,
> -entity));
>  }
>  
>  /**
> -- 
> 2.7.4


[PATCH] bfq: fix blkio cgroup leakage

2020-07-02 Thread Dmitry Monakhov
commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a
service tree") introduces a leak of bfq_group and blkcg_gq objects because
of a get/put imbalance. See the trace below:
-> blkg_alloc
   -> bfq_pq_alloc
      -> bfqg_get (+1)
-> bfq_activate_bfqq
   -> bfq_activate_requeue_entity
      -> __bfq_activate_entity
         -> bfq_get_entity
            -> bfqg_and_blkg_get (+1)  <==== :Note1
-> bfq_del_bfqq_busy
   -> bfq_deactivate_entity+0x53/0xc0 [bfq]
      -> __bfq_deactivate_entity+0x1b8/0x210 [bfq]
         -> bfq_forget_entity(is_in_service = true)
            entity->on_st_or_in_serv = false   <=== :Note2
            if (is_in_service)
                return;  ==> do not touch reference
-> blkcg_css_offline
   -> blkcg_destroy_blkgs
      -> blkg_destroy
         -> bfq_pd_offline
            -> __bfq_deactivate_entity
               if (!entity->on_st_or_in_serv) /* true, because of Note2 */
                   return false;
-> bfq_pd_free
   -> bfqg_put() (-1, but bfqg->ref == 2 because of Note2)
So bfq_group and blkcg_gq will leak forever; see the test case below.
In fact, bfq_group reference counting is quite different from bfq_queue's:
bfq_group objects are referenced by blkcg_gq via the blkg_policy_data
pointer, so neither blkg_get() nor bfqg_get() is required here.

This patch drops commit db37a34c563b ("block, bfq: get a ref to a group
when adding it to a service tree") and adds a corresponding comment.

##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<$max_iters;i++))
do
    mkdir -p /sys/fs/cgroup/blkio/a
    echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
    dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
    echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
    rmdir /sys/fs/cgroup/blkio/a
    grep blkio /proc/cgroups
done
##TESTCASE_END:

Signed-off-by: Dmitry Monakhov 
---
 block/bfq-cgroup.c  |  2 +-
 block/bfq-iosched.h |  1 -
 block/bfq-wf2q.c| 15 +--
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 68882b9..b791e20 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -332,7 +332,7 @@ static void bfqg_put(struct bfq_group *bfqg)
kfree(bfqg);
 }
 
-void bfqg_and_blkg_get(struct bfq_group *bfqg)
+static void bfqg_and_blkg_get(struct bfq_group *bfqg)
 {
/* see comments in bfq_bic_update_cgroup for why refcounting bfqg */
bfqg_get(bfqg);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index cd224aa..7038952 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -986,7 +986,6 @@ struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
 struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg);
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node);
-void bfqg_and_blkg_get(struct bfq_group *bfqg);
 void bfqg_and_blkg_put(struct bfq_group *bfqg);
 
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 34ad095..6a363bb 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -529,13 +529,14 @@ static void bfq_get_entity(struct bfq_entity *entity)
 {
struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 
+   /* Grab reference only for bfq_queue's objects, bfq_group ones
+* are owned by blkcg_gq
+*/
if (bfqq) {
bfqq->ref++;
bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
 bfqq, bfqq->ref);
-   } else
-   bfqg_and_blkg_get(container_of(entity, struct bfq_group,
-  entity));
+   }
 }
 
 /**
@@ -649,14 +650,8 @@ static void bfq_forget_entity(struct bfq_service_tree *st,
 
entity->on_st_or_in_serv = false;
st->wsum -= entity->weight;
-   if (is_in_service)
-   return;
-
-   if (bfqq)
+   if (bfqq && !is_in_service)
bfq_put_queue(bfqq);
-   else
-   bfqg_and_blkg_put(container_of(entity, struct bfq_group,
-  entity));
 }
 
 /**
-- 
2.7.4






Re: [PATCH 0/9] block: T10/DIF Fixes and cleanups v3

2017-05-10 Thread Dmitry Monakhov

Christoph Hellwig  writes:

> Hi Dmitry,
>
> can you resend this series? 
Sorry for the very long delay; I'm in the middle of my honeymoon and this is
not a good time for work :)
> I really think we should get this into 4.12 at least.
Please see updated version in the LKML list.



Re: [PATCH 3/9] bio-integrity: bio_integrity_advance must update integrity seed

2017-05-03 Thread Dmitry Monakhov
"Martin K. Petersen" <martin.peter...@oracle.com> writes:

> Dmitry Monakhov <dmonak...@openvz.org> writes:
>
>> SCSI drivers do care about bip_seed so we must update it accordingly.
>
>> +bip->bip_iter.bi_sector += bytes_done >> 9;
>
> This needs to count protection intervals. Otherwise things will break
> for block sizes different from 512 bytes.
No,
AFAIU bip->bip_iter.bi_sector is always equal to bio->bi_iter.bi_sector;
at least bip_set_seed() and bip_get_seed() rely on that.
Only bip->bip_vec must be advanced in intervals (this behavior is not
changed by this patch).
>
> -- 
> Martin K. Petersen  Oracle Linux Engineering




Re: [PATCH 1/5] bh: Prevent panic on invalid BHs

2017-04-06 Thread Dmitry Monakhov
Christoph Hellwig  writes:

> This looks ok, but how did you manage to trigger this case? 

# testcases
# TEST1
# Via bug in fallocate
truncate -s 1G img
losetup  /dev/loop0 img
mkfs.ext4 -qF /dev/loop0
mkdir m
mount /dev/loop0 m
# the fallocate command below truncates the bdev's page cache
xfs_io -c "falloc -k 0 32G" -d /dev/loop0
for ((i=0;i<100;i++));do
xfs_io -c "pwrite 0 4k" -d m/test-$i;
done
sync



# TEST2: NBD close_sock -> kill_bdev
mkdir  -p a/mnt
cd a
truncate -s 1G img
mkfs.ext4 -qF img
qemu-nbd -c /dev/nbd0 img
mount /dev/nbd0 /mnt
cp -r /bin/ /mnt&
# Disconnect nbd while cp is active
qemu-nbd -d /dev/nbd0
sync

> I think
> we might have a deeper problem here.
Probably. It seems that the !buffer_locked(bh) case should stay a BUG_ON
because it is hard to make a semi-correct decision here.




Re: [PATCH 5/5] block: truncate page cache only when necessary on fallocate

2017-04-06 Thread Dmitry Monakhov
Christoph Hellwig <h...@infradead.org> writes:

> why?
because it is not a good thing to truncate the page cache and then, a few
lines later, realize that the feature is not supported (!blk_queue_discard(q))
and return -EOPNOTSUPP.

Even more: if mode == FALLOC_FL_KEEP_SIZE then we do nothing and
return -EOPNOTSUPP unconditionally.

IMHO (mode == FALLOC_FL_KEEP_SIZE) is a sane API for thin-provisioned blkdevs
to preallocate space in advance. Nobody uses it at the moment, but it may
be useful in the future.
>
> On Thu, Apr 06, 2017 at 04:02:49PM +0400, Dmitry Monakhov wrote:
>> Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
>> ---
>>  fs/block_dev.c | 9 -
>>  1 file changed, 4 insertions(+), 5 deletions(-)
>> 
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 2eca00e..f4b13e1 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -2075,7 +2075,7 @@ static long blkdev_fallocate(struct file *file, int 
>> mode, loff_t start,
>>  {
>>  struct block_device *bdev = I_BDEV(bdev_file_inode(file));
>>  struct request_queue *q = bdev_get_queue(bdev);
>> -struct address_space *mapping;
>> +struct address_space *mapping = bdev->bd_inode->i_mapping;
>>  loff_t end = start + len - 1;
>>  loff_t isize;
>>  int error;
>> @@ -2102,13 +2102,10 @@ static long blkdev_fallocate(struct file *file, int 
>> mode, loff_t start,
>>  if ((start | len) & (bdev_logical_block_size(bdev) - 1))
>>  return -EINVAL;
>>  
>> -/* Invalidate the page cache, including dirty pages. */
>> -mapping = bdev->bd_inode->i_mapping;
>> -truncate_inode_pages_range(mapping, start, end);
>> -
>>  switch (mode) {
>>  case FALLOC_FL_ZERO_RANGE:
>>  case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
>> +truncate_inode_pages_range(mapping, start, end);
>>  error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
>>  GFP_KERNEL, false);
>>  break;
>> @@ -2116,12 +2113,14 @@ static long blkdev_fallocate(struct file *file, int 
>> mode, loff_t start,
>>  /* Only punch if the device can do zeroing discard. */
>>  if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
>>  return -EOPNOTSUPP;
>> +truncate_inode_pages_range(mapping, start, end);
>>  error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
>>   GFP_KERNEL, 0);
>>  break;
>>  case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
>> FALLOC_FL_NO_HIDE_STALE:
>>  if (!blk_queue_discard(q))
>>  return -EOPNOTSUPP;
>> +truncate_inode_pages_range(mapping, start, end);
>>  error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
>>   GFP_KERNEL, 0);
>>  break;
>> -- 
>> 2.9.3
>> 
> ---end quoted text---




[PATCH 2/5] block: protect bdevname from null pointer bdev

2017-04-06 Thread Dmitry Monakhov
Some callers (usually on error paths) call bdevname() with a null bdev.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/partition-generic.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 7afb990..284de18 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -46,6 +46,8 @@ char *disk_name(struct gendisk *hd, int partno, char *buf)
 
 const char *bdevname(struct block_device *bdev, char *buf)
 {
+   if (unlikely(!bdev)) {
+   snprintf(buf, BDEVNAME_SIZE, "unknown-block(null)");
+   return buf;
+   }
return disk_name(bdev->bd_disk, bdev->bd_part->partno, buf);
 }
 
-- 
2.9.3



[PATCH 3/5] bio: Protect submit_bio from bdevless bio-s

2017-04-06 Thread Dmitry Monakhov
Ideally this type of check would be handled in generic_make_request_checks(),
but that is too late for bdevless bios: the bad pointer has already been
dereferenced by that point.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/blk-core.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 275c1e4..e583ded 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2065,6 +2065,11 @@ EXPORT_SYMBOL(generic_make_request);
  */
 blk_qc_t submit_bio(struct bio *bio)
 {
+   if (WARN_ON_ONCE(!bio->bi_bdev)) {
+   bio_io_error(bio);
+   return BLK_QC_T_NONE;
+   }
+
/*
 * If it's a regular read/write or a barrier with data attached,
 * go through the normal accounting stuff before submission.
-- 
2.9.3









[PATCH 1/5] bh: Prevent panic on invalid BHs

2017-04-06 Thread Dmitry Monakhov
- Convert BUG_ON to WARN_ON + EIO in submit_bh.
  Leave BUG_ON(!bh->b_end_io) as is because that is a static bug in the
  submission logic which cannot be handled at runtime anyway.
  An unmapped BH is also a special case which signals user
  misbehavior, so just dump an error message.

- Guard __find_get_block from a null pointer bdev.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/buffer.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9196f2a..4c8ce74 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1355,8 +1355,12 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, 
unsigned size)
 struct buffer_head *
 __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
 {
-   struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
+   struct buffer_head *bh;
+
+   if (WARN_ON_ONCE(!bdev))
+   return NULL;
 
+   bh = lookup_bh_lru(bdev, block, size);
if (bh == NULL) {
/* __find_get_block_slow will mark the page accessed */
bh = __find_get_block_slow(bdev, block);
@@ -3099,11 +3103,18 @@ static int submit_bh_wbc(int op, int op_flags, struct 
buffer_head *bh,
 {
struct bio *bio;
 
-   BUG_ON(!buffer_locked(bh));
-   BUG_ON(!buffer_mapped(bh));
BUG_ON(!bh->b_end_io);
-   BUG_ON(buffer_delay(bh));
-   BUG_ON(buffer_unwritten(bh));
+
+   if (WARN_ON_ONCE(!buffer_locked(bh)))
+   goto bad_bh;
+   if (WARN_ON_ONCE(buffer_delay(bh)))
+   goto bad_bh;
+   if (WARN_ON_ONCE(buffer_unwritten(bh)))
+   goto bad_bh;
+   if (unlikely(!buffer_mapped(bh))) {
+   buffer_io_error(bh, ", bh not mapped");
+   goto bad_bh;
+   }
 
/*
 * Only clear out a write error when rewriting
@@ -3143,6 +3154,9 @@ static int submit_bh_wbc(int op, int op_flags, struct 
buffer_head *bh,
 
submit_bio(bio);
return 0;
+bad_bh:
+   bh->b_end_io(bh, 0);
+   return -EIO;
 }
 
 int _submit_bh(int op, int op_flags, struct buffer_head *bh,
-- 
2.9.3






[PATCH 5/5] block: truncate page cache only when necessary on fallocate

2017-04-06 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/block_dev.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00e..f4b13e1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2075,7 +2075,7 @@ static long blkdev_fallocate(struct file *file, int mode, 
loff_t start,
 {
struct block_device *bdev = I_BDEV(bdev_file_inode(file));
struct request_queue *q = bdev_get_queue(bdev);
-   struct address_space *mapping;
+   struct address_space *mapping = bdev->bd_inode->i_mapping;
loff_t end = start + len - 1;
loff_t isize;
int error;
@@ -2102,13 +2102,10 @@ static long blkdev_fallocate(struct file *file, int 
mode, loff_t start,
if ((start | len) & (bdev_logical_block_size(bdev) - 1))
return -EINVAL;
 
-   /* Invalidate the page cache, including dirty pages. */
-   mapping = bdev->bd_inode->i_mapping;
-   truncate_inode_pages_range(mapping, start, end);
-
switch (mode) {
case FALLOC_FL_ZERO_RANGE:
case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
+   truncate_inode_pages_range(mapping, start, end);
error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
GFP_KERNEL, false);
break;
@@ -2116,12 +2113,14 @@ static long blkdev_fallocate(struct file *file, int 
mode, loff_t start,
/* Only punch if the device can do zeroing discard. */
if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
return -EOPNOTSUPP;
+   truncate_inode_pages_range(mapping, start, end);
error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 GFP_KERNEL, 0);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
FALLOC_FL_NO_HIDE_STALE:
if (!blk_queue_discard(q))
return -EOPNOTSUPP;
+   truncate_inode_pages_range(mapping, start, end);
error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 GFP_KERNEL, 0);
break;
-- 
2.9.3



[PATCH 4/5] jbd2: use stable bdev pointer

2017-04-06 Thread Dmitry Monakhov
This prevents us from panicking if someone invalidates the bh under us.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/jbd2/revoke.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index f9aefcd..e3b791d 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -459,7 +459,8 @@ int jbd2_journal_cancel_revoke(handle_t *handle, struct 
journal_head *jh)
 * state machine will get very upset later on. */
if (need_cancel) {
struct buffer_head *bh2;
-   bh2 = __find_get_block(bh->b_bdev, bh->b_blocknr, bh->b_size);
+   bh2 = __find_get_block(journal->j_dev, bh->b_blocknr,
+  bh->b_size);
if (bh2) {
if (bh2 != bh)
clear_buffer_revoked(bh2);
-- 
2.9.3






[PATCH 0/5] falloc on blockdevice: what could possibly go wrong?

2017-04-06 Thread Dmitry Monakhov
If you saw the command "fallocate -k -l 1G /dev/vda" you would probably think
that the user wants to preallocate space on a thin-provisioned blkdev. Right?
What could possibly go wrong? Unfortunately you may destroy your filesystem
and panic the kernel. The reason is a bug in blkdev_fallocate() which
unconditionally truncates the bdev cache. But even if we fix this particular
bug, there are other places where we still have to truncate the blkdev cache
even while an FS is mounted and holds some bh's:

1) nbd: if the server disconnected we call kill_bdev(), which destroys the bdev cache
2) bdev falloc { FALLOC_FL_ZERO_RANGE, FALLOC_FL_PUNCH_HOLE } definitely
   expects the bdev cache to be truncated.
3) ioctl: BLKDISCARD also must truncate the bdev cache

There is a discussion about whether we should permit (2) and (3) at all on a
bdev with an active filesystem; why shouldn't we force bd_claim for this? But
this is an advisory user-space interface, because for historical reasons we
allow direct_io to a blkdev while an fs is mounted.

I prefer to treat all three cases, while an FS is mounted, as runtime errors.
The fs may be corrupted, but we should not panic.
This patchset guards the fs/blk layers from panics in case of such runtime errors.
0001-bh-Prevent-panic-on-invalid-BHs
0002-block-protect-bdevname-from-null-pointer-bdev
0003-bio-Protect-submit_bio-from-bdevless-bio-s
0004-jbd2-use-stable-bdev-pointer
# Finally fix the bug with unconditional cache truncate on bdev
0005-block-truncate-page-cache-only-when-necessary-on-falloc

Testcases:
  xfstests: ./check blockdev/004 blockdev/005
  https://github.com/dmonakhov/xfstests/tree/blkdev-falloc-tests-v1

TODO: prepare a util-linux patch: fallocate(1) should claim the bdev.




[PATCH 1/9] bio-integrity: Do not allocate integrity context for bio w/o data

2017-04-04 Thread Dmitry Monakhov
If bio has no data, such as ones from blkdev_issue_flush(),
then we have nothing to protect.

This patch prevents a BUG_ON like the following:

kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
kernel BUG at mm/slab.c:2773!
invalid opcode:  [#1] SMP
Modules linked in: bcache
CPU: 0 PID: 4428 Comm: xfs_io Tainted: GW   
4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
task: 880137786440 task.stack: c9ba8000
RIP: 0010:kfree_debugcheck+0x25/0x2a
RSP: 0018:c9babde0 EFLAGS: 00010082
RAX: 0034 RBX: ac1fa1d106742a5a RCX: 0007
RDX:  RSI:  RDI: 88013f3ccb40
RBP: c9babde8 R08:  R09: 
R10: fcb76420 R11: 725172ed R12: 0282
R13: 8150e766 R14: 88013a145e00 R15: 0001
FS:  7fb09384bf40() GS:88013f20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fd0172f9e40 CR3: 000137fa9000 CR4: 06f0
Call Trace:
 kfree+0xc8/0x1b3
 bio_integrity_free+0xc3/0x16b
 bio_free+0x25/0x66
 bio_put+0x14/0x26
 blkdev_issue_flush+0x7a/0x85
 blkdev_fsync+0x35/0x42
 vfs_fsync_range+0x8e/0x9f
 vfs_fsync+0x1c/0x1e
 do_fsync+0x31/0x4a
 SyS_fsync+0x10/0x14
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio-integrity.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5384713..b5009a8 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -175,6 +175,9 @@ bool bio_integrity_enabled(struct bio *bio)
if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
return false;
 
+   if (!bio_sectors(bio))
+   return false;
+
/* Already protected? */
if (bio_integrity(bio))
return false;
-- 
2.9.3






[PATCH 3/9] bio-integrity: bio_integrity_advance must update integrity seed

2017-04-04 Thread Dmitry Monakhov
SCSI drivers do care about bip_seed so we must update it accordingly.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio-integrity.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b5009a8..82a6ffb 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -425,6 +425,7 @@ void bio_integrity_advance(struct bio *bio, unsigned int 
bytes_done)
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
 
+   bip->bip_iter.bi_sector += bytes_done >> 9;
	bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
 }
 EXPORT_SYMBOL(bio_integrity_advance);
-- 
2.9.3






[PATCH 4/9] bio-integrity: fix interface for bio_integrity_trim v2

2017-04-04 Thread Dmitry Monakhov
bio_integrity_trim() inherited its interface from bio_trim() and accepts an
offset and size, but this API is error prone because the data offset must
always be in sync with the bio's data offset. That is why we have the
integrity update hook in bio_advance().

So the only meaningful values are: offset == 0, sectors == bio_sectors(bio).
Let's just remove both arguments completely.

changes from v1:
 - remove 'sectors' arguments

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio-integrity.c | 11 ++-
 block/bio.c   |  4 ++--
 drivers/md/dm.c   |  2 +-
 include/linux/bio.h   |  5 ++---
 4 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 82a6ffb..6170dad 100644
[PATCH 4/9] bio-integrity: fix interface for bio_integrity_trim v2

2017-04-04 Thread Dmitry Monakhov
bio_integrity_trim inherited its interface from bio_trim and accepts an
offset and size, but this API is error-prone because the integrity offset
must always be in sync with the bio's data offset. That is why we have an
integrity update hook in bio_advance().

So the only meaningful values are offset == 0 and sectors == bio_sectors(bio).
Let's just remove these arguments completely.

changes from v1:
 - remove 'sectors' arguments

Signed-off-by: Dmitry Monakhov 
---
 block/bio-integrity.c | 11 ++-
 block/bio.c   |  4 ++--
 drivers/md/dm.c   |  2 +-
 include/linux/bio.h   |  5 ++---
 4 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 82a6ffb..6170dad 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -433,22 +433,15 @@ EXPORT_SYMBOL(bio_integrity_advance);
 /**
  * bio_integrity_trim - Trim integrity vector
  * @bio:   bio whose integrity vector to update
- * @offset:offset to first data sector
- * @sectors:   number of data sectors
  *
  * Description: Used to trim the integrity vector in a cloned bio.
- * The ivec will be advanced corresponding to 'offset' data sectors
- * and the length will be truncated corresponding to 'len' data
- * sectors.
  */
-void bio_integrity_trim(struct bio *bio, unsigned int offset,
-   unsigned int sectors)
+void bio_integrity_trim(struct bio *bio)
 {
struct bio_integrity_payload *bip = bio_integrity(bio);
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 
-   bio_integrity_advance(bio, offset << 9);
-   bip->bip_iter.bi_size = bio_integrity_bytes(bi, sectors);
+   bip->bip_iter.bi_size = bio_integrity_bytes(bi, bio_sectors(bio));
 }
 EXPORT_SYMBOL(bio_integrity_trim);
 
diff --git a/block/bio.c b/block/bio.c
index fa84323..65da4c9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1878,7 +1878,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
split->bi_iter.bi_size = sectors << 9;
 
if (bio_integrity(split))
-   bio_integrity_trim(split, 0, sectors);
+   bio_integrity_trim(split);
 
bio_advance(bio, split->bi_iter.bi_size);
 
@@ -1909,7 +1909,7 @@ void bio_trim(struct bio *bio, int offset, int size)
bio->bi_iter.bi_size = size;
 
if (bio_integrity(bio))
-   bio_integrity_trim(bio, 0, size);
+   bio_integrity_trim(bio);
 
 }
 EXPORT_SYMBOL_GPL(bio_trim);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb7597..25c5a8b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1102,7 +1102,7 @@ static int clone_bio(struct dm_target_io *tio, struct bio 
*bio,
clone->bi_iter.bi_size = to_bytes(len);
 
if (bio_integrity(bio))
-   bio_integrity_trim(clone, 0, len);
+   bio_integrity_trim(clone);
 
return 0;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..7f7bf37 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -731,7 +731,7 @@ extern bool bio_integrity_enabled(struct bio *bio);
 extern int bio_integrity_prep(struct bio *);
 extern void bio_integrity_endio(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
-extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
+extern void bio_integrity_trim(struct bio *);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
 extern int bioset_integrity_create(struct bio_set *, int);
 extern void bioset_integrity_free(struct bio_set *);
@@ -781,8 +781,7 @@ static inline void bio_integrity_advance(struct bio *bio,
return;
 }
 
-static inline void bio_integrity_trim(struct bio *bio, unsigned int offset,
- unsigned int sectors)
+static inline void bio_integrity_trim(struct bio *bio)
 {
return;
 }
-- 
2.9.3



[PATCH 8/9] bio: add bvec_iter rewind API

2017-04-04 Thread Dmitry Monakhov
Some ->bi_end_io handlers (for example pi_verify or decrypt handlers)
need to know the original data vector, but after a bio traverses the I/O
stack it may be advanced, split and relocated many times, so it is hard
to guess the original iterator. Let's add a 'bi_done' counter which
accounts for the number of bytes the iterator was advanced during its
evolution. Later an end_io handler can easily restore the original
iterator by rewinding it by iter->bi_done.

Note: this change makes sizeof(struct bvec_iter) a multiple of 8.
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 include/linux/bio.h  | 21 +++--
 include/linux/bvec.h | 26 ++
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 12b7fcd..491fb0c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -166,9 +166,10 @@ static inline void bio_advance_iter(struct bio *bio, 
struct bvec_iter *iter,
 {
iter->bi_sector += bytes >> 9;
 
-   if (bio_no_advance_iter(bio))
+   if (bio_no_advance_iter(bio)) {
iter->bi_size -= bytes;
-   else {
+   iter->bi_done += bytes;
+   } else {
int err;
err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
if (unlikely(err))
@@ -176,6 +177,22 @@ static inline void bio_advance_iter(struct bio *bio, 
struct bvec_iter *iter,
}
 }
 
+static inline void bio_rewind_iter(struct bio *bio, struct bvec_iter *iter,
+   unsigned bytes)
+{
+   iter->bi_sector -= bytes >> 9;
+
+   if (bio_no_advance_iter(bio)) {
+   iter->bi_size += bytes;
+   iter->bi_done -= bytes;
+   } else {
+   int err;
+   err = bvec_iter_rewind(bio->bi_io_vec, iter, bytes);
+   if (unlikely(err))
+   bio->bi_error = err;
+   }
+}
+
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
for (iter = (start);\
 (iter).bi_size &&  \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 984a7a8..35cb97b 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -40,6 +40,8 @@ struct bvec_iter {
 
unsigned int    bi_idx;         /* current index into bvl_vec */
 
+   unsigned int    bi_done;        /* number of bytes completed */
+
unsigned int    bi_bvec_done;   /* number of bytes completed in
                                   current bvec */
 };
@@ -84,6 +86,7 @@ static inline int bvec_iter_advance(const struct bio_vec *bv,
bytes -= len;
iter->bi_size -= len;
iter->bi_bvec_done += len;
+   iter->bi_done += len;
 
if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
iter->bi_bvec_done = 0;
@@ -93,6 +96,29 @@ static inline int bvec_iter_advance(const struct bio_vec *bv,
return 0;
 }
 
+static inline int bvec_iter_rewind(const struct bio_vec *bv,
+struct bvec_iter *iter,
+unsigned bytes)
+{
+   while (bytes) {
+   unsigned len = min(bytes, iter->bi_bvec_done);
+   if (iter->bi_bvec_done == 0) {
+   if (WARN_ONCE(iter->bi_idx == 0,
+ "Attempted to rewind iter beyond "
+ "bvec's boundaries\n")) {
+   return -EINVAL;
+   }
+   iter->bi_idx--;
+   iter->bi_bvec_done = __bvec_iter_bvec(bv, *iter)->bv_len;
+   continue;
+   }
+   bytes -= len;
+   iter->bi_size += len;
+   iter->bi_bvec_done -= len;
+   }
+   return 0;
+}
+
 #define for_each_bvec(bvl, bio_vec, iter, start)   \
for (iter = (start);\
 (iter).bi_size &&  \
-- 
2.9.3



[PATCH 7/9] Guard bvec iteration logic v3

2017-04-04 Thread Dmitry Monakhov
Currently, if someone tries to advance a bvec iterator beyond its size we
simply dump a WARN_ONCE and continue to iterate beyond the bvec array
boundaries. This means we end up dereferencing/corrupting a random memory
region.

A sane reaction would be to propagate the error back to the calling
context, but bvec_iter_advance's calling context is not always suited for
error handling. For safety, let's truncate the iterator size to zero,
which breaks the external iteration loop and prevents unpredictable
memory-range corruption. Even if the caller ignores the error, it will
corrupt its own bvecs, not others.

This patch does:
- Return the error back to the caller in the hope that it will react to it
- Truncate the iterator size

The code was added a long time ago in commit 4550dd6c; luckily no one has
hit it in real life :)

changes since V1:
 - Replace BUG_ON with error logic.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/nvdimm/blk.c |  4 +++-
 drivers/nvdimm/btt.c |  4 +++-
 include/linux/bio.h  |  8 ++--
 include/linux/bvec.h | 11 ---
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 0b49336..c82331b 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -106,7 +106,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk 
*nsblk,
 
len -= cur_len;
dev_offset += cur_len;
-   bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+   err = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+   if (err)
+   return err;
}
 
return err;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 3d7a9fe..5a68681 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -942,7 +942,9 @@ static int btt_rw_integrity(struct btt *btt, struct 
bio_integrity_payload *bip,
 
len -= cur_len;
meta_nsoff += cur_len;
-   bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+   ret = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+   if (ret)
+   return ret;
}
 
return ret;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 65445c0..12b7fcd 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -168,8 +168,12 @@ static inline void bio_advance_iter(struct bio *bio, 
struct bvec_iter *iter,
 
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
-   else
-   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   else {
+   int err;
+   err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   if (unlikely(err))
+   bio->bi_error = err;
+   }
 }
 
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 89b65b8..984a7a8 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -22,6 +22,7 @@
 
 #include <linux/kernel.h>
 #include <linux/bug.h>
+#include <linux/errno.h>
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -66,12 +67,15 @@ struct bvec_iter {
.bv_offset  = bvec_iter_offset((bvec), (iter)), \
 })
 
-static inline void bvec_iter_advance(const struct bio_vec *bv,
+static inline int bvec_iter_advance(const struct bio_vec *bv,
 struct bvec_iter *iter,
 unsigned bytes)
 {
-   WARN_ONCE(bytes > iter->bi_size,
- "Attempted to advance past end of bvec iter\n");
+   if (WARN_ONCE(bytes > iter->bi_size,
+"Attempted to advance past end of bvec iter\n")) {
+   iter->bi_size = 0;
+   return -EINVAL;
+   }
 
while (bytes) {
unsigned iter_len = bvec_iter_len(bv, *iter);
@@ -86,6 +90,7 @@ static inline void bvec_iter_advance(const struct bio_vec *bv,
iter->bi_idx++;
}
}
+   return 0;
 }
 
 #define for_each_bvec(bvl, bio_vec, iter, start)   \
-- 
2.9.3



[PATCH 6/9] T10: Move open-coded constants to a common header

2017-04-04 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov 
---
 block/t10-pi.c   | 9 +++--
 drivers/scsi/lpfc/lpfc_scsi.c| 5 +++--
 drivers/scsi/qla2xxx/qla_isr.c   | 8 
 drivers/target/target_core_sbc.c | 2 +-
 include/linux/t10-pi.h   | 2 ++
 5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index 2c97912..485cecd 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -28,9 +28,6 @@
 
 typedef __be16 (csum_fn) (void *, unsigned int);
 
-static const __be16 APP_ESCAPE = (__force __be16) 0xffff;
-static const __be32 REF_ESCAPE = (__force __be32) 0xffffffff;
-
 static __be16 t10_pi_crc_fn(void *data, unsigned int len)
 {
return cpu_to_be16(crc_t10dif(data, len));
@@ -82,7 +79,7 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, 
csum_fn *fn,
switch (type) {
case 1:
case 2:
-   if (pi->app_tag == APP_ESCAPE)
+   if (pi->app_tag == T10_APP_ESCAPE)
goto next;
 
if (be32_to_cpu(pi->ref_tag) !=
@@ -95,8 +92,8 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, 
csum_fn *fn,
}
break;
case 3:
-   if (pi->app_tag == APP_ESCAPE &&
-   pi->ref_tag == REF_ESCAPE)
+   if (pi->app_tag == T10_APP_ESCAPE &&
+   pi->ref_tag == T10_REF_ESCAPE)
goto next;
break;
}
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 54fd0c8..6f6b40e 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/t10-pi.h>
 #include 
 #include 
 
@@ -2934,8 +2935,8 @@ lpfc_calc_bg_err(struct lpfc_hba *phba, struct 
lpfc_scsi_buf *lpfc_cmd)
 * First check to see if a protection data
 * check is valid
 */
-   if ((src->ref_tag == 0xffffffff) ||
-   (src->app_tag == 0xffff)) {
+   if ((src->ref_tag == T10_REF_ESCAPE) ||
+   (src->app_tag == T10_APP_ESCAPE)) {
start_ref_tag++;
goto skipit;
}
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 3203367..ed4b302 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -1950,9 +1950,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx 
*sts24)
 * For type 3: ref & app tag is all 'f's
 * For type 0,1,2: app tag is all 'f's
 */
-   if ((a_app_tag == 0xffff) &&
+   if ((a_app_tag == T10_APP_ESCAPE) &&
((scsi_get_prot_type(cmd) != SCSI_PROT_DIF_TYPE3) ||
-(a_ref_tag == 0xffffffff))) {
+(a_ref_tag == T10_REF_ESCAPE))) {
uint32_t blocks_done, resid;
sector_t lba_s = scsi_get_lba(cmd);
 
@@ -1994,9 +1994,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx 
*sts24)
spt = page_address(sg_page(sg)) + sg->offset;
spt += j;
 
-   spt->app_tag = 0xffff;
+   spt->app_tag = T10_APP_ESCAPE;
if (scsi_get_prot_type(cmd) == SCSI_PROT_DIF_TYPE3)
-   spt->ref_tag = 0xffffffff;
+   spt->ref_tag = T10_REF_ESCAPE;
}
 
return 0;
diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index c194063..927ef44 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -1446,7 +1446,7 @@ sbc_dif_verify(struct se_cmd *cmd, sector_t start, 
unsigned int sectors,
 (unsigned long long)sector, sdt->guard_tag,
 sdt->app_tag, be32_to_cpu(sdt->ref_tag));
 
-   if (sdt->app_tag == cpu_to_be16(0xffff)) {
+   if (sdt->app_tag == T10_APP_ESCAPE) {
dsg_off += block_size;
goto next;
}
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index 9fba9dd..f892442 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -33,6 +33,8 @@ struct t10_pi_tuple {
__be32 ref_tag; /* Target LBA or indirect LBA */
 };
 
+#define T10_APP_ESCAPE cpu_to_be16(0xffff)
+#define T10_REF_ESCAPE cpu_to_be32(0xffffffff)
 
 extern struct blk_integrity_profile t10_pi_type1_crc;
 extern struct blk_integrity_profile t10_pi_type1_ip;
-- 
2.9.3



[PATCH 9/9] bio-integrity: Restore original iterator on verify stage

2017-04-04 Thread Dmitry Monakhov
Currently ->verify_fn does not work at all because, at the moment it is
called, bio->bi_iter.bi_size == 0, so we do not iterate over integrity
bvecs at all.

In order to perform verification we need to know the original data vector;
with the new bvec rewind API this is trivial.

testcase: 
https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio-integrity.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index f2b9f09..f08096d 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -184,9 +184,10 @@ static inline unsigned int bio_integrity_bytes(struct 
blk_integrity *bi,
 /**
  * bio_integrity_process - Process integrity metadata for a bio
  * @bio:   bio to generate/verify integrity metadata for
+ * @proc_iter:  iterator to process
  * @proc_fn:   Pointer to the relevant processing function
  */
-static int bio_integrity_process(struct bio *bio,
+static int bio_integrity_process(struct bio *bio, struct bvec_iter *proc_iter,
 integrity_processing_fn *proc_fn)
 {
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
@@ -200,10 +201,10 @@ static int bio_integrity_process(struct bio *bio,
 
iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
iter.interval = 1 << bi->interval_exp;
-   iter.seed = bip_get_seed(bip);
+   iter.seed = proc_iter->bi_sector;
iter.prot_buf = prot_buf;
 
-   bio_for_each_segment(bv, bio, bviter) {
+   __bio_for_each_segment(bv, bio, bviter, *proc_iter) {
void *kaddr = kmap_atomic(bv.bv_page);
 
iter.data_buf = kaddr + bv.bv_offset;
@@ -310,7 +311,8 @@ static int bio_integrity_setup(struct bio *bio)
 
/* Auto-generate integrity metadata if this is a write */
if (bio_data_dir(bio) == WRITE)
-   bio_integrity_process(bio, bi->profile->generate_fn);
+   bio_integrity_process(bio, &bio->bi_iter,
+ bi->profile->generate_fn);
 
return 0;
 }
@@ -376,9 +378,15 @@ static void bio_integrity_verify_fn(struct work_struct 
*work)
container_of(work, struct bio_integrity_payload, bip_work);
struct bio *bio = bip->bip_bio;
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-
-   bio->bi_error = bio_integrity_process(bio, bi->profile->verify_fn);
-
+   struct bvec_iter iter = bio->bi_iter;
+
+   /* By the time verify is called, the bio's iterator has been
+* advanced during split and completion, so we need to rewind the
+* iterator to its original position */
+   bio_rewind_iter(bio, &iter, iter.bi_done);
+   if (!bio->bi_error)
+   bio->bi_error = bio_integrity_process(bio, &iter,
+ bi->profile->verify_fn);
/* Restore original bio completion handler */
bio->bi_end_io = bip->bip_end_io;
bio_endio(bio);
-- 
2.9.3



[PATCH 5/9] bio-integrity: fold bio_integrity_enabled to bio_integrity_prep

2017-04-04 Thread Dmitry Monakhov
Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore its error code and fail the bio with EIO. Let's return the real
error to the upper layer, so the caller can react accordingly.

In fact no one wants to use bio_integrity_prep() without
bio_integrity_enabled(), so it is reasonable to fold them into one
function.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 Documentation/block/data-integrity.txt |  3 --
 block/bio-integrity.c  | 88 ++
 block/blk-core.c   |  5 +-
 block/blk-mq.c |  8 +---
 drivers/nvdimm/blk.c   | 13 +
 drivers/nvdimm/btt.c   | 13 +
 include/linux/bio.h|  6 ---
 7 files changed, 55 insertions(+), 81 deletions(-)

diff --git a/Documentation/block/data-integrity.txt 
b/Documentation/block/data-integrity.txt
index f56ec97..37ac837 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -202,9 +202,6 @@ will require extra work due to the application tag.
   added.  It is up to the caller to ensure that the bio does not
   change while I/O is in progress.
 
-  bio_integrity_prep() should only be called if
-  bio_integrity_enabled() returned 1.
-
 
 5.3 PASSING EXISTING INTEGRITY METADATA
 
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 6170dad..f2b9f09 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -160,44 +160,6 @@ int bio_integrity_add_page(struct bio *bio, struct page 
*page,
 EXPORT_SYMBOL(bio_integrity_add_page);
 
 /**
[PATCH 9/9] bio-integrity: Restore original iterator on verify stage

2017-04-04 Thread Dmitry Monakhov
Currently ->verify_fn does not work at all, because at the moment it is
called bio->bi_iter.bi_size == 0, so we do not iterate over the integrity
bvecs at all.

In order to perform verification we need to know the original data vector;
with the new bvec rewind API this is trivial.
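
The bookkeeping behind that rewind can be sketched in plain C. This is an
illustrative userspace model (struct and function names are made up, not the
kernel `struct bvec_iter` API): advancing consumes `bi_size` and accumulates
`bi_done`, so rewinding by `bi_done` before ->verify_fn runs restores the
original position exactly.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of an iterator with rewindable progress tracking. */
struct toy_bvec_iter {
	size_t bi_sector;	/* current 512-byte sector */
	size_t bi_size;		/* bytes remaining */
	size_t bi_done;		/* bytes already completed */
};

static void toy_advance(struct toy_bvec_iter *iter, size_t bytes)
{
	iter->bi_sector += bytes >> 9;
	iter->bi_size -= bytes;
	iter->bi_done += bytes;
}

static void toy_rewind(struct toy_bvec_iter *iter, size_t bytes)
{
	iter->bi_sector -= bytes >> 9;
	iter->bi_size += bytes;
	iter->bi_done -= bytes;
}
```

Rewinding by `iter.bi_done` is therefore always a no-op inverse of however
many advances happened, which is exactly what the verify path needs.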

testcase: 
https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85
Signed-off-by: Dmitry Monakhov 
---
 block/bio-integrity.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index f2b9f09..f08096d 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -184,9 +184,10 @@ static inline unsigned int bio_integrity_bytes(struct blk_integrity *bi,
 /**
  * bio_integrity_process - Process integrity metadata for a bio
  * @bio:   bio to generate/verify integrity metadata for
+ * @proc_iter:  iterator to process
  * @proc_fn:   Pointer to the relevant processing function
  */
-static int bio_integrity_process(struct bio *bio,
+static int bio_integrity_process(struct bio *bio, struct bvec_iter *proc_iter,
 integrity_processing_fn *proc_fn)
 {
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
@@ -200,10 +201,10 @@ static int bio_integrity_process(struct bio *bio,
 
iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
iter.interval = 1 << bi->interval_exp;
-   iter.seed = bip_get_seed(bip);
+   iter.seed = proc_iter->bi_sector;
iter.prot_buf = prot_buf;
 
-   bio_for_each_segment(bv, bio, bviter) {
+   __bio_for_each_segment(bv, bio, bviter, *proc_iter) {
void *kaddr = kmap_atomic(bv.bv_page);
 
iter.data_buf = kaddr + bv.bv_offset;
@@ -310,7 +311,8 @@ static int bio_integrity_setup(struct bio *bio)
 
/* Auto-generate integrity metadata if this is a write */
if (bio_data_dir(bio) == WRITE)
-   bio_integrity_process(bio, bi->profile->generate_fn);
+   bio_integrity_process(bio, &bio->bi_iter,
+ bi->profile->generate_fn);
 
return 0;
 }
@@ -376,9 +378,15 @@ static void bio_integrity_verify_fn(struct work_struct *work)
container_of(work, struct bio_integrity_payload, bip_work);
struct bio *bio = bip->bip_bio;
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-
-   bio->bi_error = bio_integrity_process(bio, bi->profile->verify_fn);
-
+   struct bvec_iter iter = bio->bi_iter;
+
+   /* At the moment verify is called, the bio's iterator has been
+* advanced during split and completion; rewind the iterator to
+* its original position. */
+   bio_rewind_iter(bio, &iter, iter.bi_done);
+   if (!bio->bi_error)
+   bio->bi_error = bio_integrity_process(bio, &iter,
+ bi->profile->verify_fn);
/* Restore original bio completion handler */
bio->bi_end_io = bip->bip_end_io;
bio_endio(bio);
-- 
2.9.3



[PATCH 5/9] bio-integrity: fold bio_integrity_enabled to bio_integrity_prep

2017-04-04 Thread Dmitry Monakhov
Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore its error code and fail the bio with EIO. Let's return the real
error to the upper layer, so the caller may react accordingly.

In fact no one wants to use bio_integrity_prep() without
bio_integrity_enabled(), so it is reasonable to fold them into one function.
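
The shape of the folded function can be sketched in plain C. This is an
illustration with hypothetical stand-ins for the bio fields (not the kernel
API): the old bio_integrity_enabled() checks become early "return 0"
(nothing-to-do) paths inside prep, and only the cases that previously
returned true fall through to the allocating setup step, whose real error
code is now propagated to the caller.

```c
#include <stdbool.h>

enum dir { DIR_READ, DIR_WRITE };

/* Hypothetical flattened stand-in for bio + blk_integrity state. */
struct fake_bio {
	enum dir dir;
	unsigned sectors;	/* bio_sectors(bio) */
	bool has_payload;	/* already protected? */
	bool dev_verify;	/* BLK_INTEGRITY_VERIFY set */
	bool dev_generate;	/* BLK_INTEGRITY_GENERATE set */
};

static int setup_err;	/* injected result of the allocating setup step */

static int integrity_setup(struct fake_bio *b) { (void)b; return setup_err; }

/* Folded prep: "not enabled" cases are successful no-ops, real errors
 * from setup reach the caller instead of being flattened to EIO. */
static int integrity_prep(struct fake_bio *b)
{
	if (!b->sectors || b->has_payload)
		return 0;
	if (b->dir == DIR_READ && b->dev_verify)
		return integrity_setup(b);
	if (b->dir == DIR_WRITE && b->dev_generate)
		return integrity_setup(b);
	return 0;
}
```

Callers then only need one call and one error check, rather than the
enabled-then-prep pair.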

Signed-off-by: Dmitry Monakhov 
---
 Documentation/block/data-integrity.txt |  3 --
 block/bio-integrity.c  | 88 ++
 block/blk-core.c   |  5 +-
 block/blk-mq.c |  8 +---
 drivers/nvdimm/blk.c   | 13 +
 drivers/nvdimm/btt.c   | 13 +
 include/linux/bio.h|  6 ---
 7 files changed, 55 insertions(+), 81 deletions(-)

diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index f56ec97..37ac837 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -202,9 +202,6 @@ will require extra work due to the application tag.
   added.  It is up to the caller to ensure that the bio does not
   change while I/O is in progress.
 
-  bio_integrity_prep() should only be called if
-  bio_integrity_enabled() returned 1.
-
 
 5.3 PASSING EXISTING INTEGRITY METADATA
 
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 6170dad..f2b9f09 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -160,44 +160,6 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
 EXPORT_SYMBOL(bio_integrity_add_page);
 
 /**
- * bio_integrity_enabled - Check whether integrity can be passed
- * @bio:   bio to check
- *
- * Description: Determines whether bio_integrity_prep() can be called
- * on this bio or not. bio data direction and target device must be
- * set prior to calling.  The functions honors the write_generate and
- * read_verify flags in sysfs.
- */
-bool bio_integrity_enabled(struct bio *bio)
-{
-   struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-
-   if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
-   return false;
-
-   if (!bio_sectors(bio))
-   return false;
-
-   /* Already protected? */
-   if (bio_integrity(bio))
-   return false;
-
-   if (bi == NULL)
-   return false;
-
-   if (bio_data_dir(bio) == READ && bi->profile->verify_fn != NULL &&
-   (bi->flags & BLK_INTEGRITY_VERIFY))
-   return true;
-
-   if (bio_data_dir(bio) == WRITE && bi->profile->generate_fn != NULL &&
-   (bi->flags & BLK_INTEGRITY_GENERATE))
-   return true;
-
-   return false;
-}
-EXPORT_SYMBOL(bio_integrity_enabled);
-
-/**
  * bio_integrity_intervals - Return number of integrity intervals for a bio
  * @bi:blk_integrity profile for device
  * @sectors:   Size of the bio in 512-byte sectors
@@ -259,7 +221,7 @@ static int bio_integrity_process(struct bio *bio,
 }
 
 /**
- * bio_integrity_prep - Prepare bio for integrity I/O
+ * bio_integrity_setup - Prepare bio for integrity I/O
  * @bio:   bio to prepare
  *
  * Description: Allocates a buffer for integrity metadata, maps the
@@ -269,7 +231,7 @@ static int bio_integrity_process(struct bio *bio,
  * block device's integrity function.  In the READ case, the buffer
  * will be prepared for DMA and a suitable end_io handler set up.
  */
-int bio_integrity_prep(struct bio *bio)
+static int bio_integrity_setup(struct bio *bio)
 {
struct bio_integrity_payload *bip;
struct blk_integrity *bi;
@@ -352,6 +314,52 @@ int bio_integrity_prep(struct bio *bio)
 
return 0;
 }
+/**
+ * bio_integrity_prep - Prepare bio for integrity I/O
+ * @bio:   bio to prepare
+ *
+ * Description:  Checks if the bio already has an integrity payload attached.
+ * If it does, the payload has been generated by another kernel subsystem,
+ * and we just pass it through. Otherwise allocates integrity payload.
+ */
+
+int bio_integrity_prep(struct bio *bio)
+{
+   int err;
+   struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+
+   if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
+   return 0;
+
+   if (!bio_sectors(bio))
+   return 0;
+
+   /* Already protected? */
+   if (bio_integrity(bio))
+   return 0;
+
+   if (bi == NULL)
+   return 0;
+
+   if (bio_data_dir(bio) == READ && bi->profile->verify_fn != NULL &&
+   (bi->flags & BLK_INTEGRITY_VERIFY))
+   goto need_prep;
+
+   if (bio_data_dir(bio) == WRITE && bi->profile->generate_fn != NULL &&
+   (bi->flags & BLK_INTEGRITY_GENERATE))
+   goto need_prep;
+
+   return 0;
+need_prep:
+   err = bio_

[PATCH 0/9] block: T10/DIF Fixes and cleanups v3

2017-04-04 Thread Dmitry Monakhov
This patch set fixes various problems spotted during T10/DIF integrity
machinery testing.

TOC:
## Fix various bugs in T10/DIF/DIX infrastructure
0001-bio-integrity-Do-not-allocate-integrity-context-for
0002-bio-integrity-bio_trim-should-truncate-integrity-vec
0003-bio-integrity-bio_integrity_advance-must-update-inte
0004-bio-integrity-fix-interface-for-bio_integrity_trim
0005-bio-integrity-fold-bio_integrity_enabled-to-bio_integrity
## Cleanup T10/DIF/DIX infrastructure
0006-T10-Move-opencoded-contants-to-common-header
0007-Guard-bvec-iteration-logic-v3
0008-bio-add-bvec_iter-rewind-API
0009-bio-integrity-Restore-original-iterator-on-verify

Changes since V2
 - Cleanups in response to comments from axboe@ and hch@
 - Use rewind bvec api to restore original vector on verify
 - fix one more bug with bip_seed update on bio split.
 
Changes since V1
 - fix issues spotted by kbuild bot
 - Replace BUG_ON with error logic for the 7th patch

Testcase: xfstest blockdev/003
https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85


[PATCH 2/9] bio-integrity: bio_trim should truncate integrity vector accordingly

2017-04-04 Thread Dmitry Monakhov
Reviewed-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index e75878f..fa84323 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1907,6 +1907,10 @@ void bio_trim(struct bio *bio, int offset, int size)
bio_advance(bio, offset << 9);
 
bio->bi_iter.bi_size = size;
+
+   if (bio_integrity(bio))
+   bio_integrity_trim(bio, 0, size);
+
 }
 EXPORT_SYMBOL_GPL(bio_trim);
 
-- 
2.9.3




Re: [PATCH 7/7] Guard bvec iteration logic v2

2017-04-04 Thread Dmitry Monakhov
Ming Lei <tom.leim...@gmail.com> writes:

> On Mon, Apr 3, 2017 at 3:23 PM, Dmitry Monakhov <dmonak...@openvz.org> wrote:
>> Currently if someone tries to advance a bvec beyond its size we simply
>> dump a WARN_ONCE and continue to iterate beyond the bvec array
>> boundaries. This simply means that we end up dereferencing/corrupting a
>> random memory region.
>>
>> A sane reaction would be to propagate the error back to the calling
>> context, but bvec_iter_advance's calling context is not always good for
>> error handling. For safety reasons let's truncate the iterator size to
>> zero, which
>
> IMO, we can avoid continuing to iterate by checking the return value,
> and it looks rude to just set the iterator size to 0.
But the situation itself is horrible already. IMHO this is a BUG_ON
situation, but since Linus hates BUG_ONs, I tried to replace it with
something loud but very safe. Since there is no guarantee that the caller
won't ignore the error and try to dereference the bvec, the only safe
thing we can do is to clamp the iterator to zero to prevent any possible
use in the future.
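
A userspace sketch of that guard (illustrative names, not the kernel
bvec_iter API): instead of silently iterating past the end, an
over-advance returns -EINVAL and clamps the remaining size to zero, so any
external iteration loop terminates immediately even if the caller ignores
the error.

```c
#include <stddef.h>

struct toy_iter {
	size_t bi_size;		/* bytes remaining in the vector */
};

static int toy_iter_advance(struct toy_iter *iter, size_t bytes)
{
	if (bytes > iter->bi_size) {
		iter->bi_size = 0;	/* clamp: no further use possible */
		return -22;		/* -EINVAL */
	}
	iter->bi_size -= bytes;
	return 0;
}
```

The clamp is the belt-and-braces part: even a caller that drops the return
value can no longer walk off the end of its own bvec array.
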
>
>> will break the external iteration loop and prevent unpredictable
>> memory range corruption. And even if the caller ignores the error, it
>> will only corrupt its own bvecs, not others'.
>>
>> This patch does:
>> - Return error back to caller with hope that it will react on this
>> - Truncate iterator size
>>
>> The code was added long ago in 4550dd6c; luckily no one has hit it
>> in real life :)
>>
>> changes since V1:
>>  - Replace  BUG_ON with error logic.
>>
>> Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
>> ---
>>  drivers/nvdimm/blk.c |  4 +++-
>>  drivers/nvdimm/btt.c |  4 +++-
>>  include/linux/bio.h  |  8 ++--
>>  include/linux/bvec.h | 11 ---
>>  4 files changed, 20 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
>> index 1edb3f3..04c3075 100644
>> --- a/drivers/nvdimm/blk.c
>> +++ b/drivers/nvdimm/blk.c
>> @@ -106,7 +106,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk 
>> *nsblk,
>>
>> len -= cur_len;
>> dev_offset += cur_len;
>> -   bvec_iter_advance(bip->bip_vec, >bip_iter, cur_len);
>> +   err = bvec_iter_advance(bip->bip_vec, >bip_iter, 
>> cur_len);
>> +   if (err)
>> +   return err;
>> }
>>
>> return err;
>> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
>> index 03ded8d..3f3aa7b 100644
>> --- a/drivers/nvdimm/btt.c
>> +++ b/drivers/nvdimm/btt.c
>> @@ -942,7 +942,9 @@ static int btt_rw_integrity(struct btt *btt, struct 
>> bio_integrity_payload *bip,
>>
>> len -= cur_len;
>> meta_nsoff += cur_len;
>> -   bvec_iter_advance(bip->bip_vec, >bip_iter, cur_len);
>> +   ret = bvec_iter_advance(bip->bip_vec, >bip_iter, 
>> cur_len);
>> +   if (ret)
>> +   return ret;
>> }
>>
>> return ret;
>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>> index 0c1c95c..8bf1564 100644
>> --- a/include/linux/bio.h
>> +++ b/include/linux/bio.h
>> @@ -168,8 +168,12 @@ static inline void bio_advance_iter(struct bio *bio, 
>> struct bvec_iter *iter,
>>
>> if (bio_no_advance_iter(bio))
>> iter->bi_size -= bytes;
>> -   else
>> -   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
>> +   else {
>> +   int err;
>> +   err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
>> +   if (unlikely(err))
>> +   bio->bi_error = err;
>> +   }
>>  }
>>
>>  #define __bio_for_each_segment(bvl, bio, iter, start)  \
>> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
>> index 89b65b8..c117f1a 100644
>> --- a/include/linux/bvec.h
>> +++ b/include/linux/bvec.h
>> @@ -22,6 +22,7 @@
>>
>>  #include 
>>  #include 
>> +#include 
>>
>>  /*
>>   * was unsigned short, but we might as well be ready for > 64kB I/O pages
>> @@ -66,12 +67,15 @@ struct bvec_iter {
>> .bv_offset  = bvec_iter_offset((bvec), (iter)), \
>>  })
>>
>> -static inline void bvec_iter_advance(const struct bio_vec *bv,
>> +static inline int bvec_iter_advance(const struct bio_vec *bv,
>>  

Re: [PATCH 2/7] bio-integrity: save original iterator for verify stage

2017-04-04 Thread Dmitry Monakhov
Christoph Hellwig  writes:

> This is a pretty big increase in the bio_integrity_payload size,
> but I guess we can't get around it..

Yes, everybody hates this solution, me too, but I started with another
approach and it turned out to be very ugly.

My idea was that we have two types of iterator incrementors: bio_advance()
and bio_xx_complete(). The first is called during split, the latter on
completion ( req_bio_endio() ). So we can add a new field "bi_done" to the
iterator, which has a similar meaning to bi_bvec_done but at full iterator
scope. It is incremented during completion, but before end_io.
Chained bios will propagate bi_done up to the parent bio.
In ->verify_fn() the iterator will be rewound (the counterpart of
bvec_advance) by iter->bi_done bytes, so we get the original iterator
back. I even prepared a patch for this idea and it looks big and awful.
Even worse, it does not work if chained bios overlap (raid1, raid10,
etc).

But... by the time I wrote this email I realized that I do not care about
what happens with chained bios. The only thing that is important is the
parent bio and how far it was advanced. If bi_done is incremented inside
bvec_iter_advance() I can be sure that at the moment ->bi_end_io() is
called, the original position can be restored by rewinding back bi_done
bytes.

I'll try to implement this.

>
> Reviewed-by: Christoph Hellwig 



[PATCH 1/7] bio-integrity: Do not allocate integrity context for bio w/o data

2017-04-03 Thread Dmitry Monakhov
If a bio has no data, such as the ones from blkdev_issue_flush(),
then we have nothing to protect.

This patch prevents a BUG_ON like the following:

kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
kernel BUG at mm/slab.c:2773!
invalid opcode:  [#1] SMP
Modules linked in: bcache
CPU: 0 PID: 4428 Comm: xfs_io Tainted: GW   
4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
task: 880137786440 task.stack: c9ba8000
RIP: 0010:kfree_debugcheck+0x25/0x2a
RSP: 0018:c9babde0 EFLAGS: 00010082
RAX: 0034 RBX: ac1fa1d106742a5a RCX: 0007
RDX:  RSI:  RDI: 88013f3ccb40
RBP: c9babde8 R08:  R09: 
R10: fcb76420 R11: 725172ed R12: 0282
R13: 8150e766 R14: 88013a145e00 R15: 0001
FS:  7fb09384bf40() GS:88013f20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fd0172f9e40 CR3: 000137fa9000 CR4: 06f0
Call Trace:
 kfree+0xc8/0x1b3
 bio_integrity_free+0xc3/0x16b
 bio_free+0x25/0x66
 bio_put+0x14/0x26
 blkdev_issue_flush+0x7a/0x85
 blkdev_fsync+0x35/0x42
 vfs_fsync_range+0x8e/0x9f
 vfs_fsync+0x1c/0x1e
 do_fsync+0x31/0x4a
 SyS_fsync+0x10/0x14
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio-integrity.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5384713..b5009a8 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -175,6 +175,9 @@ bool bio_integrity_enabled(struct bio *bio)
if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
return false;
 
+   if (!bio_sectors(bio))
+   return false;
+
/* Already protected? */
if (bio_integrity(bio))
return false;
-- 
2.9.3




[PATCH 3/7] bio-integrity: bio_trim should truncate integrity vector accordingly

2017-04-03 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index e75878f..fa84323 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1907,6 +1907,10 @@ void bio_trim(struct bio *bio, int offset, int size)
bio_advance(bio, offset << 9);
 
bio->bi_iter.bi_size = size;
+
+   if (bio_integrity(bio))
+   bio_integrity_trim(bio, 0, size);
+
 }
 EXPORT_SYMBOL_GPL(bio_trim);
 
-- 
2.9.3





[PATCH 4/7] bio-integrity: fix interface for bio_integrity_trim

2017-04-03 Thread Dmitry Monakhov
bio_integrity_trim inherited its interface from bio_trim and accepts an
offset and a size, but this API is error-prone because the data offset
must always be in sync with the bio's data offset. That is why we have
the integrity update hook in bio_advance().

So the only meaningful offset is 0. Let's just remove it completely.
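
Why the offset is redundant can be sketched in plain C. This is an
illustration with made-up names and a made-up per-sector metadata size, not
the kernel API: an advance hook moves the data length and the integrity
length together (as bio_advance() does), so by the time trim runs the
integrity vector is already positioned and only its length needs
truncating.

```c
#include <stddef.h>

enum { INTEGRITY_BYTES_PER_SECTOR = 8 };	/* assumed, e.g. one DIF tuple */

struct split {
	size_t data_sectors;	/* data length, in 512-byte sectors */
	size_t int_bytes;	/* integrity vector length, in bytes */
};

/* Advance hook: data and integrity always move in lockstep. */
static void split_advance(struct split *s, size_t sectors)
{
	s->data_sectors -= sectors;
	s->int_bytes -= sectors * INTEGRITY_BYTES_PER_SECTOR;
}

/* Trim only needs the new length; any offset was already consumed above. */
static void split_trim(struct split *s, size_t sectors)
{
	s->data_sectors = sectors;
	s->int_bytes = sectors * INTEGRITY_BYTES_PER_SECTOR;
}
```

Since every advance already truncated both sides consistently, a trim that
also took an offset could only re-advance the integrity side out of sync
with the data side.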

TODO: add xfstests testcase here
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio-integrity.c | 8 +---
 block/bio.c   | 4 ++--
 drivers/md/dm.c   | 2 +-
 include/linux/bio.h   | 5 ++---
 4 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 43a4476..43895a0 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -433,21 +433,15 @@ EXPORT_SYMBOL(bio_integrity_advance);
 /**
  * bio_integrity_trim - Trim integrity vector
  * @bio:   bio whose integrity vector to update
- * @offset:offset to first data sector
  * @sectors:   number of data sectors
  *
  * Description: Used to trim the integrity vector in a cloned bio.
- * The ivec will be advanced corresponding to 'offset' data sectors
- * and the length will be truncated corresponding to 'len' data
- * sectors.
  */
-void bio_integrity_trim(struct bio *bio, unsigned int offset,
-   unsigned int sectors)
+void bio_integrity_trim(struct bio *bio, unsigned int sectors)
 {
struct bio_integrity_payload *bip = bio_integrity(bio);
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 
-   bio_integrity_advance(bio, offset << 9);
bip->bip_iter.bi_size = bio_integrity_bytes(bi, sectors);
 }
 EXPORT_SYMBOL(bio_integrity_trim);
diff --git a/block/bio.c b/block/bio.c
index fa84323..6895986 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1878,7 +1878,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
split->bi_iter.bi_size = sectors << 9;
 
if (bio_integrity(split))
-   bio_integrity_trim(split, 0, sectors);
+   bio_integrity_trim(split, sectors);
 
bio_advance(bio, split->bi_iter.bi_size);
 
@@ -1909,7 +1909,7 @@ void bio_trim(struct bio *bio, int offset, int size)
bio->bi_iter.bi_size = size;
 
if (bio_integrity(bio))
-   bio_integrity_trim(bio, 0, size);
+   bio_integrity_trim(bio, size);
 
 }
 EXPORT_SYMBOL_GPL(bio_trim);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb7597..e54ecdd 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1102,7 +1102,7 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio,
clone->bi_iter.bi_size = to_bytes(len);
 
if (bio_integrity(bio))
-   bio_integrity_trim(clone, 0, len);
+   bio_integrity_trim(clone, len);
 
return 0;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 00b086a..350f71d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -732,7 +732,7 @@ extern bool bio_integrity_enabled(struct bio *bio);
 extern int bio_integrity_prep(struct bio *);
 extern void bio_integrity_endio(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
-extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
+extern void bio_integrity_trim(struct bio *, unsigned int);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
 extern int bioset_integrity_create(struct bio_set *, int);
 extern void bioset_integrity_free(struct bio_set *);
@@ -782,8 +782,7 @@ static inline void bio_integrity_advance(struct bio *bio,
return;
 }
 
-static inline void bio_integrity_trim(struct bio *bio, unsigned int offset,
- unsigned int sectors)
+static inline void bio_integrity_trim(struct bio *bio, unsigned int sectors)
 {
return;
 }
-- 
2.9.3



 extern int bioset_integrity_create(struct bio_set *, int);
 extern void bioset_integrity_free(struct bio_set *);
@@ -782,8 +782,7 @@ static inline void bio_integrity_advance(struct bio *bio,
return;
 }
 
-static inline void bio_integrity_trim(struct bio *bio, unsigned int offset,
- unsigned int sectors)
+static inline void bio_integrity_trim(struct bio *bio, unsigned int sectors)
 {
return;
 }
-- 
2.9.3



[PATCH 2/7] bio-integrity: save original iterator for verify stage

2017-04-03 Thread Dmitry Monakhov
In order to perform verification we need to know the original data vector.
But after a bio traverses the I/O stack it may be advanced, split and
relocated many times, so it is hard to guess the original data vector.

In fact, ->verify_fn currently does not work at all, because by the time
it is called bio->bi_iter.bi_size == 0.

The simplest way to fix that is to save the original data vector and
treat it as immutable.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio-integrity.c | 6 --
 include/linux/bio.h   | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b5009a8..43a4476 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -238,10 +238,10 @@ static int bio_integrity_process(struct bio *bio,
 
iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
iter.interval = 1 << bi->interval_exp;
-   iter.seed = bip_get_seed(bip);
+   iter.seed = bip->bip_verify_iter.bi_sector;
iter.prot_buf = prot_buf;
 
-   bio_for_each_segment(bv, bio, bviter) {
+   __bio_for_each_segment(bv, bio, bviter, bip->bip_verify_iter) {
void *kaddr = kmap_atomic(bv.bv_page);
 
iter.data_buf = kaddr + bv.bv_offset;
@@ -310,6 +310,7 @@ int bio_integrity_prep(struct bio *bio)
bip->bip_flags |= BIP_BLOCK_INTEGRITY;
bip->bip_iter.bi_size = len;
bip_set_seed(bip, bio->bi_iter.bi_sector);
+   bip->bip_verify_iter = bio->bi_iter;
 
if (bi->flags & BLK_INTEGRITY_IP_CHECKSUM)
bip->bip_flags |= BIP_IP_CHECKSUM;
@@ -476,6 +477,7 @@ int bio_integrity_clone(struct bio *bio, struct bio 
*bio_src,
 
bip->bip_vcnt = bip_src->bip_vcnt;
bip->bip_iter = bip_src->bip_iter;
+   bip->bip_verify_iter = bip_src->bip_verify_iter;
 
return 0;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..00b086a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -308,6 +308,7 @@ struct bio_integrity_payload {
struct bio  *bip_bio;   /* parent bio */
 
struct bvec_iterbip_iter;
+   struct bvec_iterbip_verify_iter;/* saved orig data iterator */
 
bio_end_io_t*bip_end_io;/* saved I/O completion fn */
 
-- 
2.9.3



[PATCH 5/7] bio-integrity: add bio_integrity_setup helper

2017-04-03 Thread Dmitry Monakhov
Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore its error code and fail the bio with EIO. Let's return the real
error to the upper layer so the caller can react accordingly, for example
by retrying in case of ENOMEM.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/blk-core.c |  5 +
 block/blk-mq.c   |  8 ++--
 drivers/nvdimm/blk.c | 13 ++---
 drivers/nvdimm/btt.c | 13 ++---
 include/linux/bio.h  | 25 +
 5 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c22..071a998 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1637,11 +1637,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
 
	blk_queue_split(q, &bio, q->bio_split);
 
-   if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-   bio->bi_error = -EIO;
-   bio_endio(bio);
+   if (bio_integrity_setup(bio))
return BLK_QC_T_NONE;
-   }
 
if (op_is_flush(bio->bi_opf)) {
spin_lock_irq(q->queue_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 08a49c6..a9931ec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1489,10 +1489,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
 
	blk_queue_bounce(q, &bio);
 
-   if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-   bio_io_error(bio);
+   if (bio_integrity_setup(bio))
return BLK_QC_T_NONE;
-   }
 
	blk_queue_split(q, &bio, q->bio_split);
 
@@ -1611,10 +1609,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue 
*q, struct bio *bio)
 
	blk_queue_bounce(q, &bio);
 
-   if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-   bio_io_error(bio);
+   if (bio_integrity_setup(bio))
return BLK_QC_T_NONE;
-   }
 
	blk_queue_split(q, &bio, q->bio_split);
 
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 9faaa96..1edb3f3 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -179,16 +179,8 @@ static blk_qc_t nd_blk_make_request(struct request_queue 
*q, struct bio *bio)
int err = 0, rw;
bool do_acct;
 
-   /*
-* bio_integrity_enabled also checks if the bio already has an
-* integrity payload attached. If it does, we *don't* do a
-* bio_integrity_prep here - the payload has been generated by
-* another kernel subsystem, and we just pass it through.
-*/
-   if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-   bio->bi_error = -EIO;
-   goto out;
-   }
+   if (bio_integrity_setup(bio))
+   return BLK_QC_T_NONE;
 
bip = bio_integrity(bio);
nsblk = q->queuedata;
@@ -212,7 +204,6 @@ static blk_qc_t nd_blk_make_request(struct request_queue 
*q, struct bio *bio)
if (do_acct)
nd_iostat_end(bio, start);
 
- out:
bio_endio(bio);
return BLK_QC_T_NONE;
 }
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 368795a..03ded8d 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1158,16 +1158,8 @@ static blk_qc_t btt_make_request(struct request_queue 
*q, struct bio *bio)
int err = 0;
bool do_acct;
 
-   /*
-* bio_integrity_enabled also checks if the bio already has an
-* integrity payload attached. If it does, we *don't* do a
-* bio_integrity_prep here - the payload has been generated by
-* another kernel subsystem, and we just pass it through.
-*/
-   if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-   bio->bi_error = -EIO;
-   goto out;
-   }
+   if (bio_integrity_setup(bio))
+   return BLK_QC_T_NONE;
 
	do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter) {
@@ -1194,7 +1186,6 @@ static blk_qc_t btt_make_request(struct request_queue *q, 
struct bio *bio)
if (do_acct)
nd_iostat_end(bio, start);
 
-out:
bio_endio(bio);
return BLK_QC_T_NONE;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 350f71d..0c1c95c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -738,6 +738,26 @@ extern int bioset_integrity_create(struct bio_set *, int);
 extern void bioset_integrity_free(struct bio_set *);
 extern void bio_integrity_init(void);
 
+static inline int bio_integrity_setup(struct bio *bio)
+{
+   int err = 0;
+
+   /*
+* bio_integrity_enabled also checks if the bio already has an
+* integrity payload attached. If it does, we *don't* do a
+* bio_integrity_prep here - the payload has been generated by
+* another kernel subsystem, and we just pass it through.
+*/
+   if (b


[PATCH 0/7] block: T10/DIF Fixes and cleanups v2

2017-04-03 Thread Dmitry Monakhov
This patch set fixes various problems spotted during T10/DIF integrity
machinery testing.

TOC:
## Fix various bugs in T10/DIF/DIX infrastructure
0001-bio-integrity-Do-not-allocate-integrity-context-for-fsync
0002-bio-integrity-save-original-iterator-for-verify-stage
0003-bio-integrity-bio_trim-should-truncate-integrity-vec
0004-bio-integrity-fix-interface-for-bio_integrity_trim
## Cleanup T10/DIF/DIX infrastructure
0005-bio-integrity-add-bio_integrity_setup-helper
0006-T10-Move-opencoded-contants-to-common-header
## General bulletproof protection for block layer
0007-Guard-bvec-iteration-logic-v2

Changes since V1
 - fix issues spotted by kbuild bot
 - Replace BUG_ON with error logic for the 7th patch

Testcase: xfstest blockdev/003
https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85



[PATCH 6/7] T10: Move open-coded constants to common header

2017-04-03 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/t10-pi.c   | 9 +++--
 drivers/scsi/lpfc/lpfc_scsi.c| 5 +++--
 drivers/scsi/qla2xxx/qla_isr.c   | 8 
 drivers/target/target_core_sbc.c | 2 +-
 include/linux/t10-pi.h   | 3 +++
 5 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index 2c97912..485cecd 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -28,9 +28,6 @@
 
 typedef __be16 (csum_fn) (void *, unsigned int);
 
-static const __be16 APP_ESCAPE = (__force __be16) 0xffff;
-static const __be32 REF_ESCAPE = (__force __be32) 0xffffffff;
-
 static __be16 t10_pi_crc_fn(void *data, unsigned int len)
 {
return cpu_to_be16(crc_t10dif(data, len));
@@ -82,7 +79,7 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, 
csum_fn *fn,
switch (type) {
case 1:
case 2:
-   if (pi->app_tag == APP_ESCAPE)
+   if (pi->app_tag == T10_APP_ESCAPE)
goto next;
 
if (be32_to_cpu(pi->ref_tag) !=
@@ -95,8 +92,8 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, 
csum_fn *fn,
}
break;
case 3:
-   if (pi->app_tag == APP_ESCAPE &&
-   pi->ref_tag == REF_ESCAPE)
+   if (pi->app_tag == T10_APP_ESCAPE &&
+   pi->ref_tag == T10_REF_ESCAPE)
goto next;
break;
}
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 54fd0c8..6f6b40e 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/t10-pi.h>
 #include 
 #include 
 
@@ -2934,8 +2935,8 @@ lpfc_calc_bg_err(struct lpfc_hba *phba, struct 
lpfc_scsi_buf *lpfc_cmd)
 * First check to see if a protection data
 * check is valid
 */
-   if ((src->ref_tag == 0xffffffff) ||
-   (src->app_tag == 0xffff)) {
+   if ((src->ref_tag == T10_REF_ESCAPE) ||
+   (src->app_tag == T10_APP_ESCAPE)) {
start_ref_tag++;
goto skipit;
}
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 3203367..ed4b302 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -1950,9 +1950,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx 
*sts24)
 * For type 3: ref & app tag is all 'f's
 * For type 0,1,2: app tag is all 'f's
 */
-   if ((a_app_tag == 0xffff) &&
+   if ((a_app_tag == T10_APP_ESCAPE) &&
((scsi_get_prot_type(cmd) != SCSI_PROT_DIF_TYPE3) ||
-(a_ref_tag == 0xffffffff))) {
+(a_ref_tag == T10_REF_ESCAPE))) {
uint32_t blocks_done, resid;
sector_t lba_s = scsi_get_lba(cmd);
 
@@ -1994,9 +1994,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx 
*sts24)
spt = page_address(sg_page(sg)) + sg->offset;
spt += j;
 
-   spt->app_tag = 0xffff;
+   spt->app_tag = T10_APP_ESCAPE;
if (scsi_get_prot_type(cmd) == SCSI_PROT_DIF_TYPE3)
-   spt->ref_tag = 0xffffffff;
+   spt->ref_tag = T10_REF_ESCAPE;
}
 
return 0;
diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index c194063..927ef44 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -1446,7 +1446,7 @@ sbc_dif_verify(struct se_cmd *cmd, sector_t start, 
unsigned int sectors,
 (unsigned long long)sector, sdt->guard_tag,
 sdt->app_tag, be32_to_cpu(sdt->ref_tag));
 
-   if (sdt->app_tag == cpu_to_be16(0xffff)) {
+   if (sdt->app_tag == T10_APP_ESCAPE) {
dsg_off += block_size;
goto next;
}
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index 9fba9dd..c96845c 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -24,6 +24,9 @@ enum t10_dif_type {
T10_PI_TYPE3_PROTECTION = 0x3,
 };
 
+static const __be16 T10_APP_ESCAPE = (__force __be16) 0xffff;
+static const __be32 T10_REF_ESCAPE = (__force __be32) 0xffffffff;
+
 /*
  * T10 Protection Information tuple.
  */
-- 
2.9.3



[PATCH 7/7] Guard bvec iteration logic v2

2017-04-03 Thread Dmitry Monakhov
Currently, if someone tries to advance a bvec iterator beyond its size we
simply dump a WARN_ONCE and continue to iterate beyond the bvec array
boundaries. This means we end up dereferencing/corrupting a random memory
region.

The sane reaction would be to propagate the error back to the calling
context, but bvec_iter_advance's calling context is not always suited for
error handling. For safety's sake, let's truncate the iterator size to
zero, which breaks the external iteration loop and prevents unpredictable
memory-range corruption. And even if the caller ignores the error, it will
corrupt its own bvecs, not anyone else's.

This patch does:
- Return an error back to the caller, in the hope that it will react to it
- Truncate the iterator size

The code was added long ago in 4550dd6c; luckily no one has hit it
in real life :)

changes since V1:
 - Replace  BUG_ON with error logic.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/nvdimm/blk.c |  4 +++-
 drivers/nvdimm/btt.c |  4 +++-
 include/linux/bio.h  |  8 ++--
 include/linux/bvec.h | 11 ---
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 1edb3f3..04c3075 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -106,7 +106,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk 
*nsblk,
 
len -= cur_len;
dev_offset += cur_len;
-   bvec_iter_advance(bip->bip_vec, >bip_iter, cur_len);
+   err = bvec_iter_advance(bip->bip_vec, >bip_iter, cur_len);
+   if (err)
+   return err;
}
 
return err;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 03ded8d..3f3aa7b 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -942,7 +942,9 @@ static int btt_rw_integrity(struct btt *btt, struct 
bio_integrity_payload *bip,
 
len -= cur_len;
meta_nsoff += cur_len;
-   bvec_iter_advance(bip->bip_vec, >bip_iter, cur_len);
+   ret = bvec_iter_advance(bip->bip_vec, >bip_iter, cur_len);
+   if (ret)
+   return ret;
}
 
return ret;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 0c1c95c..8bf1564 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -168,8 +168,12 @@ static inline void bio_advance_iter(struct bio *bio, 
struct bvec_iter *iter,
 
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
-   else
-   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   else {
+   int err;
+   err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   if (unlikely(err))
+   bio->bi_error = err;
+   }
 }
 
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 89b65b8..c117f1a 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -22,6 +22,7 @@
 
 #include 
 #include 
+#include <linux/errno.h>
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -66,12 +67,15 @@ struct bvec_iter {
.bv_offset  = bvec_iter_offset((bvec), (iter)), \
 })
 
-static inline void bvec_iter_advance(const struct bio_vec *bv,
+static inline int bvec_iter_advance(const struct bio_vec *bv,
 struct bvec_iter *iter,
 unsigned bytes)
 {
-   WARN_ONCE(bytes > iter->bi_size,
- "Attempted to advance past end of bvec iter\n");
+   if(unlikely(bytes > iter->bi_size)) {
+   WARN(1, "Attempted to advance past end of bvec iter\n");
+   iter->bi_size = 0;
+   return -EINVAL;
+   }
 
while (bytes) {
unsigned iter_len = bvec_iter_len(bv, *iter);
@@ -86,6 +90,7 @@ static inline void bvec_iter_advance(const struct bio_vec *bv,
iter->bi_idx++;
}
}
+   return 0;
 }
 
 #define for_each_bvec(bvl, bio_vec, iter, start)   \
-- 
2.9.3



[PATCH 0/7] block: T10/DIF Fixes and cleanups v2

2017-04-03 Thread Dmitry Monakhov
This patch set fix various problems spotted during T10/DIF integrity machinery 
testing.

TOC:
## Fix various bugs in T10/DIF/DIX infrastructure
0001-bio-integrity-Do-not-allocate-integrity-context-for-fsync
0002-bio-integrity-save-original-iterator-for-verify-stage
0003-bio-integrity-bio_trim-should-truncate-integrity-vec
0004-bio-integrity-fix-interface-for-bio_integrity_trim
## Cleanup T10/DIF/DIX infrastructure
0005-bio-integrity-add-bio_integrity_setup-helper
0006-T10-Move-opencoded-contants-to-common-header
## General bulletproof protection for block layer
0007-Guard-bvec-iteration-logic-v2

Changes since V1
 - fix issues potted by kbuild bot
 - Replace BUG_ON with error logic for 7'th patch

Testcase: xfstest blockdev/003
https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85



[PATCH 6/7] T10: Move opencoded contants to common header

2017-04-03 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov 
---
 block/t10-pi.c   | 9 +++--
 drivers/scsi/lpfc/lpfc_scsi.c| 5 +++--
 drivers/scsi/qla2xxx/qla_isr.c   | 8 
 drivers/target/target_core_sbc.c | 2 +-
 include/linux/t10-pi.h   | 3 +++
 5 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index 2c97912..485cecd 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -28,9 +28,6 @@
 
 typedef __be16 (csum_fn) (void *, unsigned int);
 
-static const __be16 APP_ESCAPE = (__force __be16) 0x;
-static const __be32 REF_ESCAPE = (__force __be32) 0x;
-
 static __be16 t10_pi_crc_fn(void *data, unsigned int len)
 {
return cpu_to_be16(crc_t10dif(data, len));
@@ -82,7 +79,7 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, 
csum_fn *fn,
switch (type) {
case 1:
case 2:
-   if (pi->app_tag == APP_ESCAPE)
+   if (pi->app_tag == T10_APP_ESCAPE)
goto next;
 
if (be32_to_cpu(pi->ref_tag) !=
@@ -95,8 +92,8 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, 
csum_fn *fn,
}
break;
case 3:
-   if (pi->app_tag == APP_ESCAPE &&
-   pi->ref_tag == REF_ESCAPE)
+   if (pi->app_tag == T10_APP_ESCAPE &&
+   pi->ref_tag == T10_REF_ESCAPE)
goto next;
break;
}
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 54fd0c8..6f6b40e 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -2934,8 +2935,8 @@ lpfc_calc_bg_err(struct lpfc_hba *phba, struct 
lpfc_scsi_buf *lpfc_cmd)
 * First check to see if a protection data
 * check is valid
 */
-   if ((src->ref_tag == 0x) ||
-   (src->app_tag == 0x)) {
+   if ((src->ref_tag == T10_REF_ESCAPE) ||
+   (src->app_tag == T10_APP_ESCAPE)) {
start_ref_tag++;
goto skipit;
}
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 3203367..ed4b302 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -1950,9 +1950,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx 
*sts24)
 * For type 3: ref & app tag is all 'f's
 * For type 0,1,2: app tag is all 'f's
 */
-   if ((a_app_tag == 0x) &&
+   if ((a_app_tag == T10_APP_ESCAPE) &&
((scsi_get_prot_type(cmd) != SCSI_PROT_DIF_TYPE3) ||
-(a_ref_tag == 0x))) {
+(a_ref_tag == T10_REF_ESCAPE))) {
uint32_t blocks_done, resid;
sector_t lba_s = scsi_get_lba(cmd);
 
@@ -1994,9 +1994,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx 
*sts24)
spt = page_address(sg_page(sg)) + sg->offset;
spt += j;
 
-   spt->app_tag = 0x;
+   spt->app_tag = T10_APP_ESCAPE;
if (scsi_get_prot_type(cmd) == SCSI_PROT_DIF_TYPE3)
-   spt->ref_tag = 0x;
+   spt->ref_tag = T10_REF_ESCAPE;
}
 
return 0;
diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index c194063..927ef44 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -1446,7 +1446,7 @@ sbc_dif_verify(struct se_cmd *cmd, sector_t start, 
unsigned int sectors,
 (unsigned long long)sector, sdt->guard_tag,
 sdt->app_tag, be32_to_cpu(sdt->ref_tag));
 
-   if (sdt->app_tag == cpu_to_be16(0x)) {
+   if (sdt->app_tag == T10_APP_ESCAPE) {
dsg_off += block_size;
goto next;
}
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index 9fba9dd..c96845c 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -24,6 +24,9 @@ enum t10_dif_type {
T10_PI_TYPE3_PROTECTION = 0x3,
 };
 
+static const __be16 T10_APP_ESCAPE = (__force __be16) 0x;
+static const __be32 T10_REF_ESCAPE = (__force __be32) 0x;
+
 /*
  * T10 Protection Information tuple.
  */
-- 
2.9.3



[PATCH 7/7] Guard bvec iteration logic v2

2017-04-03 Thread Dmitry Monakhov
Currently if some one try to advance bvec beyond it's size we simply
dump WARN_ONCE and continue to iterate beyond bvec array boundaries.
This simply means that we endup dereferencing/corrupting random memory
region.

Sane reaction would be to propagate error back to calling context
But bvec_iter_advance's calling context is not always good for error
handling. For safity reason let truncate iterator size to zero which
will break external iteration loop which prevent us from unpredictable
memory range corruption. And even it caller ignores an error, it will
corrupt it's own bvecs, not others.

This patch does:
- Return error back to caller with hope that it will react on this
- Truncate iterator size

Code was added long time ago here 4550dd6c, luckily no one hit it
in real life :)

changes since V1:
 - Replace  BUG_ON with error logic.

Signed-off-by: Dmitry Monakhov 
---
 drivers/nvdimm/blk.c |  4 +++-
 drivers/nvdimm/btt.c |  4 +++-
 include/linux/bio.h  |  8 ++--
 include/linux/bvec.h | 11 ---
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 1edb3f3..04c3075 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -106,7 +106,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk 
*nsblk,
 
len -= cur_len;
dev_offset += cur_len;
-   bvec_iter_advance(bip->bip_vec, >bip_iter, cur_len);
+   err = bvec_iter_advance(bip->bip_vec, >bip_iter, cur_len);
+   if (err)
+   return err;
}
 
return err;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 03ded8d..3f3aa7b 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -942,7 +942,9 @@ static int btt_rw_integrity(struct btt *btt, struct 
bio_integrity_payload *bip,
 
len -= cur_len;
meta_nsoff += cur_len;
-   bvec_iter_advance(bip->bip_vec, >bip_iter, cur_len);
+   ret = bvec_iter_advance(bip->bip_vec, >bip_iter, cur_len);
+   if (ret)
+   return ret;
}
 
return ret;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 0c1c95c..8bf1564 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -168,8 +168,12 @@ static inline void bio_advance_iter(struct bio *bio, 
struct bvec_iter *iter,
 
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
-   else
-   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   else {
+   int err;
+   err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   if (unlikely(err))
+   bio->bi_error = err;
+   }
 }
 
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 89b65b8..c117f1a 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -22,6 +22,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -66,12 +67,15 @@ struct bvec_iter {
.bv_offset  = bvec_iter_offset((bvec), (iter)), \
 })
 
-static inline void bvec_iter_advance(const struct bio_vec *bv,
+static inline int bvec_iter_advance(const struct bio_vec *bv,
 struct bvec_iter *iter,
 unsigned bytes)
 {
-   WARN_ONCE(bytes > iter->bi_size,
- "Attempted to advance past end of bvec iter\n");
+   if(unlikely(bytes > iter->bi_size)) {
+   WARN(1, "Attempted to advance past end of bvec iter\n");
+   iter->bi_size = 0;
+   return -EINVAL;
+   }
 
while (bytes) {
unsigned iter_len = bvec_iter_len(bv, *iter);
@@ -86,6 +90,7 @@ static inline void bvec_iter_advance(const struct bio_vec *bv,
iter->bi_idx++;
}
}
+   return 0;
 }
 
 #define for_each_bvec(bvl, bio_vec, iter, start)   \
-- 
2.9.3
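The guarded-advance semantics above can be modeled in plain userspace C. This is a sketch, not the kernel code: `struct bio_vec` and `struct bvec_iter` are reduced to the fields the loop touches, `WARN`/`unlikely` are dropped, and `bvec_iter_len` is inlined.

```c
#include <assert.h>
#include <errno.h>

/* Minimal userspace model of the guarded bvec_iter_advance() above. */
struct bio_vec { unsigned bv_len; };
struct bvec_iter { unsigned bi_size, bi_idx, bi_bvec_done; };

static int bvec_iter_advance(const struct bio_vec *bv, struct bvec_iter *iter,
                             unsigned bytes)
{
	if (bytes > iter->bi_size) {
		/* instead of silently walking past the array, fail */
		iter->bi_size = 0;
		return -EINVAL;
	}

	while (bytes) {
		/* bytes left in the current segment */
		unsigned iter_len = bv[iter->bi_idx].bv_len - iter->bi_bvec_done;

		if (iter_len > bytes)
			iter_len = bytes;
		bytes -= iter_len;
		iter->bi_size -= iter_len;
		iter->bi_bvec_done += iter_len;
		if (iter->bi_bvec_done == bv[iter->bi_idx].bv_len) {
			iter->bi_bvec_done = 0;
			iter->bi_idx++;	/* move to the next segment */
		}
	}
	return 0;
}
```

With this model an over-advance leaves the iterator drained (`bi_size == 0`) and reports `-EINVAL`, which is exactly what lets the btt/bio callers above bail out instead of corrupting memory.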



[PATCH 3/8] bio-integrity: save original iterator for verify stage

2017-03-30 Thread Dmitry Monakhov
In order to perform verification we need to know the original data vector.
But after a bio traverses the I/O stack it may be advanced, split and
relocated many times, so it is hard to recover the original data vector.

In fact, ->verify_fn currently does not work at all, because
bio->bi_iter.bi_size == 0 at the moment it is called.

The simplest way to fix that is to save the original data vector and treat
it as immutable.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio-integrity.c | 6 --
 include/linux/bio.h   | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b5009a8..43a4476 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -238,10 +238,10 @@ static int bio_integrity_process(struct bio *bio,
 
iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
iter.interval = 1 << bi->interval_exp;
-   iter.seed = bip_get_seed(bip);
+   iter.seed = bip->bip_verify_iter.bi_sector;
iter.prot_buf = prot_buf;
 
-   bio_for_each_segment(bv, bio, bviter) {
+   __bio_for_each_segment(bv, bio, bviter, bip->bip_verify_iter) {
void *kaddr = kmap_atomic(bv.bv_page);
 
iter.data_buf = kaddr + bv.bv_offset;
@@ -310,6 +310,7 @@ int bio_integrity_prep(struct bio *bio)
bip->bip_flags |= BIP_BLOCK_INTEGRITY;
bip->bip_iter.bi_size = len;
bip_set_seed(bip, bio->bi_iter.bi_sector);
+   bip->bip_verify_iter = bio->bi_iter;
 
if (bi->flags & BLK_INTEGRITY_IP_CHECKSUM)
bip->bip_flags |= BIP_IP_CHECKSUM;
@@ -476,6 +477,7 @@ int bio_integrity_clone(struct bio *bio, struct bio 
*bio_src,
 
bip->bip_vcnt = bip_src->bip_vcnt;
bip->bip_iter = bip_src->bip_iter;
+   bip->bip_verify_iter = bip_src->bip_verify_iter;
 
return 0;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..00b086a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -308,6 +308,7 @@ struct bio_integrity_payload {
struct bio  *bip_bio;   /* parent bio */
 
struct bvec_iterbip_iter;
+   struct bvec_iterbip_verify_iter;/* saved orig data iterator */
 
bio_end_io_t*bip_end_io;/* saved I/O completion fn */
 
-- 
2.9.3
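The failure mode this patch fixes can be shown with a tiny userspace sketch (illustrative only; `struct iter` stands in for `bvec_iter`, and `advance` models the iterator consumption that happens while the bio travels down the stack):

```c
#include <assert.h>

/* Toy iterator: size counts unverified bytes, idx tracks position. */
struct iter { unsigned size, idx; };

static void advance(struct iter *it, unsigned n)
{
	it->size -= n;
	it->idx += n;
}

/* verify() walks whatever iterator it is handed, byte by byte,
 * and returns how many bytes it actually checked. */
static unsigned verify(struct iter it)
{
	unsigned checked = 0;

	while (it.size) {
		advance(&it, 1);
		checked++;
	}
	return checked;
}
```

The live iterator is empty by completion time, so verification over it is a no-op; replaying a copy saved at prep time (the `bip_verify_iter` analogue) checks the whole original vector.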



[PATCH 4/8] bio-integrity: bio_trim should truncate integrity vector accordingly

2017-03-30 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index e75878f..fa84323 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1907,6 +1907,10 @@ void bio_trim(struct bio *bio, int offset, int size)
bio_advance(bio, offset << 9);
 
bio->bi_iter.bi_size = size;
+
+   if (bio_integrity(bio))
+   bio_integrity_trim(bio, 0, size);
+
 }
 EXPORT_SYMBOL_GPL(bio_trim);
 
-- 
2.9.3



[PATCH 7/8] T10: Move open-coded constants to common header

2017-03-30 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/t10-pi.c   | 9 +++--
 drivers/scsi/lpfc/lpfc_scsi.c| 4 ++--
 drivers/scsi/qla2xxx/qla_isr.c   | 8 
 drivers/target/target_core_sbc.c | 2 +-
 include/linux/t10-pi.h   | 3 +++
 5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index 2c97912..485cecd 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -28,9 +28,6 @@
 
 typedef __be16 (csum_fn) (void *, unsigned int);
 
-static const __be16 APP_ESCAPE = (__force __be16) 0xffff;
-static const __be32 REF_ESCAPE = (__force __be32) 0xffffffff;
-
 static __be16 t10_pi_crc_fn(void *data, unsigned int len)
 {
return cpu_to_be16(crc_t10dif(data, len));
@@ -82,7 +79,7 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, 
csum_fn *fn,
switch (type) {
case 1:
case 2:
-   if (pi->app_tag == APP_ESCAPE)
+   if (pi->app_tag == T10_APP_ESCAPE)
goto next;
 
if (be32_to_cpu(pi->ref_tag) !=
@@ -95,8 +92,8 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, 
csum_fn *fn,
}
break;
case 3:
-   if (pi->app_tag == APP_ESCAPE &&
-   pi->ref_tag == REF_ESCAPE)
+   if (pi->app_tag == T10_APP_ESCAPE &&
+   pi->ref_tag == T10_REF_ESCAPE)
goto next;
break;
}
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 54fd0c8..6703512 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -2934,8 +2934,8 @@ lpfc_calc_bg_err(struct lpfc_hba *phba, struct 
lpfc_scsi_buf *lpfc_cmd)
 * First check to see if a protection data
 * check is valid
 */
-   if ((src->ref_tag == 0xffffffff) ||
-   (src->app_tag == 0xffff)) {
+   if ((src->ref_tag == T10_REF_ESCAPE) ||
+   (src->app_tag == T10_APP_ESCAPE)) {
start_ref_tag++;
goto skipit;
}
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 3203367..dfab093 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -1950,9 +1950,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx 
*sts24)
 * For type 3: ref & app tag is all 'f's
 * For type 0,1,2: app tag is all 'f's
 */
-   if ((a_app_tag == 0xffff) &&
+   if ((a_app_tag == T10_APP_ESCAPE) &&
    ((scsi_get_prot_type(cmd) != SCSI_PROT_DIF_TYPE3) ||
-    (a_ref_tag == 0xffffffff))) {
+    (a_ref_tag == T10_REF_ESCAPE))) {
uint32_t blocks_done, resid;
sector_t lba_s = scsi_get_lba(cmd);
 
@@ -1994,9 +1994,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx 
*sts24)
spt = page_address(sg_page(sg)) + sg->offset;
spt += j;
 
-   spt->app_tag = 0xffff;
+   spt->app_tag = T10_APP_ESCAPE;
    if (scsi_get_prot_type(cmd) == SCSI_PROT_DIF_TYPE3)
-   spt->ref_tag = 0xffffffff;
+   spt->ref_tag = T10_REF_ESCAPE;
}
 
return 0;
diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index c194063..927ef44 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -1446,7 +1446,7 @@ sbc_dif_verify(struct se_cmd *cmd, sector_t start, 
unsigned int sectors,
 (unsigned long long)sector, sdt->guard_tag,
 sdt->app_tag, be32_to_cpu(sdt->ref_tag));
 
-   if (sdt->app_tag == cpu_to_be16(0xffff)) {
+   if (sdt->app_tag == T10_APP_ESCAPE) {
dsg_off += block_size;
goto next;
}
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index 9fba9dd..c96845c 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -24,6 +24,9 @@ enum t10_dif_type {
T10_PI_TYPE3_PROTECTION = 0x3,
 };
 
+static const __be16 T10_APP_ESCAPE = (__force __be16) 0xffff;
+static const __be32 T10_REF_ESCAPE = (__force __be32) 0xffffffff;
+
 /*
  * T10 Protection Information tuple.
  */
-- 
2.9.3
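The escape-value convention the patch centralizes can be sketched in userspace C. This is a simplified model, not the kernel code: an app tag of `0xffff` (and, for DIF Type 3, a ref tag of `0xffffffff` as well) means "do not check protection information for this sector". Byte order is ignored here; the kernel uses `__be16`/`__be32`.

```c
#include <assert.h>
#include <stdint.h>

#define T10_APP_ESCAPE 0xffffu
#define T10_REF_ESCAPE 0xffffffffu

/* Returns nonzero when PI verification should run for this sector,
 * 0 when the escape values say to skip it (or the type is unprotected). */
static int pi_check_needed(int dif_type, uint16_t app_tag, uint32_t ref_tag)
{
	switch (dif_type) {
	case 1:
	case 2:
		/* types 1/2: app tag of all f's disables the check */
		return app_tag != T10_APP_ESCAPE;
	case 3:
		/* type 3: both app and ref tags must be all f's to skip */
		return !(app_tag == T10_APP_ESCAPE &&
			 ref_tag == T10_REF_ESCAPE);
	}
	return 0;	/* type 0: no protection information */
}
```

Consolidating the constants in one header is what keeps t10-pi.c, lpfc, qla2xxx and target_core_sbc agreeing on exactly this rule.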




[PATCH 5/8] bio-integrity: fix interface for bio_integrity_trim

2017-03-30 Thread Dmitry Monakhov
bio_integrity_trim inherited its interface from bio_trim and accepts an
offset and a size, but this API is error prone because the integrity
offset must always be in sync with the bio's data offset. That is why we
have an integrity update hook in bio_advance().

So the only meaningful offset is 0. Let's just remove it completely.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio-integrity.c | 8 +---
 block/bio.c   | 4 ++--
 drivers/md/dm.c   | 2 +-
 include/linux/bio.h   | 5 ++---
 4 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 43a4476..43895a0 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -433,21 +433,15 @@ EXPORT_SYMBOL(bio_integrity_advance);
 /**
  * bio_integrity_trim - Trim integrity vector
  * @bio:   bio whose integrity vector to update
- * @offset:offset to first data sector
  * @sectors:   number of data sectors
  *
  * Description: Used to trim the integrity vector in a cloned bio.
- * The ivec will be advanced corresponding to 'offset' data sectors
- * and the length will be truncated corresponding to 'len' data
- * sectors.
  */
-void bio_integrity_trim(struct bio *bio, unsigned int offset,
-   unsigned int sectors)
+void bio_integrity_trim(struct bio *bio, unsigned int sectors)
 {
struct bio_integrity_payload *bip = bio_integrity(bio);
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 
-   bio_integrity_advance(bio, offset << 9);
bip->bip_iter.bi_size = bio_integrity_bytes(bi, sectors);
 }
 EXPORT_SYMBOL(bio_integrity_trim);
diff --git a/block/bio.c b/block/bio.c
index fa84323..6895986 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1878,7 +1878,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
split->bi_iter.bi_size = sectors << 9;
 
if (bio_integrity(split))
-   bio_integrity_trim(split, 0, sectors);
+   bio_integrity_trim(split, sectors);
 
bio_advance(bio, split->bi_iter.bi_size);
 
@@ -1909,7 +1909,7 @@ void bio_trim(struct bio *bio, int offset, int size)
bio->bi_iter.bi_size = size;
 
if (bio_integrity(bio))
-   bio_integrity_trim(bio, 0, size);
+   bio_integrity_trim(bio, size);
 
 }
 EXPORT_SYMBOL_GPL(bio_trim);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb7597..e54ecdd 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1102,7 +1102,7 @@ static int clone_bio(struct dm_target_io *tio, struct bio 
*bio,
clone->bi_iter.bi_size = to_bytes(len);
 
if (bio_integrity(bio))
-   bio_integrity_trim(clone, 0, len);
+   bio_integrity_trim(clone, len);
 
return 0;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 00b086a..350f71d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -732,7 +732,7 @@ extern bool bio_integrity_enabled(struct bio *bio);
 extern int bio_integrity_prep(struct bio *);
 extern void bio_integrity_endio(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
-extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
+extern void bio_integrity_trim(struct bio *, unsigned int);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
 extern int bioset_integrity_create(struct bio_set *, int);
 extern void bioset_integrity_free(struct bio_set *);
@@ -782,8 +782,7 @@ static inline void bio_integrity_advance(struct bio *bio,
return;
 }
 
-static inline void bio_integrity_trim(struct bio *bio, unsigned int offset,
- unsigned int sectors)
+static inline void bio_integrity_trim(struct bio *bio, unsigned int sectors)
 {
return;
 }
-- 
2.9.3
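The size bookkeeping that `bio_integrity_trim()` still performs can be sketched as follows. This is a userspace model with illustrative values, not a real `blk_integrity` profile: one protection tuple (8 bytes for T10 DIF) covers each protection interval, and sectors are 512-byte units.

```c
#include <assert.h>

/* Model of bio_integrity_bytes(): how many integrity bytes correspond
 * to `sectors` data sectors for a given tuple size and interval. */
static unsigned bio_integrity_bytes(unsigned tuple_size, unsigned interval_exp,
                                    unsigned sectors)
{
	/* scale 512-byte sectors to protection intervals, then to bytes */
	return tuple_size * ((sectors << 9) >> interval_exp);
}
```

After the patch, trimming only shrinks `bip_iter.bi_size` to this value; advancing is left entirely to the `bio_advance()` hook, so the two offsets can no longer drift apart.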



[PATCH 1/8] Guard bvec iteration logic

2017-03-30 Thread Dmitry Monakhov
If someone attempts to advance a bvec iterator beyond its size, we simply
emit a WARN_ONCE and continue to iterate beyond the bvec array boundaries,
which means we end up dereferencing/corrupting a random memory region.

The code was added long ago in commit 4550dd6c; luckily no one has hit it
in real life :)

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 include/linux/bvec.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 89b65b8..86b914f 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -70,8 +70,7 @@ static inline void bvec_iter_advance(const struct bio_vec *bv,
 struct bvec_iter *iter,
 unsigned bytes)
 {
-   WARN_ONCE(bytes > iter->bi_size,
- "Attempted to advance past end of bvec iter\n");
+   BUG_ON(bytes > iter->bi_size);
 
while (bytes) {
unsigned iter_len = bvec_iter_len(bv, *iter);
-- 
2.9.3



[PATCH 6/8] bio-integrity: add bio_integrity_setup helper

2017-03-30 Thread Dmitry Monakhov
Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore its error code and fail the bio with EIO. Let's return the real
error to the upper layer so the caller may react accordingly, for example
retry in case of ENOMEM.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/blk-core.c |  5 +
 block/blk-mq.c   |  8 ++--
 drivers/nvdimm/blk.c | 13 ++---
 drivers/nvdimm/btt.c | 13 ++---
 include/linux/bio.h  | 25 +
 5 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c22..071a998 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1637,11 +1637,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
 
blk_queue_split(q, &bio, q->bio_split);
 
-   if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-   bio->bi_error = -EIO;
-   bio_endio(bio);
+   if (bio_integrity_setup(bio))
return BLK_QC_T_NONE;
-   }
 
if (op_is_flush(bio->bi_opf)) {
spin_lock_irq(q->queue_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 08a49c6..a9931ec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1489,10 +1489,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
 
blk_queue_bounce(q, &bio);
 
-   if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-   bio_io_error(bio);
+   if (bio_integrity_setup(bio))
return BLK_QC_T_NONE;
-   }
 
blk_queue_split(q, &bio, q->bio_split);
 
@@ -1611,10 +1609,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue 
*q, struct bio *bio)
 
blk_queue_bounce(q, &bio);
 
-   if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-   bio_io_error(bio);
+   if (bio_integrity_setup(bio))
return BLK_QC_T_NONE;
-   }
 
blk_queue_split(q, &bio, q->bio_split);
 
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 9faaa96..1edb3f3 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -179,16 +179,8 @@ static blk_qc_t nd_blk_make_request(struct request_queue 
*q, struct bio *bio)
int err = 0, rw;
bool do_acct;
 
-   /*
-* bio_integrity_enabled also checks if the bio already has an
-* integrity payload attached. If it does, we *don't* do a
-* bio_integrity_prep here - the payload has been generated by
-* another kernel subsystem, and we just pass it through.
-*/
-   if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-   bio->bi_error = -EIO;
-   goto out;
-   }
+   if (bio_integrity_setup(bio))
+   return BLK_QC_T_NONE;
 
bip = bio_integrity(bio);
nsblk = q->queuedata;
@@ -212,7 +204,6 @@ static blk_qc_t nd_blk_make_request(struct request_queue 
*q, struct bio *bio)
if (do_acct)
nd_iostat_end(bio, start);
 
- out:
bio_endio(bio);
return BLK_QC_T_NONE;
 }
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 368795a..03ded8d 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1158,16 +1158,8 @@ static blk_qc_t btt_make_request(struct request_queue 
*q, struct bio *bio)
int err = 0;
bool do_acct;
 
-   /*
-* bio_integrity_enabled also checks if the bio already has an
-* integrity payload attached. If it does, we *don't* do a
-* bio_integrity_prep here - the payload has been generated by
-* another kernel subsystem, and we just pass it through.
-*/
-   if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-   bio->bi_error = -EIO;
-   goto out;
-   }
+   if (bio_integrity_setup(bio))
+   return BLK_QC_T_NONE;
 
do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter) {
@@ -1194,7 +1186,6 @@ static blk_qc_t btt_make_request(struct request_queue *q, 
struct bio *bio)
if (do_acct)
nd_iostat_end(bio, start);
 
-out:
bio_endio(bio);
return BLK_QC_T_NONE;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 350f71d..f477327 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -738,6 +738,26 @@ extern int bioset_integrity_create(struct bio_set *, int);
 extern void bioset_integrity_free(struct bio_set *);
 extern void bio_integrity_init(void);
 
+static inline int bio_integrity_setup(struct bio *bio)
+{
+   int err = 0;
+
+   /*
+* bio_integrity_enabled also checks if the bio already has an
+* integrity payload attached. If it does, we *don't* do a
+* bio_integrity_prep here - the payload has been generated by
+* another kernel subsystem, and we just pass it through.
+*/
+   if (bio_integrity_enabled(bio)) {
+   err = bio_integrity_prep(bio);
+   if (err) {
+   bio->bi_error = err;
+   bio_endio(bio);
+   }
+   }
+   return err;
+}
-- 
2.9.3
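Why propagating the real error matters can be shown with a small hypothetical sketch (not kernel code; `submit_with_retry` and its retry policy are made up for illustration): a caller can distinguish a transient `-ENOMEM` from a hard `-EIO` only if prep returns its actual error code instead of collapsing everything to EIO.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical caller policy: retry transient allocation failures,
 * give up immediately on anything else. prep_result models the value
 * a bio_integrity_setup()-style helper would now propagate. */
static int submit_with_retry(int prep_result, int max_retries)
{
	int tries = 0;

	while (prep_result == -ENOMEM && tries++ < max_retries)
		prep_result = 0;	/* assume memory freed up on retry */
	return prep_result;
}
```

With the old open-coded hooks every failure surfaced as EIO, so a policy like this was impossible to implement above the block layer.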

[PATCH 8/8] tcm_fileio: Prevent information leak for short reads

2017-03-30 Thread Dmitry Monakhov
If we fail to read data from the backing file (probably because someone
truncated the file under us), we must zero-fill the cmd's data, otherwise
it will be returned as is. Most likely the cmd's data are uninitialized
pages from the page cache, which results in an information leak.

xfstests: generic/420
http://marc.info/?l=linux-scsi&m=149087996913448&w=2

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/target/target_core_file.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/target/target_core_file.c 
b/drivers/target/target_core_file.c
index 87aa376..d69908d 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -277,12 +277,11 @@ static int fd_do_rw(struct se_cmd *cmd, struct file *fd,
else
ret = vfs_iter_read(fd, &iter, &pos);
 
-   kfree(bvec);
-
if (is_write) {
if (ret < 0 || ret != data_length) {
pr_err("%s() write returned %d\n", __func__, ret);
-   return (ret < 0 ? ret : -EINVAL);
+   if (ret >= 0)
+   ret = -EINVAL;
}
} else {
/*
@@ -295,17 +294,27 @@ static int fd_do_rw(struct se_cmd *cmd, struct file *fd,
pr_err("%s() returned %d, expecting %u for "
"S_ISBLK\n", __func__, ret,
data_length);
-   return (ret < 0 ? ret : -EINVAL);
+   if (ret >= 0)
+   ret = -EINVAL;
}
} else {
if (ret < 0) {
pr_err("%s() returned %d for non S_ISBLK\n",
__func__, ret);
-   return ret;
+   } else if (ret != data_length) {
+   /*
+* Short read case:
+* Probably someone truncated the file under us.
+* We must explicitly zero the sg-pages to avoid
+* exposing uninitialized pages to userspace.
+*/
+   BUG_ON(ret > data_length);
+   ret += iov_iter_zero(data_length - ret, &iter);
}
}
}
-   return 1;
+   kfree(bvec);
+   return ret;
 }
 
 static sense_reason_t
-- 
2.9.3




[PATCH 2/8] bio-integrity: Do not allocate integrity context for bio w/o data

2017-03-30 Thread Dmitry Monakhov
If a bio has no data, such as the ones issued by blkdev_issue_flush(),
then we have nothing to protect.

This patch prevents a BUG_ON like the following:

kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
kernel BUG at mm/slab.c:2773!
invalid opcode:  [#1] SMP
Modules linked in: bcache
CPU: 0 PID: 4428 Comm: xfs_io Tainted: GW   4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
task: 880137786440 task.stack: c9ba8000
RIP: 0010:kfree_debugcheck+0x25/0x2a
RSP: 0018:c9babde0 EFLAGS: 00010082
RAX: 0034 RBX: ac1fa1d106742a5a RCX: 0007
RDX:  RSI:  RDI: 88013f3ccb40
RBP: c9babde8 R08:  R09: 
R10: fcb76420 R11: 725172ed R12: 0282
R13: 8150e766 R14: 88013a145e00 R15: 0001
FS:  7fb09384bf40() GS:88013f20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fd0172f9e40 CR3: 000137fa9000 CR4: 06f0
Call Trace:
 kfree+0xc8/0x1b3
 bio_integrity_free+0xc3/0x16b
 bio_free+0x25/0x66
 bio_put+0x14/0x26
 blkdev_issue_flush+0x7a/0x85
 blkdev_fsync+0x35/0x42
 vfs_fsync_range+0x8e/0x9f
 vfs_fsync+0x1c/0x1e
 do_fsync+0x31/0x4a
 SyS_fsync+0x10/0x14
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 block/bio-integrity.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5384713..b5009a8 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -175,6 +175,9 @@ bool bio_integrity_enabled(struct bio *bio)
if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
return false;
 
+   if (!bio_sectors(bio))
+   return false;
+
/* Already protected? */
if (bio_integrity(bio))
return false;
-- 
2.9.3
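Why the one-line guard matters: without it, a zero-sector flush bio is reported as needing protection, an integrity payload gets attached, and the free path later trips over state that was never meaningfully set up (the out-of-range kfree in the oops above). A toy userspace model of the fixed ordering; `mini_bio` and the helpers are hypothetical stand-ins, not the kernel's types:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy model only: a hypothetical mini-bio, not the kernel's struct bio. */
struct mini_bio {
	unsigned int nr_sectors;	/* 0 for flush-only bios */
	void *integrity;		/* PI buffer, NULL if never allocated */
};

/* Mirrors the fixed bio_integrity_enabled(): no data => nothing to protect. */
static bool mini_integrity_enabled(const struct mini_bio *bio)
{
	if (!bio->nr_sectors)
		return false;
	if (bio->integrity)		/* already protected */
		return false;
	return true;
}

static int mini_integrity_prep(struct mini_bio *bio)
{
	if (!mini_integrity_enabled(bio))
		return 0;		/* flush bio: skip allocation entirely */
	bio->integrity = calloc(bio->nr_sectors, 8); /* 8-byte PI tuple/sector */
	return bio->integrity ? 0 : -1;
}

static void mini_bio_free(struct mini_bio *bio)
{
	free(bio->integrity);		/* safe: NULL when prep was skipped */
	bio->integrity = NULL;
}
```

The design point is that the zero-data check sits in the `enabled` predicate, so every caller (prep, verify, free accounting) agrees that flush bios carry no integrity state.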




[PATCH 0/8] block: T10/DIF Fixes and cleanups

2017-03-30 Thread Dmitry Monakhov
This patch set fixes various problems spotted while testing the T10/DIF
integrity machinery.

TOC:
## General bulletproof protection for block layer
0001 Guard bvec iteration logic
## Fix various bugs in T10/DIF/DIX infrastructure
0002 bio integrity: Do not allocate integrity context for 
0003 bio integrity: save original iterator for verify stag
0004 bio integrity: bio_trim should truncate integrity vec
0005 bio integrity: fix interface for bio_integrity_trim
## Cleanup T10/DIF/DIX infrastructure
0006 bio integrity add bio_integrity_setup helper
0007 T10 Move opencoded constants to common header
## Fix tcm_fileio info leak
0008 tcm_fileio: Prevent information leak for short reads

Testcase: http://marc.info/?l=linux-scsi&m=149087997013452&w=2




Re: scsi_debug: shared dev context, BUG or FEATURE?

2017-03-28 Thread Dmitry Monakhov
"Martin K. Petersen" <martin.peter...@oracle.com> writes:

> Dmitry Monakhov <dmonak...@openvz.org> writes:
>
> Dmitry,
>
>> scsi_debug has very strange structure from one point it supports
>> dynamic number of devices but from other point context is common for
>> all devices:
>
>> So basically we may have many devices with single context which refers
>> common data. Are any sane reason to share context between devices?
>> Who use such behaviour?
>
> As the name implies, scsi_debug was conceived to debug the SCSI layer.
> Among other things, the intent was to be able to test hundreds of
> controllers and LUNs without having physical hardware or storage to back
> that up. Plus to have a target whose reporting could easily be tweaked
> to test the SCSI core code.
>
> So that's the reason for the oddball shared buffer setup. scsi_debug
> wasn't really meant to be a "useful" storage target.
>
> If you want something with a per-device backing store I suggest you look
> at the SCSI target subsystem. With tcm_loop and ramdisk you get
> essentially the same thing as scsi_debug. With the added bonus that you
> can use files or block devices if you actually want the data to be
> persistent.
Wow this is really awesome. This is exactly what I need. Thank you.
>
>> IMHO this is a pure bug. Please correct me if I'm wrong, I'll plan to
>> fix that by allocation separate context for each dev. 
>
> I don't have a problem with allowing it as an option as long as the
> original behavior can be preserved. But again, I think target mode is a
> better bet if you actually care about what's being stored on the
> "media".

>
> -- 
> Martin K. Petersen  Oracle Linux Engineering



scsi_debug: shared dev context, BUG or FEATURE?

2017-03-27 Thread Dmitry Monakhov
Hi, scsi_debug has a very strange structure:
on one hand it supports a dynamic number of devices,
but on the other hand the context is common for all devices:
 - dif_storep (array of t10 dif tuples)
 - map_storep (block map for thin provisioning)
 - fake_storep (in-memory data storage)
 - sdebug_q_arr (queue array)
So basically we may have many devices with a single context which refers
to common data. Is there any sane reason to share context between devices?
Who uses such behaviour?

IMHO this is a pure bug. Please correct me if I'm wrong; I plan to
fix that by allocating a separate context for each dev.



[PATCH] block: Invalidate cache on discard v2

2017-03-22 Thread Dmitry Monakhov
It is reasonable to drop the page cache on discard, otherwise those pages
may be written out by writeback a second later, and thin-provisioned
devices will not be happy. This also looks like a security leak in the
secure discard case.

Also add a check for the queue discard flag at an early stage.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
Reviewed-by: Christoph Hellwig <h...@lst.de>

---
 block/ioctl.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 7b88820..336610d 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -202,10 +202,16 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 {
uint64_t range[2];
uint64_t start, len;
+   struct request_queue *q = bdev_get_queue(bdev);
+   struct address_space *mapping = bdev->bd_inode->i_mapping;
+
 
if (!(mode & FMODE_WRITE))
return -EBADF;
 
+   if (!blk_queue_discard(q))
+   return -EOPNOTSUPP;
+
if (copy_from_user(range, (void __user *)arg, sizeof(range)))
return -EFAULT;
 
@@ -216,12 +222,12 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
return -EINVAL;
if (len & 511)
return -EINVAL;
-   start >>= 9;
-   len >>= 9;
 
-   if (start + len > (i_size_read(bdev->bd_inode) >> 9))
+   if (start + len > i_size_read(bdev->bd_inode))
return -EINVAL;
-   return blkdev_issue_discard(bdev, start, len, GFP_KERNEL, flags);
+   truncate_inode_pages_range(mapping, start, start + len);
+   return blkdev_issue_discard(bdev, start >> 9, len >> 9,
+   GFP_KERNEL, flags);
 }
 
 static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
-- 
2.9.3
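The point of the hunk above is unit discipline: start and len stay in bytes until the very end, so the alignment checks, the i_size bounds check, and the new truncate_inode_pages_range() call all operate on bytes, and only blkdev_issue_discard() receives sectors. A small userspace sketch of that validate-in-bytes, shift-at-the-boundary rule; the helper is hypothetical, not the kernel's blk_ioctl_discard():

```c
#include <stdint.h>

/*
 * Validate a [start, start+len) byte range against the device size,
 * then convert to 512-byte sectors. Returns 0 on success, -1 on an
 * invalid range (the kernel would return -EINVAL here).
 */
static int validate_discard_range(uint64_t start, uint64_t len,
				  uint64_t dev_bytes,
				  uint64_t *sector, uint64_t *nr_sectors)
{
	if (start & 511)
		return -1;		/* not sector aligned */
	if (len & 511)
		return -1;
	if (start + len > dev_bytes)
		return -1;		/* compare in bytes, as the fix does */
	/* Only now translate to sectors for the low-level helper. */
	*sector = start >> 9;
	*nr_sectors = len >> 9;
	return 0;
}
```

Shifting before the bounds check (as the removed lines did) forces every later byte-based consumer, such as the page-cache truncation, to shift back, which is exactly the kind of mixed-unit code the patch untangles.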
