Re: [RFC] cleanup bcache bio handling

2018-06-13 Thread Ming Lei
On Wed, Jun 13, 2018 at 10:54:09AM -0400, Kent Overstreet wrote:
> On Wed, Jun 13, 2018 at 03:56:32PM +0200, Christoph Hellwig wrote:
> > On Wed, Jun 13, 2018 at 07:06:41PM +0800, Ming Lei wrote:
> > > > before bio_alloc_pages) that can be switched to something that just
> > > > creates a single bvec.
> > > 
> > > Yes, multipage bvec shouldn't break any driver or fs.
> > 
> > It probably isn't broken, at least I didn't see assumptions of the same
> > number of segments.  However the current poking into the bio internals is
> > a bad idea for a couple of reasons.  First because it requires touching
> > bcache for any of these changes, second because it won't get merging of
> > pages into a single bio segment for bios built by bch_bio_map or
> > bch_bio_alloc_pages, and third bcache is the last user of
> > bio_for_each_chunk_all in your branch, which I'd like to kill off to
> > keep the number of iterators down.
> 
> Agreed about bio_for_each_chunk_all(), but I just looked at the patch that
> introduces them and it looks to me like there's no need, they should just be
> bio_for_each_segment_all().

Right now we can't modify the bvec via bio_for_each_segment_all(), so
bio_for_each_chunk_all() has to be used. It therefore makes sense to use
bio_add_page() to get rid of bio_for_each_chunk_all(), as sketched below.
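
For illustration, here is a minimal sketch (hypothetical helper, assuming the
bio was freshly allocated with enough bvec slots) of filling a bio via
bio_add_page() instead of writing bi_io_vec/bi_vcnt directly:

static int fill_bio_pages(struct bio *bio, unsigned int nr_pages, gfp_t gfp)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page = alloc_page(gfp);

		if (!page)
			return -ENOMEM;
		/*
		 * bio_add_page() maintains bi_vcnt and bi_iter.bi_size for
		 * us, which is also what would let contiguous pages be
		 * merged into one segment once multipage bvecs land.
		 */
		if (bio_add_page(bio, page, PAGE_SIZE, 0) != PAGE_SIZE) {
			__free_page(page);
			return -ENOMEM;
		}
	}
	return 0;
}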


Thanks,
Ming


[PATCH] dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-13 Thread Luis R. Rodriguez
Setting up zoned disks in a generic way is not trivial. There is also
quite a bit of tribal knowledge around these devices which is not
easy to find.

The currently supplied demo script works, but it is not generic enough to be
practical for Linux distributions or even for developers, who often move
from one kernel to another.

This tries to capture a bit of that tribal knowledge in an initial udev
rule for development, with the hope that Linux distributions can later
deploy it. Three rules are added. One rule is optional for now; it should be
extended later to be more distribution-friendly, and then I think this
may be ready for consideration for integration into distributions.

1) scheduler setup
2) blacklist f2fs devices
3) run dmsetup for the rest of devices

Note that this udev rule will not work well if you want to use a disk
with f2fs on one part of the disk and another filesystem on another part
of the disk. That setup will require manual love, so such setups can use
the same blacklist in rule 2).

It's not widely known, for instance, that as of v4.16 it is mandated to use
either the deadline or mq-deadline scheduler for *all* SMR drives. It's
also been determined that the Linux kernel is not the place to set this up,
so a udev rule *is required* as per the latest discussions. This is the
first rule we add.

Furthermore, if you are *not* using f2fs you always have to run dmsetup.
dmsetup does not persist, so you currently *always* have to run a custom
sort of script, which is not ideal for Linux distributions. We can invert
this logic into a udev rule to let users blacklist disks they know they
want to use f2fs on. This is the second, optional rule. This blacklisting
can be generalized further in the future with an exception list file, for
instance using IMPORT{db} or the like.

The third and final rule added then runs dmsetup for the rest of the disks
using the disk serial number for the new device mapper name.

Note that it is currently easy for users to make a mistake and run mkfs
on the original disk rather than the /dev/mapper/ device for non-f2fs
arrangements. If that is done, experience shows things can easily fall
apart with alignment *eventually*. We have no generic way today to
error out on this condition and proactively prevent it.

Signed-off-by: Luis R. Rodriguez 
---
 README| 10 +-
 udev/99-zoned-disks.rules | 78 +++
 2 files changed, 87 insertions(+), 1 deletion(-)
 create mode 100644 udev/99-zoned-disks.rules

diff --git a/README b/README
index 65e96c34fd04..f49541eaabc8 100644
--- a/README
+++ b/README
@@ -168,7 +168,15 @@ Options:
  reclaiming random zones if the percentage of
  free random data zones falls below .
 
-V. Example scripts
+V. Udev zone disk deployment
+
+
+A udev rule is provided which enables you to set the IO scheduler, blacklist
+drives dedicated to f2fs, and run dmsetup for the rest of the zoned drives.
+If you use this udev rule the below script is not needed. Be sure to run mkfs
+only on the resulting /dev/mapper/zone-$serial device you end up with.
+
+VI. Example scripts
 ==
 
 [[
diff --git a/udev/99-zoned-disks.rules b/udev/99-zoned-disks.rules
new file mode 100644
index ..e19b738dcc0e
--- /dev/null
+++ b/udev/99-zoned-disks.rules
@@ -0,0 +1,78 @@
+# To use a zoned disk, first you need to:
+#
+# 1) Enable zone disk support in your kernel
+# 2) Use the deadline or mq-deadline scheduler for it - mandated as of v4.16
+# 3) Blacklist devices dedicated for f2fs as of v4.10
+# 4) Run dmsetup on the other disks
+# 5) Create the filesystem -- NOTE: use mkfs /dev/mapper/zone-serial if
+#you enabled dmsetup on the disk.
+# 6) Consider using the nofail mount option in case you run an unsupported kernel
+#
+# You can use this udev rules file for 2) 3) and 4). Further details below.
+#
+# 1) Enable zone disk support in your kernel
+#
+#o CONFIG_BLK_DEV_ZONED
+#o CONFIG_DM_ZONED
+#
+# This will let the kernel actually see these devices, i.e., via fdisk /dev/sda
+# for instance. Run:
+#
+#  dmzadm --format /dev/sda
+
+# 2) Set deadline or mq-deadline for all disks which are zoned
+#
+# Zoned disks can only work with the deadline or mq-deadline scheduler. This is
+# mandated for all SMR drives since v4.16. It has been determined this must be
+# done through a udev rule, and the kernel should not set this up for disks.
+# This magic will have to live for *all* zoned disks.
+# XXX: what about distributions that want mq-deadline ? Probably easy for now
+#  to assume deadline and later have a mapping file to enable
+#  mq-deadline for specific serial devices?
+ACTION=="add|change", KERNEL=="sd*[!0-9]", ATTRS{queue/zoned}=="host-managed", \
+   ATTR{queue/scheduler}="deadline"
+
+# 3) Blacklist f2fs devices as of v4.10
+# We don't have to run dmsetup on disks where you want to use f2fs, so you
+# can use this rule to skip 

Re: blktests block/019 lead system hang

2018-06-13 Thread Austin.Bolen
On 6/13/2018 10:41 AM, Keith Busch wrote:
> Thanks for the feedback!
>
> This test does indeed toggle the Link Control Link Disable bit to simulate
> the link failure. The PCIe specification specifically covers this case
> in Section 3.2.1, Data Link Control and Management State Machine Rules:
>
>   If the Link Disable bit has been Set by software, then the subsequent
>   transition to DL_Inactive must not be considered an error.
Forgot to mention... this PCIe requirement to not treat Link Disable = 1
as an error is a requirement on the PCIe hardware and not on the platform
side.  So if you want to test this PCIe spec requirement then you need a
way to ascertain whether the PCIe hardware or the platform is causing the
error.  Additionally, setting Link Disable is not what is causing the
error in this specific case.  The error is coming from a subsequent MMIO
access that causes a UR since the link is down.

-Austin


> So this test should suppress any Surprise Down Error events, but handling
> that particular event wasn't the intent of the test (and as you mentioned,
> it ought not occur anyway since the slot is HP Surprise capable).
>
> The test should not suppress reporting the Data Link Layer State Changed
> slot status. And while this doesn't trigger a Slot PDC status, triggering
> a DLLSC should occur since the Link Status DLLLA should go to 0 when the
> state machine goes from DL_Active to DL_Down, regardless of whether a
> Surprise Down Error was detected.
>
> The Linux PCIEHP driver handles a DLLSC link-down event the same as
> a presence detect remove event, and that's part of what this test was
> trying to cover.
>



Re: [PATCH] lightnvm: pblk: add asynchronous partial read

2018-06-13 Thread Igor Konopko
On 12.06.2018 10:09, Matias Bjørling wrote:

On 06/12/2018 04:59 PM, Javier Gonzalez wrote:

On 11 Jun 2018, at 22.53, Heiner Litz  wrote:

In the read path, partial reads are currently performed synchronously
which affects performance for workloads that generate many partial
reads. This patch adds an asynchronous partial read path as well as
the required partial read ctx.
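
For orientation, a sketch of the fields the partial read ctx apparently
carries, inferred from its usage in the diff below (the authoritative
definition is in the pblk.h hunk, which is truncated here; field types
are assumptions):

/* Sketch only -- inferred from the call sites below. */
struct pblk_pr_ctx {
	struct bio *orig_bio;		/* user bio completed in the end_io path */
	unsigned long bitmap;		/* sectors already served from cache */
	unsigned int orig_nr_secs;	/* sector count of the original request */
	unsigned int bio_init_idx;	/* first bvec index of orig_bio */
	void *ppa_ptr;			/* saved ppa_list (single-hole case) */
	dma_addr_t dma_ppa_list;	/* saved DMA address of the ppa_list */
};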

Signed-off-by: Heiner Litz 
---
drivers/lightnvm/pblk-read.c | 179 ---
drivers/lightnvm/pblk.h  |  10 +++
2 files changed, 128 insertions(+), 61 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 7570ff6..026c708 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -231,74 +231,36 @@ static void pblk_end_io_read(struct nvm_rq *rqd)
__pblk_end_io_read(pblk, rqd, true);
}

-static int pblk_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
- struct bio *orig_bio, unsigned int bio_init_idx,
- unsigned long *read_bitmap)
+static void pblk_end_partial_read(struct nvm_rq *rqd)
{
-    struct pblk_sec_meta *meta_list = rqd->meta_list;
-    struct bio *new_bio;
+    struct pblk *pblk = rqd->private;
+    struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
+    struct pblk_pr_ctx *pr_ctx = r_ctx->private;
+    struct bio *new_bio = rqd->bio;
+    struct bio *bio = pr_ctx->orig_bio;
struct bio_vec src_bv, dst_bv;
-    void *ppa_ptr = NULL;
-    void *src_p, *dst_p;
-    dma_addr_t dma_ppa_list = 0;
-    __le64 *lba_list_mem, *lba_list_media;
-    int nr_secs = rqd->nr_ppas;
+    struct pblk_sec_meta *meta_list = rqd->meta_list;
+    int bio_init_idx = pr_ctx->bio_init_idx;
+    unsigned long *read_bitmap = &pr_ctx->bitmap;
+    int nr_secs = pr_ctx->orig_nr_secs;
int nr_holes = nr_secs - bitmap_weight(read_bitmap, nr_secs);
-    int i, ret, hole;
-
-    /* Re-use allocated memory for intermediate lbas */
-    lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
-    lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size);
-
-    new_bio = bio_alloc(GFP_KERNEL, nr_holes);
-
-    if (pblk_bio_add_pages(pblk, new_bio, GFP_KERNEL, nr_holes))
-    goto err;
-
-    if (nr_holes != new_bio->bi_vcnt) {
-    pr_err("pblk: malformed bio\n");
-    goto err;
-    }
-
-    for (i = 0; i < nr_secs; i++)
-    lba_list_mem[i] = meta_list[i].lba;
-
-    new_bio->bi_iter.bi_sector = 0; /* internal bio */
-    bio_set_op_attrs(new_bio, REQ_OP_READ, 0);
-
-    rqd->bio = new_bio;
-    rqd->nr_ppas = nr_holes;
-    rqd->flags = pblk_set_read_mode(pblk, PBLK_READ_RANDOM);
-
-    if (unlikely(nr_holes == 1)) {
-    ppa_ptr = rqd->ppa_list;
-    dma_ppa_list = rqd->dma_ppa_list;
-    rqd->ppa_addr = rqd->ppa_list[0];
-    }
-
-    ret = pblk_submit_io_sync(pblk, rqd);
-    if (ret) {
-    bio_put(rqd->bio);
-    pr_err("pblk: sync read IO submission failed\n");
-    goto err;
-    }
-
-    if (rqd->error) {
-    atomic_long_inc(&pblk->read_failed);
-#ifdef CONFIG_NVM_DEBUG
-    pblk_print_failed_rqd(pblk, rqd, rqd->error);
-#endif
-    }
+    __le64 *lba_list_mem, *lba_list_media;
+    void *src_p, *dst_p;
+    int hole, i;

if (unlikely(nr_holes == 1)) {
    struct ppa_addr ppa;

    ppa = rqd->ppa_addr;
-    rqd->ppa_list = ppa_ptr;
-    rqd->dma_ppa_list = dma_ppa_list;
+    rqd->ppa_list = pr_ctx->ppa_ptr;
+    rqd->dma_ppa_list = pr_ctx->dma_ppa_list;
    rqd->ppa_list[0] = ppa;
}

+    /* Re-use allocated memory for intermediate lbas */
+    lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
+    lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size);
+
for (i = 0; i < nr_secs; i++) {
    lba_list_media[i] = meta_list[i].lba;
    meta_list[i].lba = lba_list_mem[i];
@@ -316,7 +278,7 @@ static int pblk_partial_read(struct pblk *pblk, struct nvm_rq *rqd,

    meta_list[hole].lba = lba_list_media[i];

    src_bv = new_bio->bi_io_vec[i++];
-    dst_bv = orig_bio->bi_io_vec[bio_init_idx + hole];
+    dst_bv = bio->bi_io_vec[bio_init_idx + hole];

    src_p = kmap_atomic(src_bv.bv_page);
    dst_p = kmap_atomic(dst_bv.bv_page);
@@ -334,19 +296,107 @@ static int pblk_partial_read(struct pblk *pblk, struct nvm_rq *rqd,

} while (hole < nr_secs);

bio_put(new_bio);
+    kfree(pr_ctx);

/* restore original request */
rqd->bio = NULL;
rqd->nr_ppas = nr_secs;

+    bio_endio(bio);
__pblk_end_io_read(pblk, rqd, false);
-    return NVM_IO_DONE;
+}
+
+static int pblk_setup_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
+    unsigned int bio_init_idx,
+    unsigned long *read_bitmap,
+    int nr_holes)
+{
+    struct pblk_sec_meta *meta_list = rqd->meta_list;
+    struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
+    struct pblk_pr_ctx *pr_ctx;
+    struct bio *new_bio, *bio = r_ctx->private;
+    __le64 

Re: blktests block/019 lead system hang

2018-06-13 Thread Austin.Bolen
On 6/13/2018 10:41 AM, Keith Busch wrote:
> Thanks for the feedback!
> This test does indeed toggle the Link Control Link Disable bit to simulate
> the link failure. The PCIe specification specifically covers this case
> in Section 3.2.1, Data Link Control and Management State Machine Rules:
>
>   If the Link Disable bit has been Set by software, then the subsequent
>   transition to DL_Inactive must not be considered an error.
>
> So this test should suppress any Surprise Down Error events, but handling
> that particular event wasn't the intent of the test (and as you mentioned,
> it ought not occur anyway since the slot is HP Surprise capable).
>
> The test should not suppress reporting the Data Link Layer State Changed
> slot status. And while this doesn't trigger a Slot PDC status, triggering
> a DLLSC should occur since the Link Status DLLLA should go to 0 when the
> state machine goes from DL_Active to DL_Down, regardless of whether a
> Surprise Down Error was detected.
>
> The Linux PCIEHP driver handles a DLLSC link-down event the same as
> a presence detect remove event, and that's part of what this test was
> trying to cover.

Yes, the R730 could mask the error if OS sets Data Link Layer State
Changed Enable = 1 and could let the OS handle the hot-plug event
similar to what is done for surprise removal.  Current platform policy
on R730 is to not do that and only suppress errors related to physical
surprise removal (PDS = 0).  We'll probably forgo the option of
suppressing any non-surprise remove link down errors even if OS sets
Data Link Layer State Changed Enable = 1 and go straight to the
containment error recovery model for DPC once the architecture is
finalized to handle these non-surprise-remove related errors.  In the
meantime, it is expected (though not ideal) that this family of servers
will crash for this particular test.  Ditto for the test that disables
Memory Space Enable bit in the command register.

-Austin




Re: [Backport request] blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers

2018-06-13 Thread Greg KH
On Wed, Jun 13, 2018 at 03:36:42PM +, Bart Van Assche wrote:
> Hello stable kernel maintainers,
> 
> Please backport patch 327ea4adcfa3 ("blkdev_report_zones_ioctl():
> Use vmalloc() to allocate large buffers") to at least the v4.17.x and
> v4.14.y stable kernel series. That patch fixes a bug introduced in
> kernel v4.10. The entire patch is shown below.

It's already in my queue to apply, and I normally wait until after it
shows up in a -rc release before queueing it up.

But I'll go do so now, thanks.

greg k-h


Re: blktests block/019 lead system hang

2018-06-13 Thread Keith Busch
On Tue, Jun 12, 2018 at 04:41:54PM -0700, austin.bo...@dell.com wrote:
> It looks like the test is setting the Link Disable bit.  But this is not
> a good simulation for hot-plug surprise removal testing or surprise link
> down (SLD) testing, if that is the intent.  One reason is that Link
> Disable does not invoke SLD semantics per PCIe spec.  This is somewhat
> of a moot point in this case since the switch has Hot-Plug Surprise bit
> set which also masks the SLD semantics in PCIe.
> 
> Also, the Hot-Plug Capable + Surprise Hot-Plug bits set means the
> platform can tolerate the case where "an adapter present in this slot
> might be removed from the system without any prior notification".  It
> does not mean that a system can survive link down under any other
> circumstances such as setting Link Disable or generating a Secondary Bus
> Reset or a true surprise link down event.  To the earlier point, I also
> do not know of any way the OS can know a priori if the platform can
> handle surprise link down outside of surprise remove case.  We can look
> at standardizing a way to do that if OSes find it useful to know.
> 
> Relative to this particular error, Link Disable doesn't clear Presence
> Detect State which would happen on a real Surprise Hot-Plug removal
> event and this is probably why the system crashes.  What will happen is
> that after the link goes to disabled state, the ongoing I/O will cause
> MMIO accesses on the drive and that will cause a UR which is an
> uncorrectable PCIe error (ERR_FATAL on R730).  The BIOS on the R730 is
> surprise remove aware (Surprise Hot-Plug = 1) and so it will check if
> the device is still present by checking Presence Detect State.  If the
> device is not present it will mask the error and let the OS handle the
> device removal due to hot-plug interrupt(s).  If the device is present,
> as in this case, then the BIOS will escalate to OS as a fatal NMI
> (current R730 platform policy is to only mask errors due to removal).
> 
> For future, these servers may report these sort of errors as recoverable
> via the GHES structures in APEI which will allow the OS to recover from
> this non-surprise remove class of error as well.  In the (hopefully
> near) future, the industry will move to DPC as the framework for this
> sort of generic PCIe error handling/recovery but there are architectural
> changes needed that are currently being defined in the relevant
> standards bodies.  Once the architecture is defined it can be
> implemented and tested to verify these sort of test cases pass.

Thanks for the feedback!

This test does indeed toggle the Link Control Link Disable bit to simulate
the link failure. The PCIe specification specifically covers this case
in Section 3.2.1, Data Link Control and Management State Machine Rules:

  If the Link Disable bit has been Set by software, then the subsequent
  transition to DL_Inactive must not be considered an error.

So this test should suppress any Surprise Down Error events, but handling
that particular event wasn't the intent of the test (and as you mentioned,
it ought not occur anyway since the slot is HP Surprise capable).

The test should not suppress reporting the Data Link Layer State Changed
slot status. And while this doesn't trigger a Slot PDC status, triggering
a DLLSC should occur since the Link Status DLLLA should go to 0 when the
state machine goes from DL_Active to DL_Down, regardless of whether a
Surprise Down Error was detected.

The Linux PCIEHP driver handles a DLLSC link-down event the same as
a presence detect remove event, and that's part of what this test was
trying to cover.


[Backport request] blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers

2018-06-13 Thread Bart Van Assche
Hello stable kernel maintainers,

Please backport patch 327ea4adcfa3 ("blkdev_report_zones_ioctl():
Use vmalloc() to allocate large buffers") to at least the v4.17.x and
v4.14.y stable kernel series. That patch fixes a bug introduced in
kernel v4.10. The entire patch is shown below.

Thanks,

Bart.


commit cf0110698846fc5a93df89eb20ac7cc70a860c17
Author: Bart Van Assche 
Date:   Tue May 22 08:27:22 2018 -0700

blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers

Avoid that complaints similar to the following appear in the kernel log
if the number of zones is sufficiently large:

  fio: page allocation failure: order:9, 
mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
  Call Trace:
  dump_stack+0x63/0x88
  warn_alloc+0xf5/0x190
  __alloc_pages_slowpath+0x8f0/0xb0d
  __alloc_pages_nodemask+0x242/0x260
  alloc_pages_current+0x6a/0xb0
  kmalloc_order+0x18/0x50
  kmalloc_order_trace+0x26/0xb0
  __kmalloc+0x20e/0x220
  blkdev_report_zones_ioctl+0xa5/0x1a0
  blkdev_ioctl+0x1ba/0x930
  block_ioctl+0x41/0x50
  do_vfs_ioctl+0xaa/0x610
  SyS_ioctl+0x79/0x90
  do_syscall_64+0x79/0x1b0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
Signed-off-by: Bart Van Assche 
Cc: Shaun Tancheff 
Cc: Damien Le Moal 
Cc: Christoph Hellwig 
Cc: Martin K. Petersen 
Cc: Hannes Reinecke 
Cc: 
Signed-off-by: Jens Axboe 

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 08e84ef2bc05..3d08dc84db16 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -328,7 +328,11 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
if (!rep.nr_zones)
return -EINVAL;
 
-   zones = kcalloc(rep.nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
+   if (rep.nr_zones > INT_MAX / sizeof(struct blk_zone))
+   return -ERANGE;
+
+   zones = kvmalloc(rep.nr_zones * sizeof(struct blk_zone),
+   GFP_KERNEL | __GFP_ZERO);
if (!zones)
return -ENOMEM;
 
@@ -350,7 +354,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
}
 
  out:
-   kfree(zones);
+   kvfree(zones);
 
return ret;
 }

Re: [PATCH] blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers

2018-06-13 Thread Jens Axboe
On 6/13/18 9:20 AM, Bart Van Assche wrote:
> On 05/22/18 10:58, Jens Axboe wrote:
>> On 5/22/18 9:27 AM, Bart Van Assche wrote:
>>> Avoid that complaints similar to the following appear in the kernel log
>>> if the number of zones is sufficiently large:
>>>
>>>fio: page allocation failure: order:9, 
>>> mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
>>>Call Trace:
>>>dump_stack+0x63/0x88
>>>warn_alloc+0xf5/0x190
>>>__alloc_pages_slowpath+0x8f0/0xb0d
>>>__alloc_pages_nodemask+0x242/0x260
>>>alloc_pages_current+0x6a/0xb0
>>>kmalloc_order+0x18/0x50
>>>kmalloc_order_trace+0x26/0xb0
>>>__kmalloc+0x20e/0x220
>>>blkdev_report_zones_ioctl+0xa5/0x1a0
>>>blkdev_ioctl+0x1ba/0x930
>>>block_ioctl+0x41/0x50
>>>do_vfs_ioctl+0xaa/0x610
>>>SyS_ioctl+0x79/0x90
>>>do_syscall_64+0x79/0x1b0
>>>entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>
>> Applied, thanks.
> 
> Thank you Jens. This patch (commit 327ea4adcfa3) is now in Linus' tree.
> We would like this patch to appear in the v4.14 and v4.17 kernel series 
> too. Since this patch does not have a stable tag, how do you want us
> to proceed? Do you want to send this patch yourself to Greg or do you
> rather expect us to do that?

You can do it - just send an email to greg/stable asking for that commit
sha to be marked for stable.

-- 
Jens Axboe



Re: [PATCH] blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers

2018-06-13 Thread Bart Van Assche

On 05/22/18 10:58, Jens Axboe wrote:

On 5/22/18 9:27 AM, Bart Van Assche wrote:

Avoid that complaints similar to the following appear in the kernel log
if the number of zones is sufficiently large:

   fio: page allocation failure: order:9, 
mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
   Call Trace:
   dump_stack+0x63/0x88
   warn_alloc+0xf5/0x190
   __alloc_pages_slowpath+0x8f0/0xb0d
   __alloc_pages_nodemask+0x242/0x260
   alloc_pages_current+0x6a/0xb0
   kmalloc_order+0x18/0x50
   kmalloc_order_trace+0x26/0xb0
   __kmalloc+0x20e/0x220
   blkdev_report_zones_ioctl+0xa5/0x1a0
   blkdev_ioctl+0x1ba/0x930
   block_ioctl+0x41/0x50
   do_vfs_ioctl+0xaa/0x610
   SyS_ioctl+0x79/0x90
   do_syscall_64+0x79/0x1b0
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2


Applied, thanks.


Thank you Jens. This patch (commit 327ea4adcfa3) is now in Linus' tree.
We would like this patch to appear in the v4.14 and v4.17 kernel series
too. Since this patch does not have a stable tag, how do you want us
to proceed? Do you want to send this patch yourself to Greg or do you
rather expect us to do that?

Thanks,

Bart.


Re: [RFC] cleanup bcache bio handling

2018-06-13 Thread Kent Overstreet
On Wed, Jun 13, 2018 at 03:56:32PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 13, 2018 at 07:06:41PM +0800, Ming Lei wrote:
> > > before bio_alloc_pages) that can be switched to something that just
> > > creates a single bvec.
> > 
> > Yes, multipage bvec shouldn't break any driver or fs.
> 
> It probably isn't broken, at least I didn't see assumptions of the same
> number of segments.  However the current poking into the bio internals is
> a bad idea for a couple of reasons.  First because it requires touching
> bcache for any of these changes, second because it won't get merging of
> pages into a single bio segment for bios built by bch_bio_map or
> bch_bio_alloc_pages, and third bcache is the last user of
> bio_for_each_chunk_all in your branch, which I'd like to kill off to
> keep the number of iterators down.

Agreed about bio_for_each_chunk_all(), but I just looked at the patch that
introduces them and it looks to me like there's no need, they should just be
bio_for_each_segment_all().

Converting bch_bio_map() and bch_bio_alloc_pages() to bio_add_page() is fine by
me, but your patch series doesn't do any of those actual cleanups: your
description of the patch series does not actually match what it does.


Re: [PATCH 1/6] block: add a bio_reuse helper

2018-06-13 Thread Kent Overstreet
On Wed, Jun 13, 2018 at 03:59:15PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 13, 2018 at 04:54:41AM -0400, Kent Overstreet wrote:
> > bi_size is not immutable though, it will usually be modified by drivers
> > when you submit a bio.
> > 
> > I see what you're trying to do, but your approach is busted given the way
> > the block layer works today. You'd have to save bio->bi_iter before
> > submitting the bio and restore it afterwards for it to work.
> 
> For bi_size, agreed this needs fixing.  bi_sector is always restored
> already by the callers, and the remaining fields are zeroed by bio_reset,
> which does the right thing.

This still shouldn't be a new helper. If the caller is restoring bi_sector it
can restore bi_size as well, and restoring only part of bi_iter should not be
encouraged, as it's not safe in general. It is quite a reasonable thing to want
to restore a bio to the state it was in before submission (e.g. for bouncing),
and what you're doing is definitely _not_ safe in general.
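
For reference, a minimal sketch of the save/restore pattern described above
(hypothetical caller-side code, not from this series):

static void submit_and_restore(struct bio *bio)
{
	struct bvec_iter saved_iter = bio->bi_iter;

	/* the driver may advance bi_iter (including bi_size) on completion */
	submit_bio_wait(bio);

	/* restore the whole iterator before the bio is reused */
	bio->bi_iter = saved_iter;
}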


[PATCH 1/6] block: add a bio_reuse helper

2018-06-13 Thread Christoph Hellwig
This abstracts out a way to reuse a bio without destroying the
bio vectors containing the data.

Signed-off-by: Christoph Hellwig 
---
 block/bio.c | 19 +++
 include/linux/bio.h |  1 +
 2 files changed, 20 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 70c4e1b6dd45..67393ab9abe0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -308,6 +308,25 @@ void bio_reset(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_reset);
 
+/**
+ * bio_reuse - prepare a bio for reuse
+ * @bio:   bio to reuse
+ * @size:  size of the bio 
+ *
+ * Prepares an already set up and possibly used bio for reuse.  Compared
+ * to bio_reset() this preserves the setup of the bio vectors containing
+ * the data.
+ */
+void bio_reuse(struct bio *bio, unsigned size)
+{
+   unsigned short vcnt = bio->bi_vcnt;
+
+   bio_reset(bio);
+   bio->bi_iter.bi_size = size;
+   bio->bi_vcnt = vcnt;
+}
+EXPORT_SYMBOL_GPL(bio_reuse);
+
 static struct bio *__bio_chain_endio(struct bio *bio)
 {
struct bio *parent = bio->bi_private;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f08f5fe7bd08..d114ccc4bac2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -475,6 +475,7 @@ extern void bio_init(struct bio *bio, struct bio_vec *table,
 unsigned short max_vecs);
 extern void bio_uninit(struct bio *);
 extern void bio_reset(struct bio *);
+void bio_reuse(struct bio *, unsigned int);
 void bio_chain(struct bio *, struct bio *);
 
extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
-- 
2.17.1



[PATCH 4/6] bcache: clean up bio reuse for struct dirty_io

2018-06-13 Thread Christoph Hellwig
Instead of reinitializing the bio every time, we can call bio_reuse when
reusing it.  Also moves the private data initialization out of
dirty_init, which is renamed to suit the remaining functionality.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/bcache/writeback.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index ad45ebe1a74b..0b6d07eab87c 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -179,19 +179,12 @@ struct dirty_io {
struct bio  bio;
 };
 
-static void dirty_init(struct keybuf_key *w)
+static void dirty_init_prio(struct keybuf_key *w)
 {
struct dirty_io *io = w->private;
-   struct bio *bio = &io->bio;

-   bio_init(bio, bio->bi_inline_vecs,
-            DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS));
if (!io->dc->writeback_percent)
-   bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
-
-   bio->bi_iter.bi_size    = KEY_SIZE(&w->key) << 9;
-   bio->bi_private = w;
-   bch_bio_map(bio, NULL);
+   bio_set_prio(&io->bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
 }
 
 static void dirty_io_destructor(struct closure *cl)
@@ -285,10 +278,12 @@ static void write_dirty(struct closure *cl)
 * to clean up.
 */
if (KEY_DIRTY(&w->key)) {
-   dirty_init(w);
-   bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
+   bio_reuse(&io->bio, KEY_SIZE(&w->key) << 9);
+   dirty_init_prio(w);
+   io->bio.bi_opf = REQ_OP_WRITE;
io->bio.bi_iter.bi_sector = KEY_START(&w->key);
bio_set_dev(&io->bio, io->dc->bdev);
+   io->bio.bi_private  = w;
io->bio.bi_end_io   = dirty_endio;
 
/* I/O request sent to backing device */
@@ -399,13 +394,18 @@ static void read_dirty(struct cached_dev *dc)
io->dc  = dc;
io->sequence= sequence++;
 
-   dirty_init(w);
-   bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
+   bio_init(&io->bio, io->bio.bi_inline_vecs,
+   DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS));
+   dirty_init_prio(w);
+   io->bio.bi_opf = REQ_OP_READ;
+   io->bio.bi_iter.bi_size = KEY_SIZE(&w->key) << 9;
io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
bio_set_dev(&io->bio,
PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
+   io->bio.bi_private  = w;
io->bio.bi_end_io   = read_dirty_endio;

+   bch_bio_map(&io->bio, NULL);
if (bch_bio_alloc_pages(&io->bio, GFP_KERNEL))
goto err_free;
 
-- 
2.17.1



[PATCH 5/6] bcache: don't clone bio in bch_data_verify

2018-06-13 Thread Christoph Hellwig
We immediately overwrite the biovec array, so instead just allocate
a new bio and copy over the disk, sector and size.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/bcache/debug.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index d030ce3025a6..04d146711950 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -110,11 +110,15 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
struct bio_vec bv, cbv;
struct bvec_iter iter, citer = { 0 };
 
-   check = bio_clone_kmalloc(bio, GFP_NOIO);
+   check = bio_kmalloc(GFP_NOIO, bio_segments(bio));
if (!check)
return;
+   check->bi_disk = bio->bi_disk;
check->bi_opf = REQ_OP_READ;
+   check->bi_iter.bi_sector = bio->bi_iter.bi_sector;
+   check->bi_iter.bi_size = bio->bi_iter.bi_size;
 
+   bch_bio_map(check, NULL);
if (bch_bio_alloc_pages(check, GFP_NOIO))
goto out_put;
 
-- 
2.17.1



[PATCH 6/6] bcache: use bio_add_page instead of open coded bio manipulation

2018-06-13 Thread Christoph Hellwig
Let bch_bio_alloc_pages and bch_bio_map set up the bio vec information
and bi_size.  This also means no additional bch_bio_map call with
a NULL argument is needed before bch_bio_alloc_pages.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/bcache/btree.c | 16 +++-
 drivers/md/bcache/debug.c |  7 +---
 drivers/md/bcache/journal.c   |  8 +---
 drivers/md/bcache/movinggc.c  |  4 +-
 drivers/md/bcache/request.c   |  5 +--
 drivers/md/bcache/super.c |  8 +---
 drivers/md/bcache/util.c  | 74 +++
 drivers/md/bcache/util.h  |  4 +-
 drivers/md/bcache/writeback.c |  4 +-
 9 files changed, 52 insertions(+), 78 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2a0968c04e21..0f585fa9051f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -298,12 +298,11 @@ static void bch_btree_node_read(struct btree *b)
closure_init_stack(&cl);

bio = bch_bbio_alloc(b->c);
-   bio->bi_iter.bi_size = KEY_SIZE(&b->key) << 9;
bio->bi_end_io  = btree_node_read_endio;
bio->bi_private = &cl;
bio->bi_opf = REQ_OP_READ | REQ_META;

-   bch_bio_map(bio, b->keys.set[0].data);
+   bch_bio_map(bio, b->keys.set[0].data, KEY_SIZE(&b->key) << 9);
 
bch_submit_bbio(bio, b->c, >key, 0);
closure_sync();
@@ -386,19 +385,19 @@ static void do_btree_node_write(struct btree *b)
 {
struct closure *cl = &b->io;
struct bset *i = btree_bset_last(b);
+   size_t size;
BKEY_PADDED(key) k;
 
i->version  = BCACHE_BSET_VERSION;
i->csum = btree_csum_set(b, i);
 
+   size = roundup(set_bytes(i), block_bytes(b->c));
+
BUG_ON(b->bio);
b->bio = bch_bbio_alloc(b->c);
-
b->bio->bi_end_io   = btree_node_write_endio;
b->bio->bi_private  = cl;
-   b->bio->bi_iter.bi_size = roundup(set_bytes(i), block_bytes(b->c));
b->bio->bi_opf  = REQ_OP_WRITE | REQ_META | REQ_FUA;
-   bch_bio_map(b->bio, i);
 
/*
 * If we're appending to a leaf node, we don't technically need FUA -
@@ -419,7 +418,7 @@ static void do_btree_node_write(struct btree *b)
SET_PTR_OFFSET(&k.key, 0, PTR_OFFSET(&k.key, 0) +
   bset_sector_offset(&b->keys, i));
 
-   if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
+   if (!bch_bio_alloc_pages(b->bio, size, __GFP_NOWARN | GFP_NOWAIT)) {
int j;
struct bio_vec *bv;
void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
@@ -432,10 +431,7 @@ static void do_btree_node_write(struct btree *b)
 
continue_at(cl, btree_node_write_done, NULL);
} else {
-   /* No problem for multipage bvec since the bio is just 
allocated */
-   b->bio->bi_vcnt = 0;
-   bch_bio_map(b->bio, i);
-
+   bch_bio_map(b->bio, i, size);
bch_submit_bbio(b->bio, b->c, &k.key, 0);
 
closure_sync(cl);
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 04d146711950..b089597fb607 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -52,9 +52,8 @@ void bch_btree_verify(struct btree *b)
bio = bch_bbio_alloc(b->c);
bio_set_dev(bio, PTR_CACHE(b->c, &b->key, 0)->bdev);
bio->bi_iter.bi_sector  = PTR_OFFSET(&b->key, 0);
-   bio->bi_iter.bi_size= KEY_SIZE(&b->key) << 9;
bio->bi_opf = REQ_OP_READ | REQ_META;
-   bch_bio_map(bio, sorted);
+   bch_bio_map(bio, sorted, KEY_SIZE(&b->key) << 9);
 
submit_bio_wait(bio);
bch_bbio_free(bio, b->c);
@@ -116,10 +115,8 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
check->bi_disk = bio->bi_disk;
check->bi_opf = REQ_OP_READ;
check->bi_iter.bi_sector = bio->bi_iter.bi_sector;
-   check->bi_iter.bi_size = bio->bi_iter.bi_size;
 
-   bch_bio_map(check, NULL);
-   if (bch_bio_alloc_pages(check, GFP_NOIO))
+   if (bch_bio_alloc_pages(check, bio->bi_iter.bi_size, GFP_NOIO))
goto out_put;
 
submit_bio_wait(check);
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 18f1b5239620..44c3fc5f3b0a 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -55,12 +55,10 @@ reread: left = ca->sb.bucket_size - offset;
bio_reset(bio);
bio->bi_iter.bi_sector  = bucket + offset;
bio_set_dev(bio, ca->bdev);
-   bio->bi_iter.bi_size= len << 9;
-
bio->bi_end_io  = journal_read_endio;
bio->bi_private = &cl;
bio_set_op_attrs(bio, REQ_OP_READ, 0);
-   bch_bio_map(bio, data);
+   bch_bio_map(bio, data, len << 9);
 
closure_bio_submit(ca->set, bio, &cl);
closure_sync();
@@ -652,13 +650,11 @@ static void 
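
Judging from the call sites above, the reworked helpers take the size
explicitly; a sketch of the implied prototypes (an assumption -- the actual
util.h/util.c hunks are truncated above):

/* Prototypes implied by the callers above; sketch only. */
void bch_bio_map(struct bio *bio, void *base, size_t size);
int bch_bio_alloc_pages(struct bio *bio, size_t size, gfp_t gfp_mask);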

[PATCH 2/6] bcache: use bio_reuse instead of bio_reinit where applicable

2018-06-13 Thread Christoph Hellwig
Use the bio_reuse helper instead of rebuilding the bio_vecs and
size for bios that get reused for the same data.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/bcache/request.c | 5 +
 drivers/md/bcache/super.c   | 6 ++
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ae67f5fa8047..3e699be2e79b 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -810,12 +810,9 @@ static void cached_dev_read_done(struct closure *cl)
 */
 
if (s->iop.bio) {
-   bio_reset(s->iop.bio);
+   bio_reuse(s->iop.bio, s->insert_bio_sectors << 9);
s->iop.bio->bi_iter.bi_sector = s->cache_miss->bi_iter.bi_sector;
bio_copy_dev(s->iop.bio, s->cache_miss);
-   s->iop.bio->bi_iter.bi_size = s->insert_bio_sectors << 9;
-   bch_bio_map(s->iop.bio, NULL);
-
bio_copy_data(s->cache_miss, s->iop.bio);
 
bio_put(s->cache_miss);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a31e55bcc4e5..56692a451aeb 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -205,9 +205,7 @@ static void __write_super(struct cache_sb *sb, struct bio *bio)
unsigned i;
 
bio->bi_iter.bi_sector  = SB_SECTOR;
-   bio->bi_iter.bi_size= SB_SIZE;
bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC|REQ_META);
-   bch_bio_map(bio, NULL);
 
out->offset = cpu_to_le64(sb->offset);
out->version= cpu_to_le64(sb->version);
@@ -249,7 +247,7 @@ void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent)
down(&dc->sb_write_mutex);
closure_init(cl, parent);
 
-   bio_reset(bio);
+   bio_reuse(bio, SB_SIZE);
bio_set_dev(bio, dc->bdev);
bio->bi_end_io  = write_bdev_super_endio;
bio->bi_private = dc;
@@ -298,7 +296,7 @@ void bcache_write_super(struct cache_set *c)
 
SET_CACHE_SYNC(&c->sb, CACHE_SYNC(&c->sb));
 
-   bio_reset(bio);
+   bio_reuse(bio, SB_SIZE);
bio_set_dev(bio, ca->bdev);
bio->bi_end_io  = write_super_endio;
bio->bi_private = ca;
-- 
2.17.1



[RFC] cleanup bcache bio handling v2

2018-06-13 Thread Christoph Hellwig
Hi all,

this series cleans up various places where bcache is way too intimate
with bio internals.  This is intended as a baseline for the multi-page
biovec work, which requires some nasty workarounds for the existing
code.

Note that I do not have a bcache test setup, so this will require
some careful actual testing with whatever test cases are used for
bcache.

Also the new bio_reuse helper should be useful at least for MD raid,
but that work is left for later.

Changes since v1:
 - restore bi_size properly


[PATCH 3/6] bcache: clean up bio reuse for struct moving_io

2018-06-13 Thread Christoph Hellwig
Instead of reinitializing the bio every time, we can call bio_reuse when
reusing it.  Also removes the remainder of the moving_init helper
to improve readability.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/bcache/movinggc.c | 40 +---
 1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index a24c3a95b2c0..890c055eb589 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -74,29 +74,20 @@ static void read_moving_endio(struct bio *bio)
bch_bbio_endio(io->op.c, bio, bio->bi_status, "reading data to move");
 }
 
-static void moving_init(struct moving_io *io)
-{
-   struct bio *bio = &io->bio.bio;
-
-   bio_init(bio, bio->bi_inline_vecs,
-            DIV_ROUND_UP(KEY_SIZE(&io->w->key), PAGE_SECTORS));
-   bio_get(bio);
-   bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
-
-   bio->bi_iter.bi_size    = KEY_SIZE(&io->w->key) << 9;
-   bio->bi_private = &io->cl;
-   bch_bio_map(bio, NULL);
-}
-
 static void write_moving(struct closure *cl)
 {
struct moving_io *io = container_of(cl, struct moving_io, cl);
struct data_insert_op *op = >op;
 
if (!op->status) {
-   moving_init(io);
+   struct bio *bio = &io->bio.bio;
+
+   bio_reuse(bio, KEY_SIZE(&io->w->key) << 9);
+   bio_get(bio);
+   bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
+   bio->bi_private = &io->cl;
+   bio->bi_iter.bi_sector = KEY_START(&io->w->key);

-   io->bio.bio.bi_iter.bi_sector = KEY_START(&io->w->key);
op->write_prio  = 1;
op->bio = &io->bio.bio;
 
@@ -156,12 +147,19 @@ static void read_moving(struct cache_set *c)
io->op.c= c;
io->op.wq   = c->moving_gc_wq;
 
-   moving_init(io);
bio = &io->bio.bio;
-
-   bio_set_op_attrs(bio, REQ_OP_READ, 0);
-   bio->bi_end_io  = read_moving_endio;
-
+   bio_init(bio, bio->bi_inline_vecs,
+   DIV_ROUND_UP(KEY_SIZE(&io->w->key), PAGE_SECTORS));
+   bio_get(bio);
+
+   bio->bi_iter.bi_size = KEY_SIZE(&io->w->key) << 9;
+   bio->bi_iter.bi_sector = KEY_START(&io->w->key);
+   bio->bi_opf = REQ_OP_READ;
+   bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
+   bio->bi_private = &io->cl;
+   bio->bi_end_io = read_moving_endio;
+
+   bch_bio_map(bio, NULL);
if (bch_bio_alloc_pages(bio, GFP_KERNEL))
goto err;
 
-- 
2.17.1



Re: [PATCH 1/6] block: add a bio_reuse helper

2018-06-13 Thread Christoph Hellwig
On Wed, Jun 13, 2018 at 04:54:41AM -0400, Kent Overstreet wrote:
> bi_size is not immutable though, it will usually be modified by drivers
> when you submit a bio.
> 
> I see what you're trying to do, but your approach is busted given the way the
> block layer works today. You'd have to save bio->bi_iter before submitting the
> bio and restore it afterwards for it to work.

For bi_size, agreed this needs fixing.  bi_sector is always restored
already by the callers, and the remaining fields are zeroed by bio_reset,
which does the right thing.


Re: [RFC] cleanup bcache bio handling

2018-06-13 Thread Christoph Hellwig
On Wed, Jun 13, 2018 at 07:06:41PM +0800, Ming Lei wrote:
> > before bio_alloc_pages) that can be switched to something that just
> > creates a single bvec.
> 
> Yes, multipage bvec shouldn't break any driver or fs.

It probably isn't broken, at least I didn't see assumptions of the same
number of segments.  However the current poking into the bio internals is
a bad idea for a couple of reasons.  First because it requires touching
bcache for any of these changes, second because it won't get merging of
pages into a single bio segment for bios built by bch_bio_map or
bch_bio_alloc_pages, and third bcache is the last user of
bio_for_each_chunk_all in your branch, which I'd like to kill off to
keep the number of iterators down.


Re: [RFC] cleanup bcache bio handling

2018-06-13 Thread Ming Lei
On Wed, Jun 13, 2018 at 05:58:01AM -0400, Kent Overstreet wrote:
> On Mon, Jun 11, 2018 at 09:48:00PM +0200, Christoph Hellwig wrote:
> > Hi all,
> > 
> > this series cleans up various places where bcache is way too intimate
> > with bio internals.  This is intended as a baseline for the multi-page
> > biovec work, which requires some nasty workarounds for the existing
> > code.
> > 
> > Note that I do not have a bcache test setup, so this will require
> > some careful actual testing with whatever test cases are used for
> > bcache.
> > 
> > Also the new bio_reuse helper should be useful at least for MD raid,
> > but that work is left for later.
> 
> Strong nak on the patch series (except for the patch to not clone in
> bch_data_verify, that patch looks fine).
> 
> Unless something has seriously changed with how multi-page bvecs work since I
> started that code nothing in bcache should be broken by it - Ming, can you
> confirm or deny if that's still correct?

Yes, the multi-page bvec implementation hasn't changed much, and I
previously ran xfstests on bcache0 with no regression observed.

> 
> It is true that bch_bio_map() will be suboptimal when multi page bvecs go in,
> but it won't be broken. The way it's used now is really conflating two
> different things, but where it's used for mapping a bio to a single buffer
> (i.e. not before bio_alloc_pages) that can be switched to something that
> just creates a single bvec.

Yes, multipage bvec shouldn't break any driver or fs.

Thanks,
Ming


Re: [RFC] cleanup bcache bio handling

2018-06-13 Thread Kent Overstreet
On Mon, Jun 11, 2018 at 09:48:00PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series cleans up various places where bcache is way too intimate
> with bio internals.  This is intended as a baseline for the multi-page
> biovec work, which requires some nasty workarounds for the existing
> code.
> 
> Note that I do not have a bcache test setup, so this will require
> some careful actual testing with whatever test cases are used for
> bcache.
> 
> Also the new bio_reuse helper should be useful at least for MD raid,
> but that work is left for later.

Strong nak on the patch series (except for the patch to not clone in
bch_data_verify, that patch looks fine).

Unless something has seriously changed with how multi-page bvecs work since I
started that code nothing in bcache should be broken by it - Ming, can you
confirm or deny if that's still correct?

It is true that bch_bio_map() will be suboptimal when multi page bvecs go in,
but it won't be broken. The way it's used now is really conflating two different
things, but where it's used for mapping a bio to a single buffer (i.e. not
before bio_alloc_pages) that can be switched to something that just creates a
single bvec.


Re: [PATCH 1/6] block: add a bio_reuse helper

2018-06-13 Thread Kent Overstreet
On Wed, Jun 13, 2018 at 09:32:04AM +0200, Christoph Hellwig wrote:
> On Tue, Jun 12, 2018 at 02:16:30AM -0400, Kent Overstreet wrote:
> > On Mon, Jun 11, 2018 at 09:48:01PM +0200, Christoph Hellwig wrote:
> > > This abstracts out a way to reuse a bio without destroying the
> > > data pointers.
> > 
> > What is the point of this? What "data pointers" does it not destroy?
> 
> It keeps bi_vcnt and bi_size intact in addition to bi_max_vecs/bi_io_vec
> already kept by the existing bio_reset.
> 
> The benefit is that you don't need to rebuild bi_vcnt, bi_io_vec and
> bi_size when reusing the bio.

bi_size is not immutable though, it will usually be modified by drivers when you
submit a bio.

I see what you're trying to do, but your approach is busted given the way the
block layer works today. You'd have to save bio->bi_iter before submitting the
bio and restore it afterwards for it to work.


Re: [PATCH 1/6] block: add a bio_reuse helper

2018-06-13 Thread Christoph Hellwig
On Tue, Jun 12, 2018 at 02:16:30AM -0400, Kent Overstreet wrote:
> On Mon, Jun 11, 2018 at 09:48:01PM +0200, Christoph Hellwig wrote:
> > This abstracts out a way to reuse a bio without destroying the
> > data pointers.
> 
> What is the point of this? What "data pointers" does it not destroy?

It keeps bi_vcnt and bi_size intact in addition to bi_max_vecs/bi_io_vec
already kept by the existing bio_reset.

The benefit is that you don't need to rebuild bi_vcnt, bi_io_vec and
bi_size when reusing the bio.
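
For illustration, a minimal caller-side sketch of the reuse pattern being
discussed (hypothetical usage, not part of the series):

static void resubmit_same_data(struct bio *bio, unsigned int size,
			       sector_t sector)
{
	/*
	 * Unlike bio_reset(), bio_reuse() keeps bi_vcnt and the bio_vecs
	 * and restores bi_iter.bi_size, so the data pages stay attached.
	 */
	bio_reuse(bio, size);
	bio->bi_iter.bi_sector = sector;	/* caller restores the sector */
	submit_bio(bio);
}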