Re: [RFC] cleanup bcache bio handling
On Wed, Jun 13, 2018 at 10:54:09AM -0400, Kent Overstreet wrote:
> On Wed, Jun 13, 2018 at 03:56:32PM +0200, Christoph Hellwig wrote:
> > On Wed, Jun 13, 2018 at 07:06:41PM +0800, Ming Lei wrote:
> > > > before bio_alloc_pages) that can be switched to something that just
> > > > creates a single bvec.
> > >
> > > Yes, multipage bvec shouldn't break any driver or fs.
> >
> > It probably isn't broken, at least I didn't see assumptions of the same
> > number of segments. However the current poking into the bio internals is
> > a bad idea for a couple of reasons. First because it requires touching
> > bcache for any of these changes, second because it won't get merging of
> > pages into a single bio segment for bios built by bch_bio_map or
> > bch_bio_alloc_pages, and third bcache is the last user of
> > bio_for_each_chunk_all in your branch, which I'd like to kill off to
> > keep the number of iterators down.
>
> Agreed about bio_for_each_chunk_all(), but I just looked at the patch that
> introduces them and it looks to me like there's no need, they should just be
> bio_for_each_segment_all().

Now we can't change the vector with bio_for_each_segment_all(), so
bio_for_each_chunk_all() has to be used. So it looks like it makes sense to
use bio_add_page() to remove bio_for_each_chunk_all().

Thanks,
Ming
[PATCH] dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup
Setting up zoned disks in a generic form is not so trivial. There is also
quite a bit of tribal knowledge with these devices which is not easy to find.
The currently supplied demo script works, but it is not generic enough to be
practical for Linux distributions or even for developers, who often move from
one kernel to another. This tries to put a bit of this tribal knowledge into
an initial udev rule for development, with the hope that Linux distributions
can later deploy it.

Three rules are added. One rule is optional for now; it should be extended
later to be more distribution-friendly, and then I think this may be ready
for consideration for integration in distributions.

1) scheduler setup
2) blacklist f2fs devices
3) run dmsetup for the rest of the devices

Note that this udev rule will not work well if you want to use a disk with
f2fs on part of the disk and another filesystem on another part of the disk.
That setup will require manual love, so such setups can use the same
blacklist in rule 2).

It is not widely known, for instance, that as of v4.16 it is mandated to use
either the deadline or mq-deadline scheduler for *all* SMR drives. It has
also been determined that the Linux kernel is not the place to set this up,
so a udev rule *is required* as per the latest discussions. This is the
first rule we add.

Furthermore, if you are *not* using f2fs you always have to run dmsetup.
dmsetup mappings do not persist, so you currently *always* have to run a
custom sort of script, which is not ideal for Linux distributions. We can
invert this logic into a udev rule and let users blacklist disks they know
they want to use f2fs on. This is the second, optional rule. This
blacklisting can be generalized further in the future with an exception list
file, for instance using INPUT{db} or the like.

The third and final rule then runs dmsetup for the rest of the disks, using
the disk serial number for the new device mapper name.
Note that it is currently easy for users to make a mistake and run mkfs on
the original disk, not the /dev/mapper/ device, for non-f2fs arrangements.
If that is done, experience shows things can easily fall apart with
alignment *eventually*. We have no generic way today to error out on this
condition and proactively prevent this.

Signed-off-by: Luis R. Rodriguez
---
 README                    | 10 +-
 udev/99-zoned-disks.rules | 78 +++
 2 files changed, 87 insertions(+), 1 deletion(-)
 create mode 100644 udev/99-zoned-disks.rules

diff --git a/README b/README
index 65e96c34fd04..f49541eaabc8 100644
--- a/README
+++ b/README
@@ -168,7 +168,15 @@ Options:
    reclaiming random zones if the percentage of free random data zones
    falls below .
 
-V. Example scripts
+V. Udev zone disk deployment
+
+
+A udev rule is provided which enables you to set the IO scheduler, blacklist
+drives on which dmsetup should not run, and run dmsetup for the rest of the
+zoned drives.
+If you use this udev rule the below script is not needed. Be sure to mkfs only
+on the resulting /dev/mapper/zone-$serial device you end up with.
+
+VI. Example scripts
 ==
 [[
diff --git a/udev/99-zoned-disks.rules b/udev/99-zoned-disks.rules
new file mode 100644
index ..e19b738dcc0e
--- /dev/null
+++ b/udev/99-zoned-disks.rules
@@ -0,0 +1,78 @@
+# To use zoned disks the first thing you need to do is:
+#
+# 1) Enable zoned disk support in your kernel
+# 2) Use the deadline or mq-deadline scheduler for it - mandated as of v4.16
+# 3) Blacklist devices dedicated for f2fs as of v4.10
+# 4) Run dmsetup on the other disks
+# 5) Create the filesystem -- NOTE: use mkfs /dev/mapper/zone-serial if
+#    you enabled dmsetup on the disk.
+# 6) Consider using the nofail mount option in case you run an unsupported kernel
+#
+# You can use this udev rules file for 2) 3) and 4). Further details below.
+#
+# 1) Enable zoned disk support in your kernel
+#
+#    o CONFIG_BLK_DEV_ZONED
+#    o CONFIG_DM_ZONED
+#
+# This will let the kernel actually see these devices, i.e., via fdisk /dev/sda
+# for instance. Run:
+#
+#	dmzadm --format /dev/sda
+
+# 2) Set deadline or mq-deadline for all disks which are zoned
+#
+# Zoned disks can only work with the deadline or mq-deadline scheduler. This is
+# mandated for all SMR drives since v4.16. It has been determined this must be
+# done through a udev rule, and the kernel should not set this up for disks.
+# This magic will have to live for *all* zoned disks.
+# XXX: what about distributions that want mq-deadline? Probably easy for now
+#      to assume deadline and later have a mapping file to enable
+#      mq-deadline for specific serial devices?
+ACTION=="add|change", KERNEL=="sd*[!0-9]", ATTRS{queue/zoned}=="host-managed", \
+	ATTR{queue/scheduler}="deadline"
+
+# 3) Blacklist f2fs devices as of v4.10
+# We don't have to run dmsetup on disks where you want to use f2fs, so you
+# can use this rule to skip
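For reference, the manual steps the rules above automate can be sketched as below, for a hypothetical host-managed disk /dev/sdX with serial ABC123. The device name, the serial, and the dm-zoned table line are illustrative assumptions, not taken from the patch; only the zone-$serial naming scheme comes from it.

```shell
# Manual equivalent of the rules (illustrative sketch, do not run blindly):
#
#   echo deadline > /sys/block/sdX/queue/scheduler       # rule 1
#   dmzadm --format /dev/sdX                             # once, before first use
#   echo "0 $(blockdev --getsz /dev/sdX) zoned /dev/sdX" | \
#       dmsetup create "zone-ABC123"                     # rule 3
#
# The udev rule derives the device-mapper name from the disk serial:
serial="ABC123"            # in udev this would come from the disk's serial
name="zone-${serial}"
echo "$name"               # device then appears as /dev/mapper/zone-ABC123
```

After that, mkfs must target /dev/mapper/$name, never the raw /dev/sdX, or the commit-message warning about alignment applies.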
Re: blktests block/019 leads to system hang
On 6/13/2018 10:41 AM, Keith Busch wrote:
> Thanks for the feedback!
>
> This test does indeed toggle the Link Control Link Disable bit to simulate
> the link failure. The PCIe specification specifically covers this case
> in Section 3.2.1, Data Link Control and Management State Machine Rules:
>
>   If the Link Disable bit has been Set by software, then the subsequent
>   transition to DL_Inactive must not be considered an error.

Forgot to mention... this PCIe requirement to not treat Link Disable = 1 as
an error is a requirement on the PCIe hardware and not on the platform side.
So if you want to test this PCIe spec requirement then you need a way to
ascertain whether the PCIe hardware or the platform is causing the error.

Additionally, setting Link Disable is not what is causing the error in this
specific case. The error is coming from a subsequent MMIO that is causing a
UR since the link is down.

-Austin

> So this test should suppress any Surprise Down Error events, but handling
> that particular event wasn't the intent of the test (and as you mentioned,
> it ought not occur anyway since the slot is HP Surprise capable).
>
> The test should not suppress reporting the Data Link Layer State Changed
> slot status. And while this doesn't trigger a Slot PDC status, triggering
> a DLLSC should occur since the Link Status DLLLA should go to 0 when the
> state machine goes from DL_Active to DL_Down, regardless of whether a
> Surprise Down Error was detected.
>
> The Linux PCIEHP driver handles a DLLSC link-down event the same as
> a presence detect remove event, and that's part of what this test was
> trying to cover.
>
Re: [PATCH] lightnvm: pblk: add asynchronous partial read
On 12.06.2018 10:09, Matias Bjørling wrote: On 06/12/2018 04:59 PM, Javier Gonzalez wrote: On 11 Jun 2018, at 22.53, Heiner Litz wrote: In the read path, partial reads are currently performed synchronously which affects performance for workloads that generate many partial reads. This patch adds an asynchronous partial read path as well as the required partial read ctx. Signed-off-by: Heiner Litz --- drivers/lightnvm/pblk-read.c | 179 --- drivers/lightnvm/pblk.h | 10 +++ 2 files changed, 128 insertions(+), 61 deletions(-) diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c index 7570ff6..026c708 100644 --- a/drivers/lightnvm/pblk-read.c +++ b/drivers/lightnvm/pblk-read.c @@ -231,74 +231,36 @@ static void pblk_end_io_read(struct nvm_rq *rqd) __pblk_end_io_read(pblk, rqd, true); } -static int pblk_partial_read(struct pblk *pblk, struct nvm_rq *rqd, - struct bio *orig_bio, unsigned int bio_init_idx, - unsigned long *read_bitmap) +static void pblk_end_partial_read(struct nvm_rq *rqd) { - struct pblk_sec_meta *meta_list = rqd->meta_list; - struct bio *new_bio; + struct pblk *pblk = rqd->private; + struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd); + struct pblk_pr_ctx *pr_ctx = r_ctx->private; + struct bio *new_bio = rqd->bio; + struct bio *bio = pr_ctx->orig_bio; struct bio_vec src_bv, dst_bv; - void *ppa_ptr = NULL; - void *src_p, *dst_p; - dma_addr_t dma_ppa_list = 0; - __le64 *lba_list_mem, *lba_list_media; - int nr_secs = rqd->nr_ppas; + struct pblk_sec_meta *meta_list = rqd->meta_list; + int bio_init_idx = pr_ctx->bio_init_idx; + unsigned long *read_bitmap = _ctx->bitmap; + int nr_secs = pr_ctx->orig_nr_secs; int nr_holes = nr_secs - bitmap_weight(read_bitmap, nr_secs); - int i, ret, hole; - - /* Re-use allocated memory for intermediate lbas */ - lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size); - lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size); - - new_bio = bio_alloc(GFP_KERNEL, nr_holes); - - if 
(pblk_bio_add_pages(pblk, new_bio, GFP_KERNEL, nr_holes)) - goto err; - - if (nr_holes != new_bio->bi_vcnt) { - pr_err("pblk: malformed bio\n"); - goto err; - } - - for (i = 0; i < nr_secs; i++) - lba_list_mem[i] = meta_list[i].lba; - - new_bio->bi_iter.bi_sector = 0; /* internal bio */ - bio_set_op_attrs(new_bio, REQ_OP_READ, 0); - - rqd->bio = new_bio; - rqd->nr_ppas = nr_holes; - rqd->flags = pblk_set_read_mode(pblk, PBLK_READ_RANDOM); - - if (unlikely(nr_holes == 1)) { - ppa_ptr = rqd->ppa_list; - dma_ppa_list = rqd->dma_ppa_list; - rqd->ppa_addr = rqd->ppa_list[0]; - } - - ret = pblk_submit_io_sync(pblk, rqd); - if (ret) { - bio_put(rqd->bio); - pr_err("pblk: sync read IO submission failed\n"); - goto err; - } - - if (rqd->error) { - atomic_long_inc(>read_failed); -#ifdef CONFIG_NVM_DEBUG - pblk_print_failed_rqd(pblk, rqd, rqd->error); -#endif - } + __le64 *lba_list_mem, *lba_list_media; + void *src_p, *dst_p; + int hole, i; if (unlikely(nr_holes == 1)) { struct ppa_addr ppa; ppa = rqd->ppa_addr; - rqd->ppa_list = ppa_ptr; - rqd->dma_ppa_list = dma_ppa_list; + rqd->ppa_list = pr_ctx->ppa_ptr; + rqd->dma_ppa_list = pr_ctx->dma_ppa_list; rqd->ppa_list[0] = ppa; } + /* Re-use allocated memory for intermediate lbas */ + lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size); + lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size); + for (i = 0; i < nr_secs; i++) { lba_list_media[i] = meta_list[i].lba; meta_list[i].lba = lba_list_mem[i]; @@ -316,7 +278,7 @@ static int pblk_partial_read(struct pblk *pblk, struct nvm_rq *rqd, meta_list[hole].lba = lba_list_media[i]; src_bv = new_bio->bi_io_vec[i++]; - dst_bv = orig_bio->bi_io_vec[bio_init_idx + hole]; + dst_bv = bio->bi_io_vec[bio_init_idx + hole]; src_p = kmap_atomic(src_bv.bv_page); dst_p = kmap_atomic(dst_bv.bv_page); @@ -334,19 +296,107 @@ static int pblk_partial_read(struct pblk *pblk, struct nvm_rq *rqd, } while (hole < nr_secs); bio_put(new_bio); + kfree(pr_ctx); /* restore original request 
*/ rqd->bio = NULL; rqd->nr_ppas = nr_secs; + bio_endio(bio); __pblk_end_io_read(pblk, rqd, false); - return NVM_IO_DONE; +} + +static int pblk_setup_partial_read(struct pblk *pblk, struct nvm_rq *rqd, + unsigned int bio_init_idx, + unsigned long *read_bitmap, + int nr_holes) +{ + struct pblk_sec_meta *meta_list = rqd->meta_list; + struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd); + struct pblk_pr_ctx *pr_ctx; + struct bio *new_bio, *bio = r_ctx->private; + __le64
Re: blktests block/019 leads to system hang
On 6/13/2018 10:41 AM, Keith Busch wrote:
> Thanks for the feedback!
>
> This test does indeed toggle the Link Control Link Disable bit to simulate
> the link failure. The PCIe specification specifically covers this case
> in Section 3.2.1, Data Link Control and Management State Machine Rules:
>
>   If the Link Disable bit has been Set by software, then the subsequent
>   transition to DL_Inactive must not be considered an error.
>
> So this test should suppress any Surprise Down Error events, but handling
> that particular event wasn't the intent of the test (and as you mentioned,
> it ought not occur anyway since the slot is HP Surprise capable).
>
> The test should not suppress reporting the Data Link Layer State Changed
> slot status. And while this doesn't trigger a Slot PDC status, triggering
> a DLLSC should occur since the Link Status DLLLA should go to 0 when the
> state machine goes from DL_Active to DL_Down, regardless of whether a
> Surprise Down Error was detected.
>
> The Linux PCIEHP driver handles a DLLSC link-down event the same as
> a presence detect remove event, and that's part of what this test was
> trying to cover.

Yes, the R730 could mask the error if the OS sets Data Link Layer State
Changed Enable = 1 and could let the OS handle the hot-plug event similar to
what is done for surprise removal. Current platform policy on the R730 is to
not do that and to only suppress errors related to physical surprise removal
(PDS = 0).

We'll probably forgo the option of suppressing any non-surprise-remove link
down errors even if the OS sets Data Link Layer State Changed Enable = 1,
and go straight to the containment error recovery model for DPC once the
architecture is finalized to handle these non-surprise-remove related
errors. In the meantime, it is expected (though not ideal) that this family
of servers will crash for this particular test. Ditto for the test that
disables the Memory Space Enable bit in the command register.

-Austin
Re: [Backport request] blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers
On Wed, Jun 13, 2018 at 03:36:42PM +, Bart Van Assche wrote: > Hello stable kernel maintainers, > > Please backport patch 327ea4adcfa3 ("blkdev_report_zones_ioctl(): > Use vmalloc() to allocate large buffers") to at least the v4.17.x and > v4.14.y stable kernel series. That patch fixes a bug introduced in > kernel v4.10. The entire patch is shown below. It's already in my queue to apply, and I normally wait until after it shows up in a -rc release before queueing it up. But I'll go do so now, thanks. greg k-h
Re: blktests block/019 leads to system hang
On Tue, Jun 12, 2018 at 04:41:54PM -0700, austin.bo...@dell.com wrote: > It looks like the test is setting the Link Disable bit. But this is not > a good simulation for hot-plug surprise removal testing or surprise link > down (SLD) testing, if that is the intent. One reason is that Link > Disable does not invoke SLD semantics per PCIe spec. This is somewhat > of a moot point in this case since the switch has Hot-Plug Surprise bit > set which also masks the SLD semantics in PCIe. > > Also, the Hot-Plug Capable + Surprise Hot-Plug bits set means the > platform can tolerate the case where "an adapter present in this slot > might be removed from the system without any prior notification". It > does not mean that a system can survive link down under any other > circumstances such as setting Link Disable or generating a Secondary Bus > Reset or a true surprise link down event. To the earlier point, I also > do not know of any way the OS can know a priori if the platform can > handle surprise link down outside of surprise remove case. We can look > at standardizing a way to do that if OSes find it useful to know. > > Relative to this particular error, Link Disable doesn't clear Presence > Detect State which would happen on a real Surprise Hot-Plug removal > event and this is probably why the system crashes. What will happen is > that after the link goes to disabled state, the ongoing I/O will cause > MMIO accesses on the drive and that will cause a UR which is an > uncorrectable PCIe error (ERR_FATAL on R730). The BIOS on the R730 is > surprise remove aware (Surprise Hot-Plug = 1) and so it will check if > the device is still present by checking Presence Detect State. If the > device is not present it will mask the error and let the OS handle the > device removal due to hot-plug interrupt(s). If the device is present, > as in this case, then the BIOS will escalate to OS as a fatal NMI > (current R730 platform policy is to only mask errors due to removal). 
> For future, these servers may report these sort of errors as recoverable
> via the GHES structures in APEI which will allow the OS to recover from
> this non-surprise remove class of error as well. In the (hopefully
> near) future, the industry will move to DPC as the framework for this
> sort of generic PCIe error handling/recovery but there are architectural
> changes needed that are currently being defined in the relevant
> standards bodies. Once the architecture is defined it can be
> implemented and tested to verify these sort of test cases pass.

Thanks for the feedback!

This test does indeed toggle the Link Control Link Disable bit to simulate
the link failure. The PCIe specification specifically covers this case
in Section 3.2.1, Data Link Control and Management State Machine Rules:

  If the Link Disable bit has been Set by software, then the subsequent
  transition to DL_Inactive must not be considered an error.

So this test should suppress any Surprise Down Error events, but handling
that particular event wasn't the intent of the test (and as you mentioned,
it ought not occur anyway since the slot is HP Surprise capable).

The test should not suppress reporting the Data Link Layer State Changed
slot status. And while this doesn't trigger a Slot PDC status, triggering
a DLLSC should occur since the Link Status DLLLA should go to 0 when the
state machine goes from DL_Active to DL_Down, regardless of whether a
Surprise Down Error was detected.

The Linux PCIEHP driver handles a DLLSC link-down event the same as
a presence detect remove event, and that's part of what this test was
trying to cover.
[Backport request] blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers
Hello stable kernel maintainers,

Please backport patch 327ea4adcfa3 ("blkdev_report_zones_ioctl(): Use
vmalloc() to allocate large buffers") to at least the v4.17.x and v4.14.y
stable kernel series. That patch fixes a bug introduced in kernel v4.10.
The entire patch is shown below.

Thanks,

Bart.

commit cf0110698846fc5a93df89eb20ac7cc70a860c17
Author: Bart Van Assche
Date:   Tue May 22 08:27:22 2018 -0700

    blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers

    Avoid that complaints similar to the following appear in the kernel log
    if the number of zones is sufficiently large:

    fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
    Call Trace:
     dump_stack+0x63/0x88
     warn_alloc+0xf5/0x190
     __alloc_pages_slowpath+0x8f0/0xb0d
     __alloc_pages_nodemask+0x242/0x260
     alloc_pages_current+0x6a/0xb0
     kmalloc_order+0x18/0x50
     kmalloc_order_trace+0x26/0xb0
     __kmalloc+0x20e/0x220
     blkdev_report_zones_ioctl+0xa5/0x1a0
     blkdev_ioctl+0x1ba/0x930
     block_ioctl+0x41/0x50
     do_vfs_ioctl+0xaa/0x610
     SyS_ioctl+0x79/0x90
     do_syscall_64+0x79/0x1b0
     entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
    Signed-off-by: Bart Van Assche
    Cc: Shaun Tancheff
    Cc: Damien Le Moal
    Cc: Christoph Hellwig
    Cc: Martin K. Petersen
    Cc: Hannes Reinecke
    Cc:
    Signed-off-by: Jens Axboe

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 08e84ef2bc05..3d08dc84db16 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -328,7 +328,11 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
 	if (!rep.nr_zones)
 		return -EINVAL;
 
-	zones = kcalloc(rep.nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
+	if (rep.nr_zones > INT_MAX / sizeof(struct blk_zone))
+		return -ERANGE;
+
+	zones = kvmalloc(rep.nr_zones * sizeof(struct blk_zone),
+			 GFP_KERNEL | __GFP_ZERO);
 	if (!zones)
 		return -ENOMEM;
 
@@ -350,7 +354,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
 	}
 
 out:
-	kfree(zones);
+	kvfree(zones);
 
 	return ret;
 }
Re: [PATCH] blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers
On 6/13/18 9:20 AM, Bart Van Assche wrote:
> On 05/22/18 10:58, Jens Axboe wrote:
>> On 5/22/18 9:27 AM, Bart Van Assche wrote:
>>> Avoid that complaints similar to the following appear in the kernel log
>>> if the number of zones is sufficiently large:
>>>
>>>    fio: page allocation failure: order:9,
>>> mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
>>>    Call Trace:
>>>    dump_stack+0x63/0x88
>>>    warn_alloc+0xf5/0x190
>>>    __alloc_pages_slowpath+0x8f0/0xb0d
>>>    __alloc_pages_nodemask+0x242/0x260
>>>    alloc_pages_current+0x6a/0xb0
>>>    kmalloc_order+0x18/0x50
>>>    kmalloc_order_trace+0x26/0xb0
>>>    __kmalloc+0x20e/0x220
>>>    blkdev_report_zones_ioctl+0xa5/0x1a0
>>>    blkdev_ioctl+0x1ba/0x930
>>>    block_ioctl+0x41/0x50
>>>    do_vfs_ioctl+0xaa/0x610
>>>    SyS_ioctl+0x79/0x90
>>>    do_syscall_64+0x79/0x1b0
>>>    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>
>> Applied, thanks.
>
> Thank you Jens. This patch (commit 327ea4adcfa3) is now in Linus' tree.
> We would like this patch to appear in the v4.14 and v4.17 kernel series
> too. Since this patch does not have a stable tag, how do you want us
> to proceed? Do you want to send this patch yourself to Greg or do you
> rather expect us to do that?

You can do it - just send an email to greg/stable asking for that commit
sha to be marked for stable.

-- 
Jens Axboe
Re: [PATCH] blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers
On 05/22/18 10:58, Jens Axboe wrote: On 5/22/18 9:27 AM, Bart Van Assche wrote: Avoid that complaints similar to the following appear in the kernel log if the number of zones is sufficiently large: fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null) Call Trace: dump_stack+0x63/0x88 warn_alloc+0xf5/0x190 __alloc_pages_slowpath+0x8f0/0xb0d __alloc_pages_nodemask+0x242/0x260 alloc_pages_current+0x6a/0xb0 kmalloc_order+0x18/0x50 kmalloc_order_trace+0x26/0xb0 __kmalloc+0x20e/0x220 blkdev_report_zones_ioctl+0xa5/0x1a0 blkdev_ioctl+0x1ba/0x930 block_ioctl+0x41/0x50 do_vfs_ioctl+0xaa/0x610 SyS_ioctl+0x79/0x90 do_syscall_64+0x79/0x1b0 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Applied, thanks. Thank you Jens. This patch (commit 327ea4adcfa3) is now in Linus tree. We would like this patch to appear in the v4.14 and v4.17 kernel series too. Since this patch does not have a stable tag, how do you want us to proceed? Do you want to send this patch yourself to Greg or do you rather expect us to do that? Thanks, Bart.
Re: [RFC] cleanup bcache bio handling
On Wed, Jun 13, 2018 at 03:56:32PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 13, 2018 at 07:06:41PM +0800, Ming Lei wrote:
> > > before bio_alloc_pages) that can be switched to something that just
> > > creates a single bvec.
> >
> > Yes, multipage bvec shouldn't break any driver or fs.
>
> It probably isn't broken, at least I didn't see assumptions of the same
> number of segments. However the current poking into the bio internals is
> a bad idea for a couple of reasons. First because it requires touching
> bcache for any of these changes, second because it won't get merging of
> pages into a single bio segment for bios built by bch_bio_map or
> bch_bio_alloc_pages, and third bcache is the last user of
> bio_for_each_chunk_all in your branch, which I'd like to kill off to
> keep the number of iterators down.

Agreed about bio_for_each_chunk_all(), but I just looked at the patch that
introduces them and it looks to me like there's no need, they should just be
bio_for_each_segment_all().

Converting bch_bio_map() and bch_bio_alloc_pages() to bio_add_page() is fine
by me, but your patch series doesn't do any of those actual cleanups: your
description of the patch series does not actually match what it does.
Re: [PATCH 1/6] block: add a bio_reuse helper
On Wed, Jun 13, 2018 at 03:59:15PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 13, 2018 at 04:54:41AM -0400, Kent Overstreet wrote:
> > bi_size is not immutable though, it will usually be modified by drivers
> > when you submit a bio.
> >
> > I see what you're trying to do, but your approach is busted given the
> > way the block layer works today. You'd have to save bio->bi_iter before
> > submitting the bio and restore it afterwards for it to work.
>
> For bi_size, agreed this needs fixing. bi_sector is always restored
> already by the callers, and the remaining fields are zeroed by bio_reset,
> which does the right thing.

This still shouldn't be a new helper. If the caller is restoring bi_sector
they can restore bi_size as well, and restoring only part of bi_iter should
not be encouraged, as it's not safe in general. It is quite reasonable to
want to restore a bio to the state it was in pre-submission (e.g. for
bouncing), and what you're doing is definitely _not_ safe in general.
[PATCH 1/6] block: add a bio_reuse helper
This abstracts out a way to reuse a bio without destroying the bio vectors
containing the data.

Signed-off-by: Christoph Hellwig
---
 block/bio.c         | 19 +++++++++++++++++++
 include/linux/bio.h |  1 +
 2 files changed, 20 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 70c4e1b6dd45..67393ab9abe0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -308,6 +308,25 @@ void bio_reset(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_reset);
 
+/**
+ * bio_reuse - prepare a bio for reuse
+ * @bio: bio to reuse
+ * @size: size of the bio
+ *
+ * Prepares an already set up and possibly used bio for reusing it another
+ * time. Compared to bio_reset() this preserves the setup of the bio
+ * vectors containing the data.
+ */
+void bio_reuse(struct bio *bio, unsigned size)
+{
+	unsigned short vcnt = bio->bi_vcnt;
+
+	bio_reset(bio);
+	bio->bi_iter.bi_size = size;
+	bio->bi_vcnt = vcnt;
+}
+EXPORT_SYMBOL_GPL(bio_reuse);
+
 static struct bio *__bio_chain_endio(struct bio *bio)
 {
 	struct bio *parent = bio->bi_private;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f08f5fe7bd08..d114ccc4bac2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -475,6 +475,7 @@ extern void bio_init(struct bio *bio, struct bio_vec *table,
 		     unsigned short max_vecs);
 extern void bio_uninit(struct bio *);
 extern void bio_reset(struct bio *);
+void bio_reuse(struct bio *, unsigned int);
 void bio_chain(struct bio *, struct bio *);
 extern int bio_add_page(struct bio *, struct page *, unsigned int,
 			unsigned int);
-- 
2.17.1
[PATCH 4/6] bcache: clean up bio reuse for struct dirty_io
Instead of reinitializing the bio every time, we can call bio_reuse when
reusing it. This also moves the private data initialization out of
dirty_init, which is renamed to suit the remaining functionality.

Signed-off-by: Christoph Hellwig
---
 drivers/md/bcache/writeback.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index ad45ebe1a74b..0b6d07eab87c 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -179,19 +179,12 @@ struct dirty_io {
 	struct bio		bio;
 };
 
-static void dirty_init(struct keybuf_key *w)
+static void dirty_init_prio(struct keybuf_key *w)
 {
 	struct dirty_io *io = w->private;
-	struct bio *bio = &io->bio;
 
-	bio_init(bio, bio->bi_inline_vecs,
-		 DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS));
 	if (!io->dc->writeback_percent)
-		bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
-
-	bio->bi_iter.bi_size	= KEY_SIZE(&w->key) << 9;
-	bio->bi_private		= w;
-	bch_bio_map(bio, NULL);
+		bio_set_prio(&io->bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
 }
 
 static void dirty_io_destructor(struct closure *cl)
@@ -285,10 +278,12 @@ static void write_dirty(struct closure *cl)
 	 * to clean up.
 	 */
 	if (KEY_DIRTY(&w->key)) {
-		dirty_init(w);
-		bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
+		bio_reuse(&io->bio, KEY_SIZE(&w->key) << 9);
+		dirty_init_prio(w);
+		io->bio.bi_opf = REQ_OP_WRITE;
 		io->bio.bi_iter.bi_sector = KEY_START(&w->key);
 		bio_set_dev(&io->bio, io->dc->bdev);
+		io->bio.bi_private	= w;
 		io->bio.bi_end_io	= dirty_endio;
 
 		/* I/O request sent to backing device */
@@ -399,13 +394,18 @@ static void read_dirty(struct cached_dev *dc)
 		io->dc			= dc;
 		io->sequence		= sequence++;
 
-		dirty_init(w);
-		bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
+		bio_init(&io->bio, io->bio.bi_inline_vecs,
+			 DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS));
+		dirty_init_prio(w);
+		io->bio.bi_opf = REQ_OP_READ;
+		io->bio.bi_iter.bi_size	= KEY_SIZE(&w->key) << 9;
 		io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
 		bio_set_dev(&io->bio,
 			    PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
+		io->bio.bi_private	= w;
 		io->bio.bi_end_io	= read_dirty_endio;
+		bch_bio_map(&io->bio, NULL);
 
 		if (bch_bio_alloc_pages(&io->bio, GFP_KERNEL))
 			goto err_free;
-- 
2.17.1
[PATCH 5/6] bcache: don't clone bio in bch_data_verify
We immediately overwrite the biovec array, so instead just allocate a new
bio and copy over the disk, sector and size.

Signed-off-by: Christoph Hellwig
---
 drivers/md/bcache/debug.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index d030ce3025a6..04d146711950 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -110,11 +110,15 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	struct bio_vec bv, cbv;
 	struct bvec_iter iter, citer = { 0 };
 
-	check = bio_clone_kmalloc(bio, GFP_NOIO);
+	check = bio_kmalloc(GFP_NOIO, bio_segments(bio));
 	if (!check)
 		return;
+	check->bi_disk = bio->bi_disk;
 	check->bi_opf = REQ_OP_READ;
+	check->bi_iter.bi_sector = bio->bi_iter.bi_sector;
+	check->bi_iter.bi_size = bio->bi_iter.bi_size;
+	bch_bio_map(check, NULL);
 
 	if (bch_bio_alloc_pages(check, GFP_NOIO))
 		goto out_put;
-- 
2.17.1
[PATCH 6/6] bcache: use bio_add_page instead of open coded bio manipulation
Let bch_bio_alloc_pages and bch_bio_map set up the bio vec information and bi_size. This also means no additional bch_bio_map call with a NULL argument is needed before bch_bio_alloc_pages.

Signed-off-by: Christoph Hellwig
---
 drivers/md/bcache/btree.c     | 16 +++-
 drivers/md/bcache/debug.c     |  7 +---
 drivers/md/bcache/journal.c   |  8 +---
 drivers/md/bcache/movinggc.c  |  4 +-
 drivers/md/bcache/request.c   |  5 +--
 drivers/md/bcache/super.c     |  8 +---
 drivers/md/bcache/util.c      | 74 +++
 drivers/md/bcache/util.h      |  4 +-
 drivers/md/bcache/writeback.c |  4 +-
 9 files changed, 52 insertions(+), 78 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2a0968c04e21..0f585fa9051f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -298,12 +298,11 @@ static void bch_btree_node_read(struct btree *b)
 	closure_init_stack(&cl);
 
 	bio = bch_bbio_alloc(b->c);
-	bio->bi_iter.bi_size = KEY_SIZE(&b->key) << 9;
 	bio->bi_end_io	= btree_node_read_endio;
 	bio->bi_private	= &cl;
 	bio->bi_opf = REQ_OP_READ | REQ_META;
 
-	bch_bio_map(bio, b->keys.set[0].data);
+	bch_bio_map(bio, b->keys.set[0].data, KEY_SIZE(&b->key) << 9);
 
 	bch_submit_bbio(bio, b->c, &b->key, 0);
 	closure_sync(&cl);
@@ -386,19 +385,19 @@ static void do_btree_node_write(struct btree *b)
 {
 	struct closure *cl = &b->io;
 	struct bset *i = btree_bset_last(b);
+	size_t size;
 	BKEY_PADDED(key) k;
 
 	i->version	= BCACHE_BSET_VERSION;
 	i->csum		= btree_csum_set(b, i);
 
+	size = roundup(set_bytes(i), block_bytes(b->c));
+
 	BUG_ON(b->bio);
 	b->bio = bch_bbio_alloc(b->c);
-
 	b->bio->bi_end_io	= btree_node_write_endio;
 	b->bio->bi_private	= cl;
-	b->bio->bi_iter.bi_size	= roundup(set_bytes(i), block_bytes(b->c));
 	b->bio->bi_opf		= REQ_OP_WRITE | REQ_META | REQ_FUA;
-	bch_bio_map(b->bio, i);
 
 	/*
 	 * If we're appending to a leaf node, we don't technically need FUA -
@@ -419,7 +418,7 @@ static void do_btree_node_write(struct btree *b)
 	SET_PTR_OFFSET(&k.key, 0, PTR_OFFSET(&k.key, 0) +
		       bset_sector_offset(&b->keys, i));
 
-	if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
+	if (!bch_bio_alloc_pages(b->bio, size, __GFP_NOWARN | GFP_NOWAIT)) {
 		int j;
 		struct bio_vec *bv;
 		void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
@@ -432,10 +431,7 @@ static void do_btree_node_write(struct btree *b)
 		continue_at(cl, btree_node_write_done, NULL);
 	} else {
-		/* No problem for multipage bvec since the bio is just allocated */
-		b->bio->bi_vcnt = 0;
-		bch_bio_map(b->bio, i);
-
+		bch_bio_map(b->bio, i, size);
 		bch_submit_bbio(b->bio, b->c, &k.key, 0);
 
 		closure_sync(cl);
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 04d146711950..b089597fb607 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -52,9 +52,8 @@ void bch_btree_verify(struct btree *b)
 	bio = bch_bbio_alloc(b->c);
 	bio_set_dev(bio, PTR_CACHE(b->c, &b->key, 0)->bdev);
 	bio->bi_iter.bi_sector	= PTR_OFFSET(&b->key, 0);
-	bio->bi_iter.bi_size	= KEY_SIZE(&b->key) << 9;
 	bio->bi_opf		= REQ_OP_READ | REQ_META;
-	bch_bio_map(bio, sorted);
+	bch_bio_map(bio, sorted, KEY_SIZE(&b->key) << 9);
 
 	submit_bio_wait(bio);
 	bch_bbio_free(bio, b->c);
@@ -116,10 +115,8 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	check->bi_disk = bio->bi_disk;
 	check->bi_opf = REQ_OP_READ;
 	check->bi_iter.bi_sector = bio->bi_iter.bi_sector;
-	check->bi_iter.bi_size = bio->bi_iter.bi_size;
-	bch_bio_map(check, NULL);
-	if (bch_bio_alloc_pages(check, GFP_NOIO))
+	if (bch_bio_alloc_pages(check, bio->bi_iter.bi_size, GFP_NOIO))
 		goto out_put;
 
 	submit_bio_wait(check);
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 18f1b5239620..44c3fc5f3b0a 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -55,12 +55,10 @@ reread:		left = ca->sb.bucket_size - offset;
 		bio_reset(bio);
 		bio->bi_iter.bi_sector	= bucket + offset;
 		bio_set_dev(bio, ca->bdev);
-		bio->bi_iter.bi_size	= len << 9;
-
 		bio->bi_end_io	= journal_read_endio;
 		bio->bi_private = &cl;
 		bio_set_op_attrs(bio, REQ_OP_READ, 0);
-		bch_bio_map(bio, data);
+		bch_bio_map(bio, data, len << 9);
 
 		closure_bio_submit(ca->set, bio, &cl);
 		closure_sync(&cl);
@@ -652,13 +650,11 @@ static void
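The core of this patch is bch_bio_map() growing an explicit size argument instead of relying on the caller having stored the length in bi_iter.bi_size first. As a rough user-space illustration of that interface (mock types and a hypothetical per-page mapping loop; the real definitions live in <linux/bio.h> and drivers/md/bcache/util.c), the helper could look like:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Mock of the few fields the helper touches. */
struct bio_vec {
	void *bv_page;		/* stand-in for struct page * */
	unsigned int bv_len;
	unsigned int bv_offset;
};

struct bio {
	unsigned int bi_size;
	unsigned short bi_vcnt;
	struct bio_vec bi_io_vec[8];
};

/*
 * Sketch of the patched interface: map a virtually contiguous buffer of
 * 'size' bytes into the bvec table, one page-bounded segment at a time,
 * and record the size from the new explicit argument.
 */
static void bch_bio_map(struct bio *bio, void *base, size_t size)
{
	bio->bi_size = size;
	bio->bi_vcnt = 0;

	while (size) {
		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt++];
		size_t off = (uintptr_t)base & (PAGE_SIZE - 1);
		size_t len = PAGE_SIZE - off < size ? PAGE_SIZE - off : size;

		bv->bv_page = (void *)((uintptr_t)base & PAGE_MASK);
		bv->bv_offset = off;
		bv->bv_len = len;

		base = (char *)base + len;
		size -= len;
	}
}
```

This keeps the callers free of direct bi_iter.bi_size pokes, which is exactly what the diff above removes.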
[PATCH 2/6] bcache: use bio_reuse instead of bio_reinit where applicable
Use the bio_reuse helper instead of rebuilding the bio_vecs and size for bios that get reused for the same data.

Signed-off-by: Christoph Hellwig
---
 drivers/md/bcache/request.c | 5 +
 drivers/md/bcache/super.c   | 6 ++
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ae67f5fa8047..3e699be2e79b 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -810,12 +810,9 @@ static void cached_dev_read_done(struct closure *cl)
 	 */
 
 	if (s->iop.bio) {
-		bio_reset(s->iop.bio);
+		bio_reuse(s->iop.bio, s->insert_bio_sectors << 9);
 		s->iop.bio->bi_iter.bi_sector = s->cache_miss->bi_iter.bi_sector;
 		bio_copy_dev(s->iop.bio, s->cache_miss);
-		s->iop.bio->bi_iter.bi_size = s->insert_bio_sectors << 9;
-		bch_bio_map(s->iop.bio, NULL);
-
 		bio_copy_data(s->cache_miss, s->iop.bio);
 
 		bio_put(s->cache_miss);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a31e55bcc4e5..56692a451aeb 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -205,9 +205,7 @@ static void __write_super(struct cache_sb *sb, struct bio *bio)
 	unsigned i;
 
 	bio->bi_iter.bi_sector	= SB_SECTOR;
-	bio->bi_iter.bi_size	= SB_SIZE;
 	bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC|REQ_META);
-	bch_bio_map(bio, NULL);
 
 	out->offset		= cpu_to_le64(sb->offset);
 	out->version		= cpu_to_le64(sb->version);
@@ -249,7 +247,7 @@ void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent)
 	down(&dc->sb_write_mutex);
 	closure_init(cl, parent);
 
-	bio_reset(bio);
+	bio_reuse(bio, SB_SIZE);
 	bio_set_dev(bio, dc->bdev);
 	bio->bi_end_io	= write_bdev_super_endio;
 	bio->bi_private = dc;
@@ -298,7 +296,7 @@ void bcache_write_super(struct cache_set *c)
 		SET_CACHE_SYNC(&c->sb, CACHE_SYNC(&c->sb));
 
-		bio_reset(bio);
+		bio_reuse(bio, SB_SIZE);
 		bio_set_dev(bio, ca->bdev);
 		bio->bi_end_io	= write_super_endio;
 		bio->bi_private = ca;
-- 
2.17.1
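To make the callers above concrete: the idea of bio_reuse (as discussed later in this thread) is to behave like bio_reset() while preserving the bvec table and re-establishing the payload size. The following user-space sketch uses mock structures and is my reading of the thread, not the posted patch 1/6:

```c
#include <assert.h>

/* Minimal stand-ins for the kernel structures involved. */
struct bvec_iter {
	unsigned long bi_sector;
	unsigned int bi_size;
	unsigned int bi_idx;
	unsigned int bi_bvec_done;
};

struct bio_vec { void *bv_page; unsigned int bv_len; unsigned int bv_offset; };

struct bio {
	struct bvec_iter bi_iter;
	unsigned int bi_opf;
	unsigned short bi_vcnt;
	struct bio_vec *bi_io_vec;	/* bvec table built before first use */
};

/*
 * Sketch: like bio_reset(), wipe the iterator and flags so the bio can
 * be resubmitted, but keep bi_vcnt/bi_io_vec and restore bi_size, which
 * the driver will typically have consumed down to zero while iterating
 * the bio on its first submission.
 */
static void bio_reuse(struct bio *bio, unsigned int size)
{
	struct bvec_iter zeroed = { 0, 0, 0, 0 };

	bio->bi_iter = zeroed;		/* rewind the iterator */
	bio->bi_opf = 0;		/* caller sets op/flags again */
	bio->bi_iter.bi_size = size;	/* restore the payload size */
	/* bi_vcnt and bi_io_vec are deliberately left untouched */
}
```

With this shape, the removed `bi_iter.bi_size = ...; bch_bio_map(bio, NULL);` pairs in the diff collapse into a single bio_reuse() call.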
[RFC] cleanup bcache bio handling v2
Hi all,

this series cleans up various places where bcache is way too intimate with bio internals. This is intended as a baseline for the multi-page biovec work, which requires some nasty workarounds for the existing code.

Note that I do not have a bcache test setup, so this will require some careful actual testing with whatever test cases are used for bcache.

Also the new bio_reuse helper should be useful at least for MD raid, but that work is left for later.

Changes since v1:
 - restore bi_size properly
[PATCH 3/6] bcache: clean up bio reuse for struct moving_io
Instead of reinitializing the bio every time, we can call bio_reuse when reusing it. Also remove the remainder of the moving_init helper to improve readability.

Signed-off-by: Christoph Hellwig
---
 drivers/md/bcache/movinggc.c | 40 +---
 1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index a24c3a95b2c0..890c055eb589 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -74,29 +74,20 @@ static void read_moving_endio(struct bio *bio)
 	bch_bbio_endio(io->op.c, bio, bio->bi_status, "reading data to move");
 }
 
-static void moving_init(struct moving_io *io)
-{
-	struct bio *bio = &io->bio.bio;
-
-	bio_init(bio, bio->bi_inline_vecs,
-		 DIV_ROUND_UP(KEY_SIZE(&io->w->key), PAGE_SECTORS));
-	bio_get(bio);
-	bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
-
-	bio->bi_iter.bi_size	= KEY_SIZE(&io->w->key) << 9;
-	bio->bi_private		= &io->cl;
-	bch_bio_map(bio, NULL);
-}
-
 static void write_moving(struct closure *cl)
 {
 	struct moving_io *io = container_of(cl, struct moving_io, cl);
 	struct data_insert_op *op = &io->op;
 
 	if (!op->status) {
-		moving_init(io);
+		struct bio *bio = &io->bio.bio;
+
+		bio_reuse(bio, KEY_SIZE(&io->w->key) << 9);
+		bio_get(bio);
+		bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
+		bio->bi_private = &io->cl;
+		bio->bi_iter.bi_sector = KEY_START(&io->w->key);
 
-		io->bio.bio.bi_iter.bi_sector = KEY_START(&io->w->key);
 		op->write_prio = 1;
 		op->bio = &io->bio.bio;
@@ -156,12 +147,19 @@ static void read_moving(struct cache_set *c)
 		io->op.c	= c;
 		io->op.wq	= c->moving_gc_wq;
 
-		moving_init(io);
 		bio = &io->bio.bio;
-
-		bio_set_op_attrs(bio, REQ_OP_READ, 0);
-		bio->bi_end_io	= read_moving_endio;
-
+		bio_init(bio, bio->bi_inline_vecs,
+			 DIV_ROUND_UP(KEY_SIZE(&io->w->key), PAGE_SECTORS));
+		bio_get(bio);
+
+		bio->bi_iter.bi_size = KEY_SIZE(&io->w->key) << 9;
+		bio->bi_iter.bi_sector = KEY_START(&io->w->key);
+		bio->bi_opf = REQ_OP_READ;
+		bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
+		bio->bi_private = &io->cl;
+		bio->bi_end_io = read_moving_endio;
+
+		bch_bio_map(bio, NULL);
 		if (bch_bio_alloc_pages(bio, GFP_KERNEL))
 			goto err;
-- 
2.17.1
Re: [PATCH 1/6] block: add a bio_reuse helper
On Wed, Jun 13, 2018 at 04:54:41AM -0400, Kent Overstreet wrote: > bi_size is not immutable though, it will usually be modified by drivers when > you > submit a bio. > > I see what you're trying to do, but your approach is busted given the way the > block layer works today. You'd have to save bio->bi_iter before submitting the > bio and restore it afterwards for it to work. For bi_size, agreed this needs fixing. bi_sector is always restored already by the callers, and the remaining fields are zeroed by bio_reset, which does the right thing.
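The save/restore discipline being discussed here is simple to state in code. This is a user-space sketch with mock types (toy_submit and submit_and_restore are hypothetical names); it only illustrates why bi_iter must be snapshotted around submission if the bio is to be walked again:

```c
#include <assert.h>

/* Mock of the iterator a driver consumes while completing a bio. */
struct bvec_iter {
	unsigned long bi_sector;
	unsigned int bi_size;
	unsigned int bi_idx;
};

struct bio { struct bvec_iter bi_iter; };

/* Toy driver: completing a bio advances its iterator, which is why
 * bi_size cannot be assumed intact after submission. */
static void toy_submit(struct bio *bio)
{
	bio->bi_iter.bi_sector += bio->bi_iter.bi_size >> 9;
	bio->bi_iter.bi_idx += 2;
	bio->bi_iter.bi_size = 0;
}

/* The pattern under discussion: snapshot bi_iter before submitting and
 * restore it afterwards so the bio can be iterated (or resubmitted). */
static void submit_and_restore(struct bio *bio)
{
	struct bvec_iter saved = bio->bi_iter;

	toy_submit(bio);
	bio->bi_iter = saved;
}
```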
Re: [RFC] cleanup bcache bio handling
On Wed, Jun 13, 2018 at 07:06:41PM +0800, Ming Lei wrote: > > before bio_alloc_pages) that can be switched to something that just creates > > a > > single bvec. > > Yes, multipage bvec shouldn't break any driver or fs. It probably isn't broken; at least I didn't see assumptions of the same number of segments. However, the current poking into the bio internals is a bad idea for a couple of reasons. First, it requires touching bcache for any of these changes; second, it won't get merging of pages into a single bio segment for bios built by bch_bio_map or bch_bio_alloc_pages; and third, bcache is the last user of bio_for_each_chunk_all in your branch, which I'd like to kill off to keep the number of iterators down.
Re: [RFC] cleanup bcache bio handling
On Wed, Jun 13, 2018 at 05:58:01AM -0400, Kent Overstreet wrote: > On Mon, Jun 11, 2018 at 09:48:00PM +0200, Christoph Hellwig wrote: > > Hi all, > > > > this series cleans up various places where bcache is way too intimate > > with bio internals. This is intended as a baseline for the multi-page > > biovec work, which requires some nasty workarounds for the existing > > code. > > > > Note that I do not have a bcache test setup, so this will require > > some careful actual testing with whatever test cases are used for > > bcache. > > > > Also the new bio_reused helper should be useful at least for MD raid, > > but that work is left for later. > > Strong nak on the patch series (except for the patch to not clone in > bch_data_verify, that patch looks fine). > > Unless something has seriously changed with how multi-page bvecs work since I > started that code nothing in bcache should be broken by it - Ming, can you > confirm or deny if that's still correct? Yes, the multi-page bvec implementation hasn't changed much, and I previously ran xfstests on bcache0 with no regression observed. > > It is true that bch_bio_map() will be suboptimal when multi page bvecs go in, > but it won't be broken. The way it's used now is really conflating two > different > things, but where it's used for mapping a bio to a single buffer (i.e. not > before bio_alloc_pages) that can be switched to something that just creates a > single bvec. Yes, multipage bvec shouldn't break any driver or fs. Thanks, Ming
Re: [RFC] cleanup bcache bio handling
On Mon, Jun 11, 2018 at 09:48:00PM +0200, Christoph Hellwig wrote: > Hi all, > > this series cleans up various places where bcache is way too intimate > with bio internals. This is intended as a baseline for the multi-page > biovec work, which requires some nasty workarounds for the existing > code. > > Note that I do not have a bcache test setup, so this will require > some careful actual testing with whatever test cases are used for > bcache. > > Also the new bio_reused helper should be useful at least for MD raid, > but that work is left for later. Strong nak on the patch series (except for the patch to not clone in bch_data_verify, that patch looks fine). Unless something has seriously changed with how multi-page bvecs work since I started that code nothing in bcache should be broken by it - Ming, can you confirm or deny if that's still correct? It is true that bch_bio_map() will be suboptimal when multi page bvecs go in, but it won't be broken. The way it's used now is really conflating two different things, but where it's used for mapping a bio to a single buffer (i.e. not before bio_alloc_pages) that can be switched to something that just creates a single bvec.
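Kent's point that the single-buffer use of bch_bio_map() "can be switched to something that just creates a single bvec" can be sketched as follows. This is a user-space illustration with mock types; bio_map_contig is a hypothetical name, and it assumes a physically contiguous buffer, which is what makes a single multi-page segment valid:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

struct bio_vec { void *bv_page; unsigned int bv_len; unsigned int bv_offset; };

struct bio {
	unsigned int bi_size;
	unsigned short bi_vcnt;
	struct bio_vec bi_io_vec[1];
};

/*
 * With multi-page bvecs, mapping a physically contiguous buffer no
 * longer needs one bvec per page: one segment spans the whole buffer,
 * however many pages it covers.
 */
static void bio_map_contig(struct bio *bio, void *base, unsigned int len)
{
	bio->bi_io_vec[0].bv_page = (void *)((uintptr_t)base & ~(PAGE_SIZE - 1));
	bio->bi_io_vec[0].bv_offset = (uintptr_t)base & (PAGE_SIZE - 1);
	bio->bi_io_vec[0].bv_len = len;
	bio->bi_vcnt = 1;
	bio->bi_size = len;
}
```

Contrast this with the current per-page loop in bch_bio_map(), which would yield one bvec per page and rely on the block layer to merge them.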
Re: [PATCH 1/6] block: add a bio_reuse helper
On Wed, Jun 13, 2018 at 09:32:04AM +0200, Christoph Hellwig wrote: > On Tue, Jun 12, 2018 at 02:16:30AM -0400, Kent Overstreet wrote: > > On Mon, Jun 11, 2018 at 09:48:01PM +0200, Christoph Hellwig wrote: > > > This abstracts out a way to reuse a bio without destroying the > > > data pointers. > > > > What is the point of this? What "data pointers" does it not destroy? > > It keeps bi_vcnt and bi_size intact in addition to bi_max_vecs/bi_io_vec > already kept by the existing bio_reset. > > The befit is that you don't need to rebuild bi_vcnt, bio_io_vec and > bi_size when reusing the bio. bi_size is not immutable though, it will usually be modified by drivers when you submit a bio. I see what you're trying to do, but your approach is busted given the way the block layer works today. You'd have to save bio->bi_iter before submitting the bio and restore it afterwards for it to work.
Re: [PATCH 1/6] block: add a bio_reuse helper
On Tue, Jun 12, 2018 at 02:16:30AM -0400, Kent Overstreet wrote: > On Mon, Jun 11, 2018 at 09:48:01PM +0200, Christoph Hellwig wrote: > > This abstracts out a way to reuse a bio without destroying the > > data pointers. > > What is the point of this? What "data pointers" does it not destroy? It keeps bi_vcnt and bi_size intact in addition to bi_max_vecs/bi_io_vec already kept by the existing bio_reset. The benefit is that you don't need to rebuild bi_vcnt, bi_io_vec and bi_size when reusing the bio.