Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
On 5/25/18 4:18 PM, Jeff Moyer wrote:
> Hi, Jens,
>
> Jens Axboe writes:
>
>> On 5/25/18 3:14 PM, Jeff Moyer wrote:
>>> Bryan Gurney reported I/O errors when using dm-zoned with a host-managed
>>> SMR device.  It turns out he was using CFQ, which is the default.
>>> Unfortunately, as of v4.16, only the deadline schedulers work well with
>>> host-managed SMR devices.  This series attempts to switch the elevator
>>> to deadline for those devices.
>>>
>>> NOTE: I'm not super happy with setting up one iosched and then
>>> immediately tearing it down.  I'm open to suggestions on better ways
>>> to accomplish this goal.
>>
>> Let's please not do this, a few years ago I finally managed to kill
>> drivers changing the scheduler manually. Why can't this go into a
>> udev (or similar) rule? That's where it belongs, imho.
>
> We could do that.  The downside is that distros will have to pick up
> udev rules, which they haven't done yet, and the udev rules will have to
> be kernel version dependent.  And then later, when this restriction is
> lifted, we'll have to update the udev rules.  That also sounds awful to
> me.

They only have to be feature dependent, which isn't that hard. And if
I had to pick between a kernel upgrade and a udev rule package update,
the choice is pretty clear.

> I understand why you don't like this patch set, but I happen to think
> the alternative is worse.  FYI, in Bryan's case, his system actually got
> bricked (likely due to buggy firmware).

I disagree, I think the rule approach is much easier. If the wrong
write location bricked the drive, then I think that user has much
larger issues... That seems like a trivial issue that should have been
caught in basic testing, I would not trust that drive with any data if
it bricks that easily.

-- 
Jens Axboe
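[Editorial note: the udev-rule approach Jens suggests could look roughly like the sketch below. This is an illustration, not a shipped rule; it keys off the sysfs `queue/zoned` attribute, which reports "host-managed" for host-managed SMR devices, and writes the scheduler name into `queue/scheduler`. Whether "deadline" or "mq-deadline" applies depends on whether the device uses blk-mq.]

```
# Hypothetical udev rule: select an SMR-aware I/O scheduler for
# host-managed zoned block devices.
ACTION=="add|change", SUBSYSTEM=="block", \
  ATTR{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline"
```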
[PATCH v3] block/loop: Serialize ioctl operations.
syzbot is reporting a NULL pointer dereference [1] caused by a race
between ioctl(loop_fd, LOOP_CLR_FD, 0) and ioctl(other_loop_fd,
LOOP_SET_FD, loop_fd), due to traversing other loop devices without
holding the corresponding locks. syzbot is also reporting a circular
locking dependency between bdev->bd_mutex and lo->lo_ctl_mutex [2],
caused by calling blkdev_reread_part() with the lock held.

Since ioctl() requests on loop devices are not frequent operations, we
don't need fine-grained locking. Let's use a global lock and simplify
the locking order. The strategy is that the global lock is held upon
entry to an ioctl() request, and released either before starting
operations which might deadlock or upon leaving the ioctl() request.
After the global lock is released, the current thread must no longer
touch the "struct loop_device" memory, because it might be modified by
the next ioctl() request, which was waiting for the current one.

In order to enforce this strategy, this patch swaps the order of
loop_reread_partitions() and loop_unprepare_queue() in loop_clr_fd().
I don't know whether this breaks something, but I don't have test
cases. Since this patch serializes using a global lock, the race bugs
should no longer exist, so it should be easy to test whether this
patch broke anything.
[1] https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3 [2] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889 Signed-off-by: Tetsuo Handa Reported-by: syzbot Reported-by: syzbot Cc: Jens Axboe --- drivers/block/loop.c | 231 --- drivers/block/loop.h | 1 - 2 files changed, 128 insertions(+), 104 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 55cf554..feb9fa7 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -81,11 +81,50 @@ #include static DEFINE_IDR(loop_index_idr); -static DEFINE_MUTEX(loop_index_mutex); +static DEFINE_MUTEX(loop_mutex); +static void *loop_mutex_owner; /* == __mutex_owner(&loop_mutex) */ static int max_part; static int part_shift; +/* + * lock_loop - Lock loop_mutex. + */ +static void lock_loop(void) +{ + mutex_lock(&loop_mutex); + loop_mutex_owner = current; +} + +/* + * lock_loop_killable - Lock loop_mutex unless killed. + */ +static int lock_loop_killable(void) +{ + int err = mutex_lock_killable(&loop_mutex); + + if (err) + return err; + loop_mutex_owner = current; + return 0; +} + +/* + * unlock_loop - Unlock loop_mutex as needed. + * + * Explicitly call this function before calling fput() or blkdev_reread_part() + * in order to avoid circular lock dependency. After this function is called, + * current thread is no longer allowed to access "struct loop_device" memory, + * for another thread would access that memory as soon as loop_mutex is held. + */ +static void unlock_loop(void) +{ + if (loop_mutex_owner == current) { + loop_mutex_owner = NULL; + mutex_unlock(&loop_mutex); + } +} + static int transfer_xor(struct loop_device *lo, int cmd, struct page *raw_page, unsigned raw_off, struct page *loop_page, unsigned loop_off, @@ -626,7 +665,12 @@ static void loop_reread_partitions(struct loop_device *lo, struct block_device *bdev) { int rc; + /* Save variables which might change after unlock_loop() is called. 
*/ + char filename[sizeof(lo->lo_file_name)]; + const int num = lo->lo_number; + memmove(filename, lo->lo_file_name, sizeof(filename)); + unlock_loop(); /* * bd_mutex has been held already in release path, so don't * acquire it if this function is called in such case. @@ -641,7 +685,7 @@ static void loop_reread_partitions(struct loop_device *lo, rc = blkdev_reread_part(bdev); if (rc) pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n", - __func__, lo->lo_number, lo->lo_file_name, rc); + __func__, num, filename, rc); } /* @@ -659,6 +703,7 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev, struct inode*inode; int error; + lockdep_assert_held(&loop_mutex); error = -ENXIO; if (lo->lo_state != Lo_bound) goto out; @@ -695,12 +740,14 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev, loop_update_dio(lo); blk_mq_unfreeze_queue(lo->lo_queue); - fput(old_file); if (lo->lo_flags & LO_FLAGS_PARTSCAN) - loop_reread_partitions(lo, bdev); + loop_reread_partitions(lo, bdev); /* calls unlock_loop() */ + unlock_loop(); + fput(old_file); return 0; out_putf: + unlock_loop(); fput(file); out: return error; @@ -884,6 +93
Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO
Hi Johannes,

I tried your previous memdelay patches before this new set was posted
and the results were promising for predicting when an Android system is
close to OOM. I'm definitely going to try this one after I backport it
to 4.9.

On Mon, May 7, 2018 at 2:01 PM, Johannes Weiner wrote:
> Hi,
>
> I previously submitted a version of this patch set called "memdelay",
> which translated delays from reclaim, swap-in, thrashing page cache
> into a pressure percentage of lost walltime. I've since extended this
> code to aggregate all delay states tracked by delayacct in order to
> have generalized pressure/overcommit levels for CPU, memory, and IO.
>
> There was feedback from Peter on the previous version that I have
> incorporated as much as possible and as it still applies to this code:
>
>  - got rid of the extra lock in the sched callbacks; all task
>    state changes we care about serialize through rq->lock
>
>  - got rid of ktime_get() inside the sched callbacks and
>    switched time measuring to rq_clock()
>
>  - got rid of all divisions inside the sched callbacks,
>    tracking everything natively in ns now
>
> I also moved this stuff into existing sched/stat.h callbacks, so it
> doesn't get in the way in sched/core.c, and of course moved the whole
> thing behind CONFIG_PSI since not everyone is going to want it.

Would it make sense to split CONFIG_PSI into CONFIG_PSI_CPU,
CONFIG_PSI_MEM and CONFIG_PSI_IO, since one might need only a specific
subset of this feature?

>
> Real-world applications
>
> Since the last posting, we've begun using the data collected by this
> code quite extensively at Facebook, and with several success stories.
>
> First we used it on systems that frequently locked up in low memory
> situations.
> The reason this happens is that the OOM killer is
> triggered by reclaim not being able to make forward progress, but with
> fast flash devices there is *always* some clean and uptodate cache to
> reclaim; the OOM killer never kicks in, even as tasks wait 80-90% of
> the time faulting executables. There is no situation where this ever
> makes sense in practice. We wrote a <100 line POC python script to
> monitor memory pressure and kill stuff manually, way before such
> pathological thrashing.
>
> We've since extended the python script into a more generic oomd that
> we use all over the place, not just to avoid livelocks but also to
> guarantee latency and throughput SLAs, since they're usually violated
> way before the kernel OOM killer would ever kick in.
>
> We also use the memory pressure info for loadshedding. Our batch job
> infrastructure used to refuse new requests on heuristics based on RSS
> and other existing VM metrics in an attempt to avoid OOM kills and
> maximize utilization. Since it was still plagued by frequent OOM
> kills, we switched it to shed load on psi memory pressure, which has
> turned out to be a much better bellwether, and we managed to reduce
> OOM kills drastically. Reducing the rate of OOM outages from the
> worker pool raised its aggregate productivity, and we were able to
> switch that service to smaller machines.
>
> Lastly, we use cgroups to isolate a machine's main workload from
> maintenance crap like package upgrades, logging, configuration, as
> well as to prevent multiple workloads on a machine from stepping on
> each others' toes. We were not able to do this properly without the
> pressure metrics; we would see latency or bandwidth drops, but it
> would often be hard to impossible to rootcause it post-mortem. We now
> log and graph the pressure metrics for all containers in our fleet and
> can trivially link service drops to resource pressure after the fact.
>
> How do you use this?
>
> A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
> 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
> cpu.pressure, memory.pressure and io.pressure files, which simply
> calculate pressure at the cgroup level instead of system-wide.
>
> The cpu file contains one line:
>
>     some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
>
> The averages give the percentage of walltime in which some tasks are
> delayed on the runqueue while another task has the CPU. They're recent
> averages over 10s, 1m, 5m windows, so you can tell short term trends
> from long term ones, similarly to the load average.
>
> What to make of this number? If CPU utilization is at 100% and CPU
> pressure is 0, it means the system is perfectly utilized, with one
> runnable thread per CPU and nobody waiting. At two or more runnable
> tasks per CPU, the system is 100% overcommitted and the pressure
> average will indicate as much. From a utilization perspective this is
> a great state of course: no CPU cycles are being wasted, even when 50%
> of the threads were to go idle (and most workloads do vary). From the
> perspective of the individual job it's not great, however, and they
> might do better with more resou
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
Hi, Jens,

Jens Axboe writes:

> On 5/25/18 3:14 PM, Jeff Moyer wrote:
>> Bryan Gurney reported I/O errors when using dm-zoned with a host-managed
>> SMR device.  It turns out he was using CFQ, which is the default.
>> Unfortunately, as of v4.16, only the deadline schedulers work well with
>> host-managed SMR devices.  This series attempts to switch the elevator
>> to deadline for those devices.
>>
>> NOTE: I'm not super happy with setting up one iosched and then
>> immediately tearing it down.  I'm open to suggestions on better ways
>> to accomplish this goal.
>
> Let's please not do this, a few years ago I finally managed to kill
> drivers changing the scheduler manually. Why can't this go into a
> udev (or similar) rule? That's where it belongs, imho.

We could do that.  The downside is that distros will have to pick up
udev rules, which they haven't done yet, and the udev rules will have to
be kernel version dependent.  And then later, when this restriction is
lifted, we'll have to update the udev rules.  That also sounds awful to
me.

I understand why you don't like this patch set, but I happen to think
the alternative is worse.  FYI, in Bryan's case, his system actually got
bricked (likely due to buggy firmware).

Ultimately, it's your call of course.

Cheers,
Jeff
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
On 5/25/18 3:14 PM, Jeff Moyer wrote:
> Bryan Gurney reported I/O errors when using dm-zoned with a host-managed
> SMR device.  It turns out he was using CFQ, which is the default.
> Unfortunately, as of v4.16, only the deadline schedulers work well with
> host-managed SMR devices.  This series attempts to switch the elevator
> to deadline for those devices.
>
> NOTE: I'm not super happy with setting up one iosched and then
> immediately tearing it down.  I'm open to suggestions on better ways
> to accomplish this goal.

Let's please not do this, a few years ago I finally managed to kill
drivers changing the scheduler manually. Why can't this go into a
udev (or similar) rule? That's where it belongs, imho.

-- 
Jens Axboe
[PATCH 2/2] block: default to deadline for host-managed SMR devices
Bryan Gurney reported a series of I/O errors resulting from using CFQ
with a host-managed SMR disk.  After commit 39051dd85f28 ("scsi: sd:
Remove zone write locking"), which was merged in 4.16, users should use
one of the deadline I/O schedulers (unless applications are very careful
about only submitting 1 I/O per zone).  Change our defaults to provide a
working configuration.

Reported-by: Bryan Gurney
Signed-off-by: Jeff Moyer
---
 block/blk-sysfs.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d00d1b0..ec2f29b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -908,6 +908,30 @@ int blk_register_queue(struct gendisk *disk)
 			return ret;
 		}
 	}
+
+	/*
+	 * Temporary work-around for host-managed SMR devices: right now,
+	 * the deadline I/O schedulers are the only ones that are SMR-
+	 * aware.  Use of other schedulers will almost always result
+	 * in I/O errors due to out-of-order writes to zones.  For now,
+	 * force deadline if we can.  The eventual goal is to get rid
+	 * of this in favor of a dispatch layer that would work with
+	 * all I/O schedulers.
+	 */
+	if (blk_queue_zoned_model(q) == BLK_ZONED_HM) {
+		ret = __elevator_change(q, "deadline", false);
+		if (ret == 0)
+			printk("%s: switched to deadline I/O scheduler for "
+			       "host-managed SMR device %s\n", __func__,
+			       disk->disk_name);
+		else {
+			printk("%s: warning: unable to switch to SMR-aware "
+			       "deadline I/O scheduler for host-managed SMR "
+			       "device %s\n", __func__, disk->disk_name);
+			printk("Consider building deadline and mq-deadline "
+			       "into your kernel (not as modules)\n");
+		}
+	}
 	ret = 0;
 unlock:
 	mutex_unlock(&q->sysfs_lock);
-- 
2.8.2.335.g4bb51ae
[PATCH 0/2][RFC] block: default to deadline for SMR devices
Bryan Gurney reported I/O errors when using dm-zoned with a host-managed
SMR device.  It turns out he was using CFQ, which is the default.
Unfortunately, as of v4.16, only the deadline schedulers work well with
host-managed SMR devices.  This series attempts to switch the elevator
to deadline for those devices.

NOTE: I'm not super happy with setting up one iosched and then
immediately tearing it down.  I'm open to suggestions on better ways
to accomplish this goal.

[PATCH 1/2] block: __elevator_change: add try_loading parameter
[PATCH 2/2] block: default to deadline for host-managed SMR devices
[PATCH 1/2] block: __elevator_change: add try_loading parameter
The next patch will add a caller that can't trigger module loads.
Also export this function for that caller.

Signed-off-by: Jeff Moyer
---
 block/blk.h      | 2 ++
 block/elevator.c | 7 ++++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index b034fd2..f6c3cc1 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -233,6 +233,8 @@ static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq
 
 int elv_register_queue(struct request_queue *q);
 void elv_unregister_queue(struct request_queue *q);
+int __elevator_change(struct request_queue *q, const char *name,
+		      bool try_loading);
 
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno);
 
diff --git a/block/elevator.c b/block/elevator.c
index e87e9b4..5a91a43 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1075,7 +1075,8 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 /*
  * Switch this queue to the given IO scheduler.
  */
-static int __elevator_change(struct request_queue *q, const char *name)
+int __elevator_change(struct request_queue *q, const char *name,
+		      bool try_loading)
 {
 	char elevator_name[ELV_NAME_MAX];
 	struct elevator_type *e;
@@ -1091,7 +1092,7 @@ static int __elevator_change(struct request_queue *q, const char *name)
 		return elevator_switch(q, NULL);
 
 	strlcpy(elevator_name, name, sizeof(elevator_name));
-	e = elevator_get(q, strstrip(elevator_name), true);
+	e = elevator_get(q, strstrip(elevator_name), try_loading);
 	if (!e)
 		return -EINVAL;
 
@@ -1119,7 +1120,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 	if (!(q->mq_ops || q->request_fn) || !elv_support_iosched(q))
 		return count;
 
-	ret = __elevator_change(q, name);
+	ret = __elevator_change(q, name, true);
 	if (!ret)
 		return count;
-- 
2.8.2.335.g4bb51ae
Re: [PATCH 2/2] xfs: add support for sub-pagesize writeback without buffer_heads
On Wed, May 23, 2018 at 04:46:46PM +0200, Christoph Hellwig wrote: > Switch to using the iomap_page structure for checking sub-page uptodate > status and track sub-page I/O completion status, and remove large > quantities of boilerplate code working around buffer heads. > > Signed-off-by: Christoph Hellwig > --- > fs/xfs/xfs_aops.c | 536 +++-- > fs/xfs/xfs_buf.h | 1 - > fs/xfs/xfs_iomap.c | 3 - > fs/xfs/xfs_super.c | 2 +- > fs/xfs/xfs_trace.h | 18 +- > 5 files changed, 79 insertions(+), 481 deletions(-) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index efa2cbb27d67..d279929e53fb 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c ... > @@ -768,7 +620,7 @@ xfs_aops_discard_page( > int error; > > if (XFS_FORCED_SHUTDOWN(mp)) > - goto out_invalidate; > + goto out; > > xfs_alert(mp, > "page discard on page "PTR_FMT", inode 0x%llx, offset %llu.", > @@ -778,15 +630,15 @@ xfs_aops_discard_page( > PAGE_SIZE / i_blocksize(inode)); > if (error && !XFS_FORCED_SHUTDOWN(mp)) > xfs_alert(mp, "page discard unable to remove delalloc > mapping."); > -out_invalidate: > - xfs_vm_invalidatepage(page, 0, PAGE_SIZE); > +out: > + iomap_invalidatepage(page, 0, PAGE_SIZE); All this does is lose the tracepoint. I don't think this call needs to change. The rest looks Ok to me, but I still need to run some tests on the whole thing. Brian > } > > /* > * We implement an immediate ioend submission policy here to avoid needing to > * chain multiple ioends and hence nest mempool allocations which can violate > * forward progress guarantees we need to provide. The current ioend we are > - * adding buffers to is cached on the writepage context, and if the new > buffer > + * adding blocks to is cached on the writepage context, and if the new block > * does not append to the cached ioend it will create a new ioend and cache > that > * instead. 
> * > @@ -807,41 +659,28 @@ xfs_writepage_map( > uint64_tend_offset) > { > LIST_HEAD(submit_list); > + struct iomap_page *iop = to_iomap_page(page); > + unsignedlen = i_blocksize(inode); > struct xfs_ioend*ioend, *next; > - struct buffer_head *bh = NULL; > - ssize_t len = i_blocksize(inode); > - int error = 0; > - int count = 0; > - loff_t file_offset;/* file offset of page */ > - unsignedpoffset;/* offset into page */ > + int error = 0, count = 0, i; > + u64 file_offset;/* file offset of page */ > > - if (page_has_buffers(page)) > - bh = page_buffers(page); > + ASSERT(iop || i_blocksize(inode) == PAGE_SIZE); > + ASSERT(!iop || atomic_read(&iop->write_count) == 0); > > /* > - * Walk the blocks on the page, and we we run off then end of the > - * current map or find the current map invalid, grab a new one. > - * We only use bufferheads here to check per-block state - they no > - * longer control the iteration through the page. This allows us to > - * replace the bufferhead with some other state tracking mechanism in > - * future. > + * Walk through the page to find areas to write back. If we run off the > + * end of the current map or find the current map invalid, grab a new > + * one. >*/ > - for (poffset = 0, file_offset = page_offset(page); > - poffset < PAGE_SIZE; > - poffset += len, file_offset += len) { > - /* past the range we are writing, so nothing more to write. */ > - if (file_offset >= end_offset) > - break; > - > + for (i = 0, file_offset = page_offset(page); > + i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset; > + i++, file_offset += len) { > /* >* Block does not contain valid data, skip it. 
>*/ > - if (bh && !buffer_uptodate(bh)) { > - if (PageUptodate(page)) > - ASSERT(buffer_mapped(bh)); > - bh = bh->b_this_page; > + if (iop && !test_bit(i, iop->uptodate)) > continue; > - } > > /* >* If we don't have a valid map, now it's time to get a new one > @@ -854,52 +693,33 @@ xfs_writepage_map( > error = xfs_map_blocks(inode, file_offset, &wpc->imap, >&wpc->io_type); > if (error) > - goto out; > + break; > } > > - if (wpc->io_type == XFS_IO_HOLE) { > -
Re: [PATCH 1/2] iomap: add support for sub-pagesize buffered I/O without buffer heads
On Wed, May 23, 2018 at 04:46:45PM +0200, Christoph Hellwig wrote: > After already supporting a simple implementation of buffered writes for > the blocksize == PAGE_SIZE case in the last commit this adds full support > even for smaller block sizes. There are three bits of per-block > information in the buffer_head structure that really matter for the iomap > read and write path: > > - uptodate status (BH_uptodate) > - marked as currently under read I/O (BH_Async_Read) > - marked as currently under write I/O (BH_Async_Write) > > Instead of having new per-block structures this now adds a per-page > structure called struct iomap_page to track this information in a slightly > different form: > > - a bitmap for the per-block uptodate status. For worst case of a 64k >page size system this bitmap needs to contain 128 bits. For the >typical 4k page size case it only needs 8 bits, although we still >need a full unsigned long due to the way the atomic bitmap API works. > - two atomic_t counters are used to track the outstanding read and write >counts > > There is quite a bit of boilerplate code as the buffered I/O path uses > various helper methods, but the actual code is very straight forward. > > Signed-off-by: Christoph Hellwig > --- > fs/iomap.c| 247 +++--- > include/linux/iomap.h | 31 ++ > 2 files changed, 260 insertions(+), 18 deletions(-) > > diff --git a/fs/iomap.c b/fs/iomap.c > index debb859a8a14..ea746e0287f9 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c ... 
> @@ -104,6 +105,107 @@ iomap_sector(struct iomap *iomap, loff_t pos) > return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT; > } > > +static struct iomap_page * > +iomap_page_create(struct inode *inode, struct page *page) > +{ > + struct iomap_page *iop = to_iomap_page(page); > + > + if (iop || i_blocksize(inode) == PAGE_SIZE) > + return iop; > + > + iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL); > + atomic_set(&iop->read_count, 0); > + atomic_set(&iop->write_count, 0); > + bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE); > + set_page_private(page, (unsigned long)iop); > + SetPagePrivate(page); The buffer head implementation does a get/put page when the private state is set. I'm not quite sure why that is tbh, but do you know whether we need that here or not? > + return iop; > +} > + ... > @@ -142,18 +244,19 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, > loff_t length, void *data, > { > struct iomap_readpage_ctx *ctx = data; > struct page *page = ctx->cur_page; > - unsigned poff = pos & (PAGE_SIZE - 1); > - unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length); > + struct iomap_page *iop = iomap_page_create(inode, page); > bool is_contig = false; > + loff_t orig_pos = pos; > + unsigned poff, plen; > sector_t sector; > > - /* we don't support blocksize < PAGE_SIZE quite yet: */ > - WARN_ON_ONCE(pos != page_offset(page)); > - WARN_ON_ONCE(plen != PAGE_SIZE); > + iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen); > + if (plen == 0) > + goto done; > > if (iomap->type != IOMAP_MAPPED || pos >= i_size_read(inode)) { > zero_user(page, poff, plen); > - SetPageUptodate(page); > + iomap_set_range_uptodate(page, poff, plen); > goto done; > } > > @@ -169,6 +272,14 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, > loff_t length, void *data, > is_contig = true; > } > > + /* > + * If we start a new segment we need to increase the read count, and we > + * need to do so before submitting any previous full bio to make sure > + * 
that we don't prematurely unlock the page. > + */ > + if (iop) > + atomic_inc(&iop->read_count); > + > if (!ctx->bio || !is_contig || bio_full(ctx->bio)) { > if (ctx->bio) > submit_bio(ctx->bio); > @@ -177,7 +288,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, > loff_t length, void *data, > > __bio_add_page(ctx->bio, page, plen, poff); > done: > - return plen; > + return pos - orig_pos + plen; A brief comment here (or above the adjust_read_range() call) to explain the final length calculation would be helpful. E.g., it looks like leading uptodate blocks are part of the read while trailing uptodate blocks can be truncated by the above call. > } > > int > @@ -188,8 +299,6 @@ iomap_readpage(struct page *page, const struct iomap_ops > *ops) > unsigned poff; > loff_t ret; > > - WARN_ON_ONCE(page_has_buffers(page)); > - > for (poff = 0; poff < PAGE_SIZE; poff += ret) { > ret = iomap_apply(inode, page_offset(page) + poff, > PAGE_SIZE - poff, 0, ops, &ctx, > @@ -295,6 +404,92 @@ iomap_readpages(struct add
Re: [RESEND PATCH V5 00/33] block: support multipage bvec
On 5/24/18 10:53 PM, Kent Overstreet wrote: > On Fri, May 25, 2018 at 11:45:48AM +0800, Ming Lei wrote: >> Hi, >> >> This patchset brings multipage bvec into block layer: > > patch series looks sane to me. goddamn that's a lot of renaming. Indeed... I actually objected to some of the segment -> page renaming, but it's still in there. The foo2() temporary functions also concern me, we all know there's nothing more permanent than a temporary fixup. > Things are going to get interesting when we start sticking compound pages in > the > page cache, there'll be some interesting questions of semantics to deal with > then but I think getting this will only help w.r.t. plumbing that through and > not dealing with 4k pages unnecessarily - but I think even if we were to > decide > that merging in bio_add_page() is not the way to go when the upper layers are > passing compound pages around already, this patch series helps because > regardless at some point everything under generic_make_request() is going to > have to deal with segments that are more than one page, and this patch series > makes that happen. So incremental progress. > > Jens, any objections to getting this in? I like most of it, but I'd much rather get this way earlier in the series. We're basically just one week away from the merge window, it needs more simmer and testing time than that. On top of that, it hasn't received much review yet. So as far as I'm concerned, we can kick off the 4.19 block branch with iterated versions of this patchset. -- Jens Axboe
Re: [RESEND PATCH V5 33/33] block: document usage of bio iterator helpers
On 05/24/2018 08:46 PM, Ming Lei wrote: > Now multipage bvec is supported, and some helpers may return page by > page, and some may return segment by segment, this patch documents the > usage for helping us use them correctly. > > Signed-off-by: Ming Lei > --- > Documentation/block/biovecs.txt | 32 > 1 file changed, 32 insertions(+) > > diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt > index b4d238b8d9fc..32a6643caeca 100644 > --- a/Documentation/block/biovecs.txt > +++ b/Documentation/block/biovecs.txt > @@ -117,3 +117,35 @@ Other implications: > size limitations and the limitations of the underlying devices. Thus > there's no need to define ->merge_bvec_fn() callbacks for individual block > drivers. > + > +Usage of helpers: > += > + > +* The following helpers which name has suffix of "_all" can only be used on * The following helpers, whose names have the suffix "_all", can only be used on > +non-BIO_CLONED bio, and ususally they are used by filesystem code, and driver usually > +shouldn't use them becasue bio may have been splitted before they got to the because split > +driver: > + > + bio_for_each_segment_all() > + bio_for_each_page_all() > + bio_pages_all() > + bio_first_bvec_all() > + bio_first_page_all() > + bio_last_bvec_all() > + segment_for_each_page_all() > + > +* The following helpers iterate bio page by page, and the local variable of > +'struct bio_vec' or the reference records single page io vector during the > +itearation: iteration: > + > + bio_for_each_page() > + bio_for_each_page_all() > + segment_for_each_page_all() > + > +* The following helpers iterate bio segment by segment, and each segment may > +include multiple physically contiguous pages, and the local variable of > +'struct bio_vec' or the reference records multi page io vector during the > +itearation: iteration: > + > + bio_for_each_segment() > + bio_for_each_segment_all() > -- ~Randy
Re: [PATCH] Block: null_blk: add blocking description and remove lightnvm
On 5/25/18 8:40 AM, Liu Bo wrote: > - The description of 'blocking' is missing in null_blk.txt > > - The 'lightnvm' parameter has been removed in null_blk.c > > This updates both in null_blk.txt. Thanks, applied. -- Jens Axboe
Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI
On Thu, May 24, 2018 at 10:06:59AM -0700, Kees Cook wrote:
> On Thu, May 24, 2018 at 1:00 AM, Christoph Hellwig wrote:
> > On Wed, May 23, 2018 at 03:14:19PM -0600, Jens Axboe wrote:
> >> Ugh, so that would necessitate a change there too. As I said before,
> >> I don't really care where it lives. I know the SCSI folks seem bothered
> >> by moving it, but in reality, it's not like this stuff will likely ever
> >> really change. Of the two choices (select entire SCSI stack, or just move
> >> this little bit), I know what I would consider the saner option...
> >
> > Oh well. How about something like this respin of Kees' series?
> >
> > http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/sense-cleanup
>
> Does the CONFIG_PCMCIA in drivers/scsi/Makefile now get exposed in
> weird config cases?

That just includes drivers/scsi/pcmcia/Makefile, which only has three
conditional modules in it, so it should be fine.

> Otherwise, yeah, looks good to me.

Thanks!  Can you pick up my tweaks and resend?
[PATCH] Block: null_blk: add blocking description and remove lightnvm
- The description of 'blocking' is missing in null_blk.txt

- The 'lightnvm' parameter has been removed in null_blk.c

This updates both in null_blk.txt.

Signed-off-by: Liu Bo
---
 Documentation/block/null_blk.txt | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
index 733927a7b501..07f147381f32 100644
--- a/Documentation/block/null_blk.txt
+++ b/Documentation/block/null_blk.txt
@@ -71,13 +71,16 @@ use_per_node_hctx=[0/1]: Default: 0
      1: The multi-queue block layer is instantiated with a hardware dispatch
         queue for each CPU node in the system.
 
-use_lightnvm=[0/1]: Default: 0
-  Register device with LightNVM. Requires blk-mq and CONFIG_NVM to be enabled.
-
 no_sched=[0/1]: Default: 0
   0: nullb* use default blk-mq io scheduler.
   1: nullb* doesn't use io scheduler.
 
+blocking=[0/1]: Default: 0
+  0: Register as a non-blocking blk-mq driver device.
+  1: Register as a blocking blk-mq driver device, null_blk will set
+     the BLK_MQ_F_BLOCKING flag, indicating that it sometimes/always
+     needs to block in its ->queue_rq() function.
+
 shared_tags=[0/1]: Default: 0
   0: Tag set is not shared.
   1: Tag set shared between devices for blk-mq. Only makes sense with
-- 
1.8.3.1
Re: [PATCH] bdi: Fix oops in wb_workfn()
Jan Kara wrote:
>> void delayed_work_timer_fn(struct timer_list *t)
>> {
>> 	struct delayed_work *dwork = from_timer(dwork, t, timer);
>>
>> 	/* should have been called from irqsafe timer with irq already off */
>> 	__queue_work(dwork->cpu, dwork->wq, &dwork->work);
>> }
>>
>> Then, wb_workfn() is after all scheduled even if we check for
>> WB_registered bit, isn't it?
>
> It can be queued after WB_registered bit is cleared but it cannot be queued
> after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That function
> deletes the pending timer (the timer cannot be armed again because
> WB_registered is cleared) and queues what should be the last round of
> wb_workfn().

mod_delayed_work() deletes the pending timer but does not wait for an
already invoked timer handler to complete, because it uses del_timer()
rather than del_timer_sync(). Then, what happens if __queue_work() is
almost concurrently executed from two CPUs, one from mod_delayed_work(bdi_wq,
&wb->dwork, 0) in the wb_shutdown() path (which is called without
spin_lock_bh(&wb->work_lock)) and the other from the delayed_work_timer_fn()
path (which is called without checking the WB_registered bit under
spin_lock_bh(&wb->work_lock)) ?

wb_wakeup_delayed() {
  spin_lock_bh(&wb->work_lock);
  if (test_bit(WB_registered, &wb->state)) // succeeds
    queue_delayed_work(bdi_wq, &wb->d_work, timeout) {
      queue_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->d_work, timeout) {
        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
                              work_data_bits(&wb->d_work.work))) { // succeeds
          __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->d_work, timeout) {
            add_timer(timer); // schedules for delayed_work_timer_fn()
          }
        }
      }
    }
  spin_unlock_bh(&wb->work_lock);
}

delayed_work_timer_fn() {
  // del_timer() already returns false at this point because this timer
  // is already inside handler. But something took long here enough to
  // wait for __queue_work() from wb_shutdown() path to finish?
  __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->d_work.work) {
    insert_work(pwq, work, worklist, work_flags);
  }
}

wb_shutdown() {
  mod_delayed_work(bdi_wq, &wb->dwork, 0) {
    mod_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
      ret = try_to_grab_pending(&wb->dwork.work, true, &flags) {
        if (likely(del_timer(&wb->dwork.timer)))
          // fails because already in delayed_work_timer_fn()
          return 1;
        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
                              work_data_bits(&wb->dwork.work)))
          // fails because already set by queue_delayed_work()
          return 0;
        // Returns 1 or -ENOENT after doing something?
      }
      if (ret >= 0)
        __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
          __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork.work) {
            insert_work(pwq, work, worklist, work_flags);
          }
        }
    }
  }
}