Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-25 Thread Jens Axboe
On 5/25/18 4:18 PM, Jeff Moyer wrote:
> Hi, Jens,
> 
> Jens Axboe  writes:
> 
>> On 5/25/18 3:14 PM, Jeff Moyer wrote:
>>> Bryan Gurney reported I/O errors when using dm-zoned with a host-managed
>>> SMR device.  It turns out he was using CFQ, which is the default.
>>> Unfortunately, as of v4.16, only the deadline schedulers work well with
>>> host-managed SMR devices.  This series attempts to switch the elevator
>>> to deadline for those devices.
>>>
>>> NOTE: I'm not super happy with setting up one iosched and then
>>> immediately tearing it down.  I'm open to suggestions on better ways
>>> to accomplish this goal.
>>
>> Let's please not do this, a few years ago I finally managed to kill
>> drivers changing the scheduler manually. Why can't this go into a
>> udev (or similar) rule? That's where it belongs, imho.
> 
> We could do that.  The downside is that distros will have to pick up
> udev rules, which they haven't done yet, and the udev rules will have to
> be kernel version dependent.  And then later, when this restriction is
> lifted, we'll have to update the udev rules.  That also sounds awful to
> me.

They only have to be feature dependent, which isn't that hard. And if I
had to pick between a kernel upgrade and a udev rule package update, the
choice is pretty clear.
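
Such a rule could presumably just key off the queue/zoned sysfs attribute.
An untested sketch (the exact match keys are a guess, and queue/zoned is
only present on kernels that know about zoned devices):

  ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline"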

> I understand why you don't like this patch set, but I happen to think
> the alternative is worse.  FYI, in Bryan's case, his system actually got
> bricked (likely due to buggy firmware).

I disagree, I think the rule approach is much easier. If the wrong write
location bricked the drive, then I think that user has much larger
issues... That seems like a trivial issue that should have been caught
in basic testing, I would not trust that drive with any data if it
bricks that easily.

-- 
Jens Axboe



[PATCH v3] block/loop: Serialize ioctl operations.

2018-05-25 Thread Tetsuo Handa
syzbot is reporting a NULL pointer dereference [1] caused by a race
condition between ioctl(loop_fd, LOOP_CLR_FD, 0) and
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd), due to traversing other
loop devices without holding the corresponding locks.

syzbot is also reporting a circular locking dependency between bdev->bd_mutex
and lo->lo_ctl_mutex [2], caused by calling blkdev_reread_part() with the
lock held.

Since ioctl() requests on loop devices are not frequent operations, we don't
need fine-grained locking. Let's use a global lock and simplify the locking
order.

The strategy is to hold the global lock upon entry to an ioctl() request,
and to release it before either starting an operation which might deadlock
or leaving the ioctl() request. After the global lock is released, the
current thread no longer uses the "struct loop_device" memory, because it
might be modified by the next ioctl() request that was waiting for the
current one.

In order to enforce this strategy, this patch swaps the order of
loop_reread_partitions() and loop_unprepare_queue() in loop_clr_fd().
I don't know whether this breaks anything, but I don't have test cases.

Since this patch serializes ioctl() operations with a global lock, these race
bugs should no longer exist. Thus, it will be easy to test whether this patch
broke something.
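
In other words, every ioctl entry point ends up following a pattern like
the sketch below (illustrative only; the hunk that actually wires
lock_loop_killable()/unlock_loop() into lo_ioctl() falls in the part of
the diff that is truncated below):

	static int lo_ioctl(struct block_device *bdev, fmode_t mode,
			    unsigned int cmd, unsigned long arg)
	{
		struct loop_device *lo = bdev->bd_disk->private_data;
		int err = lock_loop_killable();	/* global lock taken on entry */

		if (err)
			return err;
		switch (cmd) {
		case LOOP_CLR_FD:
			err = loop_clr_fd(lo);	/* may drop the lock itself */
			break;
		/* ... other LOOP_* commands ... */
		default:
			err = -EINVAL;
		}
		unlock_loop();	/* no-op if the handler already dropped it */
		return err;
	}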

[1] 
https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
[2] 
https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
Reported-by: syzbot 

Cc: Jens Axboe 
---
 drivers/block/loop.c | 231 ---
 drivers/block/loop.h |   1 -
 2 files changed, 128 insertions(+), 104 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 55cf554..feb9fa7 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -81,11 +81,50 @@
 #include 
 
 static DEFINE_IDR(loop_index_idr);
-static DEFINE_MUTEX(loop_index_mutex);
+static DEFINE_MUTEX(loop_mutex);
+static void *loop_mutex_owner; /* == __mutex_owner(&loop_mutex) */
 
 static int max_part;
 static int part_shift;
 
+/*
+ * lock_loop - Lock loop_mutex.
+ */
+static void lock_loop(void)
+{
+   mutex_lock(&loop_mutex);
+   loop_mutex_owner = current;
+}
+
+/*
+ * lock_loop_killable - Lock loop_mutex unless killed.
+ */
+static int lock_loop_killable(void)
+{
+   int err = mutex_lock_killable(&loop_mutex);
+
+   if (err)
+   return err;
+   loop_mutex_owner = current;
+   return 0;
+}
+
+/*
+ * unlock_loop - Unlock loop_mutex as needed.
+ *
+ * Explicitly call this function before calling fput() or blkdev_reread_part()
+ * in order to avoid circular lock dependency. After this function is called,
+ * current thread is no longer allowed to access "struct loop_device" memory,
+ * for another thread would access that memory as soon as loop_mutex is held.
+ */
+static void unlock_loop(void)
+{
+   if (loop_mutex_owner == current) {
+   loop_mutex_owner = NULL;
+   mutex_unlock(&loop_mutex);
+   }
+}
+
 static int transfer_xor(struct loop_device *lo, int cmd,
struct page *raw_page, unsigned raw_off,
struct page *loop_page, unsigned loop_off,
@@ -626,7 +665,12 @@ static void loop_reread_partitions(struct loop_device *lo,
   struct block_device *bdev)
 {
int rc;
+   /* Save variables which might change after unlock_loop() is called. */
+   char filename[sizeof(lo->lo_file_name)];
+   const int num = lo->lo_number;
 
+   memmove(filename, lo->lo_file_name, sizeof(filename));
+   unlock_loop();
/*
 * bd_mutex has been held already in release path, so don't
 * acquire it if this function is called in such case.
@@ -641,7 +685,7 @@ static void loop_reread_partitions(struct loop_device *lo,
rc = blkdev_reread_part(bdev);
if (rc)
pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
-   __func__, lo->lo_number, lo->lo_file_name, rc);
+   __func__, num, filename, rc);
 }
 
 /*
@@ -659,6 +703,7 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
struct inode*inode;
int error;
 
+   lockdep_assert_held(&loop_mutex);
error = -ENXIO;
if (lo->lo_state != Lo_bound)
goto out;
@@ -695,12 +740,14 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
loop_update_dio(lo);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-   fput(old_file);
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
-   loop_reread_partitions(lo, bdev);
+   loop_reread_partitions(lo, bdev); /* calls unlock_loop() */
+   unlock_loop();
+   fput(old_file);
return 0;
 
  out_putf:
+   unlock_loop();
fput(file);
  out:
return error;
@@ -884,6 +93

Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-25 Thread Suren Baghdasaryan
Hi Johannes,
I tried your previous memdelay patches before this new set was posted,
and the results were promising for predicting when an Android system is
close to OOM. I'm definitely going to try this one after I backport it
to 4.9.

On Mon, May 7, 2018 at 2:01 PM, Johannes Weiner  wrote:
> Hi,
>
> I previously submitted a version of this patch set called "memdelay",
> which translated delays from reclaim, swap-in, thrashing page cache
> into a pressure percentage of lost walltime. I've since extended this
> code to aggregate all delay states tracked by delayacct in order to
> have generalized pressure/overcommit levels for CPU, memory, and IO.
>
> There was feedback from Peter on the previous version that I have
> incorporated as much as possible and as it still applies to this code:
>
> - got rid of the extra lock in the sched callbacks; all task
>   state changes we care about serialize through rq->lock
>
> - got rid of ktime_get() inside the sched callbacks and
>   switched time measuring to rq_clock()
>
> - got rid of all divisions inside the sched callbacks,
>   tracking everything natively in ns now
>
> I also moved this stuff into existing sched/stat.h callbacks, so it
> doesn't get in the way in sched/core.c, and of course moved the whole
> thing behind CONFIG_PSI since not everyone is going to want it.

Would it make sense to split CONFIG_PSI into CONFIG_PSI_CPU,
CONFIG_PSI_MEM and CONFIG_PSI_IO, since one might need only a specific
subset of this feature?

>
> Real-world applications
>
> Since the last posting, we've begun using the data collected by this
> code quite extensively at Facebook, and with several success stories.
>
> First we used it on systems that frequently locked up in low memory
> situations. The reason this happens is that the OOM killer is
> triggered by reclaim not being able to make forward progress, but with
> fast flash devices there is *always* some clean and uptodate cache to
> reclaim; the OOM killer never kicks in, even as tasks wait 80-90% of
> the time faulting executables. There is no situation where this ever
> makes sense in practice. We wrote a <100 line POC python script to
> monitor memory pressure and kill stuff manually, way before such
> pathological thrashing.
>
> We've since extended the python script into a more generic oomd that
> we use all over the place, not just to avoid livelocks but also to
> guarantee latency and throughput SLAs, since they're usually violated
> way before the kernel OOM killer would ever kick in.
>
> We also use the memory pressure info for loadshedding. Our batch job
> infrastructure used to refuse new requests on heuristics based on RSS
> and other existing VM metrics in an attempt to avoid OOM kills and
> maximize utilization. Since it was still plagued by frequent OOM
> kills, we switched it to shed load on psi memory pressure, which has
> turned out to be a much better bellwether, and we managed to reduce
> OOM kills drastically. Reducing the rate of OOM outages from the
> worker pool raised its aggregate productivity, and we were able to
> switch that service to smaller machines.
>
> Lastly, we use cgroups to isolate a machine's main workload from
> maintenance crap like package upgrades, logging, configuration, as
> well as to prevent multiple workloads on a machine from stepping on
> each others' toes. We were not able to do this properly without the
> pressure metrics; we would see latency or bandwidth drops, but it
> would often be hard to impossible to rootcause it post-mortem. We now
> log and graph the pressure metrics for all containers in our fleet and
> can trivially link service drops to resource pressure after the fact.
>
> How do you use this?
>
> A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
> 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
> cpu.pressure, memory.pressure and io.pressure files, which simply
> calculate pressure at the cgroup level instead of system-wide.
>
> The cpu file contains one line:
>
> some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
>
> The averages give the percentage of walltime in which some tasks are
> delayed on the runqueue while another task has the CPU. They're recent
> averages over 10s, 1m, 5m windows, so you can tell short term trends
> from long term ones, similarly to the load average.
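
For anyone who wants to play with this before a full oomd-style daemon
exists, a minimal polling sketch is enough. This assumes the memory file
uses the same "some avg10=..." line format as the cpu example above, and
the 10% threshold is an arbitrary choice:

	/* poll /proc/pressure/memory and warn when short-term stall time is high */
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		for (;;) {
			FILE *f = fopen("/proc/pressure/memory", "r");
			float avg10;

			if (!f)
				return 1;	/* kernel built without CONFIG_PSI */
			if (fscanf(f, "some avg10=%f", &avg10) == 1 && avg10 > 10.0)
				fprintf(stderr, "memory pressure: %.2f%% of recent walltime stalled\n",
					avg10);
			fclose(f);
			sleep(2);
		}
	}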
>
> What to make of this number? If CPU utilization is at 100% and CPU
> pressure is 0, it means the system is perfectly utilized, with one
> runnable thread per CPU and nobody waiting. At two or more runnable
> tasks per CPU, the system is 100% overcommitted and the pressure
> average will indicate as much. From a utilization perspective this is
> a great state of course: no CPU cycles are being wasted, even if 50%
> of the threads were to go idle (and most workloads do vary). From the
> perspective of the individual job it's not great, however, and they
> might do better with more resou

Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-25 Thread Jeff Moyer
Hi, Jens,

Jens Axboe  writes:

> On 5/25/18 3:14 PM, Jeff Moyer wrote:
>> Bryan Gurney reported I/O errors when using dm-zoned with a host-managed
>> SMR device.  It turns out he was using CFQ, which is the default.
>> Unfortunately, as of v4.16, only the deadline schedulers work well with
>> host-managed SMR devices.  This series attempts to switch the elevator
>> to deadline for those devices.
>> 
>> NOTE: I'm not super happy with setting up one iosched and then
>> immediately tearing it down.  I'm open to suggestions on better ways
>> to accomplish this goal.
>
> Let's please not do this, a few years ago I finally managed to kill
> drivers changing the scheduler manually. Why can't this go into a
> udev (or similar) rule? That's where it belongs, imho.

We could do that.  The downside is that distros will have to pick up
udev rules, which they haven't done yet, and the udev rules will have to
be kernel version dependent.  And then later, when this restriction is
lifted, we'll have to update the udev rules.  That also sounds awful to
me.

I understand why you don't like this patch set, but I happen to think
the alternative is worse.  FYI, in Bryan's case, his system actually got
bricked (likely due to buggy firmware).

Ultimately, it's your call of course.

Cheers,
Jeff


Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-25 Thread Jens Axboe
On 5/25/18 3:14 PM, Jeff Moyer wrote:
> Bryan Gurney reported I/O errors when using dm-zoned with a host-managed
> SMR device.  It turns out he was using CFQ, which is the default.
> Unfortunately, as of v4.16, only the deadline schedulers work well with
> host-managed SMR devices.  This series attempts to switch the elevator
> to deadline for those devices.
> 
> NOTE: I'm not super happy with setting up one iosched and then
> immediately tearing it down.  I'm open to suggestions on better ways
> to accomplish this goal.

Let's please not do this, a few years ago I finally managed to kill
drivers changing the scheduler manually. Why can't this go into a
udev (or similar) rule? That's where it belongs, imho.

-- 
Jens Axboe



[PATCH 2/2] block: default to deadline for host-managed SMR devices

2018-05-25 Thread Jeff Moyer
Bryan Gurney reported a series of I/O errors resulting from using
CFQ with a host-managed SMR disk.  After commit 39051dd85f28 ("scsi:
sd: Remove zone write locking"), which was merged in 4.16, users
should use one of the deadline I/O schedulers (unless applications
are very careful about only submitting 1 I/O per zone).  Change our
defaults to provide a working configuration.

Reported-by: Bryan Gurney 
Signed-off-by: Jeff Moyer 
---
 block/blk-sysfs.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d00d1b0..ec2f29b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -908,6 +908,30 @@ int blk_register_queue(struct gendisk *disk)
return ret;
}
}
+
+   /*
+* Temporary work-around for host-managed SMR devices: right now,
+* the deadline I/O schedulers are the only ones that are SMR-
+* aware.  Use of other schedulers will almost always result
+* in I/O errors due to out-of-order writes to zones.  For now,
+* force deadline if we can.  The eventual goal is to get rid
+* of this in favor of a dispatch layer that would work with
+* all I/O schedulers.
+*/
+   if (blk_queue_zoned_model(q) == BLK_ZONED_HM) {
+   ret = __elevator_change(q, "deadline", false);
+   if (ret == 0)
+   printk("%s: switched to deadline I/O scheduler for "
+  "host-managed SMR device %s\n", __func__,
+  disk->disk_name);
+   else {
+   printk("%s: warning: unable to switch to SMR-aware "
+  "deadline I/O scheduler for host-managed SMR "
+  "device %s\n", __func__, disk->disk_name);
+   printk("Consider building deadline and mq-deadline "
+  "into your kernel (not as modules)\n");
+   }
+   }
ret = 0;
 unlock:
mutex_unlock(&q->sysfs_lock);
-- 
2.8.2.335.g4bb51ae



[PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-25 Thread Jeff Moyer
Bryan Gurney reported I/O errors when using dm-zoned with a host-managed
SMR device.  It turns out he was using CFQ, which is the default.
Unfortunately, as of v4.16, only the deadline schedulers work well with
host-managed SMR devices.  This series attempts to switch the elevator
to deadline for those devices.

NOTE: I'm not super happy with setting up one iosched and then
immediately tearing it down.  I'm open to suggestions on better ways
to accomplish this goal.

[PATCH 1/2] block: __elevator_change: add try_loading parameter
[PATCH 2/2] block: default to deadline for host-managed SMR devices


[PATCH 1/2] block: __elevator_change: add try_loading parameter

2018-05-25 Thread Jeff Moyer
The next patch will add a caller that can't trigger module loads.
Also export this function for that caller.

Signed-off-by: Jeff Moyer 
---
 block/blk.h  | 2 ++
 block/elevator.c | 7 ---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index b034fd2..f6c3cc1 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -233,6 +233,8 @@ static inline void elv_deactivate_rq(struct request_queue 
*q, struct request *rq
 
 int elv_register_queue(struct request_queue *q);
 void elv_unregister_queue(struct request_queue *q);
+int __elevator_change(struct request_queue *q, const char *name,
+ bool try_loading);
 
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno);
 
diff --git a/block/elevator.c b/block/elevator.c
index e87e9b4..5a91a43 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1075,7 +1075,8 @@ static int elevator_switch(struct request_queue *q, 
struct elevator_type *new_e)
 /*
  * Switch this queue to the given IO scheduler.
  */
-static int __elevator_change(struct request_queue *q, const char *name)
+int __elevator_change(struct request_queue *q, const char *name,
+ bool try_loading)
 {
char elevator_name[ELV_NAME_MAX];
struct elevator_type *e;
@@ -1091,7 +1092,7 @@ static int __elevator_change(struct request_queue *q, 
const char *name)
return elevator_switch(q, NULL);
 
strlcpy(elevator_name, name, sizeof(elevator_name));
-   e = elevator_get(q, strstrip(elevator_name), true);
+   e = elevator_get(q, strstrip(elevator_name), try_loading);
if (!e)
return -EINVAL;
 
@@ -1119,7 +1120,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const 
char *name,
if (!(q->mq_ops || q->request_fn) || !elv_support_iosched(q))
return count;
 
-   ret = __elevator_change(q, name);
+   ret = __elevator_change(q, name, true);
if (!ret)
return count;
 
-- 
2.8.2.335.g4bb51ae



Re: [PATCH 2/2] xfs: add support for sub-pagesize writeback without buffer_heads

2018-05-25 Thread Brian Foster
On Wed, May 23, 2018 at 04:46:46PM +0200, Christoph Hellwig wrote:
> Switch to using the iomap_page structure for checking sub-page uptodate
> status and track sub-page I/O completion status, and remove large
> quantities of boilerplate code working around buffer heads.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/xfs/xfs_aops.c  | 536 +++--
>  fs/xfs/xfs_buf.h   |   1 -
>  fs/xfs/xfs_iomap.c |   3 -
>  fs/xfs/xfs_super.c |   2 +-
>  fs/xfs/xfs_trace.h |  18 +-
>  5 files changed, 79 insertions(+), 481 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index efa2cbb27d67..d279929e53fb 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
...
> @@ -768,7 +620,7 @@ xfs_aops_discard_page(
>   int error;
>  
>   if (XFS_FORCED_SHUTDOWN(mp))
> - goto out_invalidate;
> + goto out;
>  
>   xfs_alert(mp,
>   "page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
> @@ -778,15 +630,15 @@ xfs_aops_discard_page(
>   PAGE_SIZE / i_blocksize(inode));
>   if (error && !XFS_FORCED_SHUTDOWN(mp))
>   xfs_alert(mp, "page discard unable to remove delalloc 
> mapping.");
> -out_invalidate:
> - xfs_vm_invalidatepage(page, 0, PAGE_SIZE);
> +out:
> + iomap_invalidatepage(page, 0, PAGE_SIZE);

All this does is lose the tracepoint. I don't think this call needs to
change. The rest looks Ok to me, but I still need to run some tests on
the whole thing.

Brian

>  }
>  
>  /*
>   * We implement an immediate ioend submission policy here to avoid needing to
>   * chain multiple ioends and hence nest mempool allocations which can violate
>   * forward progress guarantees we need to provide. The current ioend we are
> - * adding buffers to is cached on the writepage context, and if the new 
> buffer
> + * adding blocks to is cached on the writepage context, and if the new block
>   * does not append to the cached ioend it will create a new ioend and cache 
> that
>   * instead.
>   *
> @@ -807,41 +659,28 @@ xfs_writepage_map(
>   uint64_tend_offset)
>  {
>   LIST_HEAD(submit_list);
> + struct iomap_page   *iop = to_iomap_page(page);
> + unsignedlen = i_blocksize(inode);
>   struct xfs_ioend*ioend, *next;
> - struct buffer_head  *bh = NULL;
> - ssize_t len = i_blocksize(inode);
> - int error = 0;
> - int count = 0;
> - loff_t  file_offset;/* file offset of page */
> - unsignedpoffset;/* offset into page */
> + int error = 0, count = 0, i;
> + u64 file_offset;/* file offset of page */
>  
> - if (page_has_buffers(page))
> - bh = page_buffers(page);
> + ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
> + ASSERT(!iop || atomic_read(&iop->write_count) == 0);
>  
>   /*
> -  * Walk the blocks on the page, and we we run off then end of the
> -  * current map or find the current map invalid, grab a new one.
> -  * We only use bufferheads here to check per-block state - they no
> -  * longer control the iteration through the page. This allows us to
> -  * replace the bufferhead with some other state tracking mechanism in
> -  * future.
> +  * Walk through the page to find areas to write back. If we run off the
> +  * end of the current map or find the current map invalid, grab a new
> +  * one.
>*/
> - for (poffset = 0, file_offset = page_offset(page);
> -  poffset < PAGE_SIZE;
> -  poffset += len, file_offset += len) {
> - /* past the range we are writing, so nothing more to write. */
> - if (file_offset >= end_offset)
> - break;
> -
> + for (i = 0, file_offset = page_offset(page);
> +  i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
> +  i++, file_offset += len) {
>   /*
>* Block does not contain valid data, skip it.
>*/
> - if (bh && !buffer_uptodate(bh)) {
> - if (PageUptodate(page))
> - ASSERT(buffer_mapped(bh));
> - bh = bh->b_this_page;
> + if (iop && !test_bit(i, iop->uptodate))
>   continue;
> - }
>  
>   /*
>* If we don't have a valid map, now it's time to get a new one
> @@ -854,52 +693,33 @@ xfs_writepage_map(
>   error = xfs_map_blocks(inode, file_offset, &wpc->imap,
>&wpc->io_type);
>   if (error)
> - goto out;
> + break;
>   }
>  
> - if (wpc->io_type == XFS_IO_HOLE) {
> -

Re: [PATCH 1/2] iomap: add support for sub-pagesize buffered I/O without buffer heads

2018-05-25 Thread Brian Foster
On Wed, May 23, 2018 at 04:46:45PM +0200, Christoph Hellwig wrote:
> After already supporting a simple implementation of buffered writes for
> the blocksize == PAGE_SIZE case in the last commit this adds full support
> even for smaller block sizes.   There are three bits of per-block
> information in the buffer_head structure that really matter for the iomap
> read and write path:
> 
>  - uptodate status (BH_uptodate)
>  - marked as currently under read I/O (BH_Async_Read)
>  - marked as currently under write I/O (BH_Async_Write)
> 
> Instead of having new per-block structures this now adds a per-page
> structure called struct iomap_page to track this information in a slightly
> different form:
> 
>  - a bitmap for the per-block uptodate status.  For worst case of a 64k
>page size system this bitmap needs to contain 128 bits.  For the
>typical 4k page size case it only needs 8 bits, although we still
>need a full unsigned long due to the way the atomic bitmap API works.
>  - two atomic_t counters are used to track the outstanding read and write
>counts
> 
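Based on those two items and the fs/iomap.c hunks below (the
include/linux/iomap.h hunk isn't quoted here), the new structure
presumably looks roughly like this, hung off page->private via
iomap_page_create()/to_iomap_page():

	struct iomap_page {
		atomic_t	read_count;	/* outstanding read I/O count */
		atomic_t	write_count;	/* outstanding write I/O count */
		DECLARE_BITMAP(uptodate, PAGE_SIZE / SECTOR_SIZE); /* per-block uptodate bits */
	};
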
> There is quite a bit of boilerplate code as the buffered I/O path uses
> various helper methods, but the actual code is very straight forward.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap.c| 247 +++---
>  include/linux/iomap.h |  31 ++
>  2 files changed, 260 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index debb859a8a14..ea746e0287f9 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
...
> @@ -104,6 +105,107 @@ iomap_sector(struct iomap *iomap, loff_t pos)
>   return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
>  }
>  
> +static struct iomap_page *
> +iomap_page_create(struct inode *inode, struct page *page)
> +{
> + struct iomap_page *iop = to_iomap_page(page);
> +
> + if (iop || i_blocksize(inode) == PAGE_SIZE)
> + return iop;
> +
> + iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
> + atomic_set(&iop->read_count, 0);
> + atomic_set(&iop->write_count, 0);
> + bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
> + set_page_private(page, (unsigned long)iop);
> + SetPagePrivate(page);

The buffer head implementation does a get/put page when the private
state is set. I'm not quite sure why that is tbh, but do you know
whether we need that here or not?

> + return iop;
> +}
> +
...
> @@ -142,18 +244,19 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>  {
>   struct iomap_readpage_ctx *ctx = data;
>   struct page *page = ctx->cur_page;
> - unsigned poff = pos & (PAGE_SIZE - 1);
> - unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
> + struct iomap_page *iop = iomap_page_create(inode, page);
>   bool is_contig = false;
> + loff_t orig_pos = pos;
> + unsigned poff, plen;
>   sector_t sector;
>  
> - /* we don't support blocksize < PAGE_SIZE quite yet: */
> - WARN_ON_ONCE(pos != page_offset(page));
> - WARN_ON_ONCE(plen != PAGE_SIZE);
> + iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
> + if (plen == 0)
> + goto done;
>  
>   if (iomap->type != IOMAP_MAPPED || pos >= i_size_read(inode)) {
>   zero_user(page, poff, plen);
> - SetPageUptodate(page);
> + iomap_set_range_uptodate(page, poff, plen);
>   goto done;
>   }
>  
> @@ -169,6 +272,14 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   is_contig = true;
>   }
>  
> + /*
> +  * If we start a new segment we need to increase the read count, and we
> +  * need to do so before submitting any previous full bio to make sure
> +  * that we don't prematurely unlock the page.
> +  */
> + if (iop)
> + atomic_inc(&iop->read_count);
> +
>   if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
>   if (ctx->bio)
>   submit_bio(ctx->bio);
> @@ -177,7 +288,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>  
>   __bio_add_page(ctx->bio, page, plen, poff);
>  done:
> - return plen;
> + return pos - orig_pos + plen;

A brief comment here (or above the adjust_read_range() call) to explain
the final length calculation would be helpful. E.g., it looks like
leading uptodate blocks are part of the read while trailing uptodate
blocks can be truncated by the above call.
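For example (if I'm reading iomap_adjust_read_range() right): with 1k
blocks on a 4k page where block 0 is already uptodate, blocks 1-2 are
not, and block 3 is uptodate, the helper advances pos by 1k and trims
plen to 2k, so the actor only reads blocks 1-2 but returns 3k (the
skipped leading block plus the submitted range), and the next iteration
then starts at block 3.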

>  }
>  
>  int
> @@ -188,8 +299,6 @@ iomap_readpage(struct page *page, const struct iomap_ops 
> *ops)
>   unsigned poff;
>   loff_t ret;
>  
> - WARN_ON_ONCE(page_has_buffers(page));
> -
>   for (poff = 0; poff < PAGE_SIZE; poff += ret) {
>   ret = iomap_apply(inode, page_offset(page) + poff,
>   PAGE_SIZE - poff, 0, ops, &ctx,
> @@ -295,6 +404,92 @@ iomap_readpages(struct add

Re: [RESEND PATCH V5 00/33] block: support multipage bvec

2018-05-25 Thread Jens Axboe
On 5/24/18 10:53 PM, Kent Overstreet wrote:
> On Fri, May 25, 2018 at 11:45:48AM +0800, Ming Lei wrote:
>> Hi,
>>
>> This patchset brings multipage bvec into block layer:
> 
> patch series looks sane to me. goddamn that's a lot of renaming.

Indeed... I actually objected to some of the segment -> page renaming,
but it's still in there. The foo2() temporary functions also concern me,
we all know there's nothing more permanent than a temporary fixup.

> Things are going to get interesting when we start sticking compound pages in 
> the
> page cache, there'll be some interesting questions of semantics to deal with
> then but I think getting this will only help w.r.t. plumbing that through and
> not dealing with 4k pages unnecessarily - but I think even if we were to 
> decide
> that merging in bio_add_page() is not the way to go when the upper layers are
> passing compound pages around already, this patch series helps because
> regardless at some point everything under generic_make_request() is going to
> have to deal with segments that are more than one page, and this patch series
> makes that happen. So incremental progress.
> 
> Jens, any objections to getting this in?

I like most of it, but I'd much rather get this way earlier in the series.
We're basically just one week away from the merge window, it needs more simmer
and testing time than that. On top of that, it hasn't received much review
yet.

So as far as I'm concerned, we can kick off the 4.19 block branch with
iterated versions of this patchset.

-- 
Jens Axboe



Re: [RESEND PATCH V5 33/33] block: document usage of bio iterator helpers

2018-05-25 Thread Randy Dunlap
On 05/24/2018 08:46 PM, Ming Lei wrote:
> Now that multipage bvecs are supported, some helpers may return data page
> by page while others return it segment by segment. This patch documents
> their usage to help us use them correctly.
> 
> Signed-off-by: Ming Lei 
> ---
>  Documentation/block/biovecs.txt | 32 
>  1 file changed, 32 insertions(+)
> 
> diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
> index b4d238b8d9fc..32a6643caeca 100644
> --- a/Documentation/block/biovecs.txt
> +++ b/Documentation/block/biovecs.txt
> @@ -117,3 +117,35 @@ Other implications:
> size limitations and the limitations of the underlying devices. Thus
> there's no need to define ->merge_bvec_fn() callbacks for individual block
> drivers.
> +
> +Usage of helpers:
> +=
> +
> +* The following helpers which name has suffix of "_all" can only be used on

   * The following helpers, whose names have the suffix "_all", can only be 
used on

> +non-BIO_CLONED bio, and ususally they are used by filesystem code, and driver

   usually

> +shouldn't use them becasue bio may have been splitted before they got to the

  because   split

> +driver:
> +
> + bio_for_each_segment_all()
> + bio_for_each_page_all()
> + bio_pages_all()
> + bio_first_bvec_all()
> + bio_first_page_all()
> + bio_last_bvec_all()
> + segment_for_each_page_all()
> +
> +* The following helpers iterate bio page by page, and the local variable of
> +'struct bio_vec' or the reference records single page io vector during the
> +itearation:

   iteration:

> +
> + bio_for_each_page()
> + bio_for_each_page_all()
> + segment_for_each_page_all()
> +
> +* The following helpers iterate bio segment by segment, and each segment may
> +include multiple physically contiguous pages, and the local variable of
> +'struct bio_vec' or the reference records multi page io vector during the
> +itearation:

   iteration:

> +
> + bio_for_each_segment()
> + bio_for_each_segment_all()
> 


-- 
~Randy


Re: [PATCH] Block: null_blk: add blocking description and remove lightnvm

2018-05-25 Thread Jens Axboe
On 5/25/18 8:40 AM, Liu Bo wrote:
> - The description of 'blocking' is missing in null_blk.txt
> 
> - The 'lightnvm' parameter has been removed in null_blk.c
> 
> This updates both in null_blk.txt.

Thanks, applied.

-- 
Jens Axboe



Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-25 Thread Christoph Hellwig
On Thu, May 24, 2018 at 10:06:59AM -0700, Kees Cook wrote:
> On Thu, May 24, 2018 at 1:00 AM, Christoph Hellwig  wrote:
> > On Wed, May 23, 2018 at 03:14:19PM -0600, Jens Axboe wrote:
> >> Ugh, so that would necessitate a change there too. As I said before,
> >> I don't really care where it lives. I know the SCSI folks seem bothered
> >> by moving it, but in reality, it's not like this stuff will likely ever
> >> really change. Of the two choices (select entire SCSI stack, or just move
> >> this little bit), I know what I would consider the saner option...
> >
> > Oh well.  How about something like this respin of Kees' series?
> >
> > http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/sense-cleanup
> 
> Does the CONFIG_PCMCIA in drivers/scsi/Makefile now get exposed in
> weird config cases?

That just includes drivers/scsi/pcmcia/Makefile, which only
has three conditional modules in it, so it should be fine.

> Otherwise, yeah, looks good to me. Thanks!

Can you pick up my tweaks and resend?


[PATCH] Block: null_blk: add blocking description and remove lightnvm

2018-05-25 Thread Liu Bo
- The description of 'blocking' is missing in null_blk.txt

- The 'lightnvm' parameter has been removed in null_blk.c

This updates both in null_blk.txt.

Signed-off-by: Liu Bo 
---
 Documentation/block/null_blk.txt | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
index 733927a7b501..07f147381f32 100644
--- a/Documentation/block/null_blk.txt
+++ b/Documentation/block/null_blk.txt
@@ -71,13 +71,16 @@ use_per_node_hctx=[0/1]: Default: 0
   1: The multi-queue block layer is instantiated with a hardware dispatch
  queue for each CPU node in the system.
 
-use_lightnvm=[0/1]: Default: 0
-  Register device with LightNVM. Requires blk-mq and CONFIG_NVM to be enabled.
-
 no_sched=[0/1]: Default: 0
   0: nullb* use default blk-mq io scheduler.
   1: nullb* doesn't use io scheduler.
 
+blocking=[0/1]: Default: 0
+  0: Register as a non-blocking blk-mq driver device.
+  1: Register as a blocking blk-mq driver device, null_blk will set
+ the BLK_MQ_F_BLOCKING flag, indicating that it sometimes/always
+ needs to block in its ->queue_rq() function.
+
 shared_tags=[0/1]: Default: 0
   0: Tag set is not shared.
   1: Tag set shared between devices for blk-mq. Only makes sense with
-- 
1.8.3.1



Re: [PATCH] bdi: Fix oops in wb_workfn()

2018-05-25 Thread Tetsuo Handa
Jan Kara wrote:
> > void delayed_work_timer_fn(struct timer_list *t)
> > {
> > struct delayed_work *dwork = from_timer(dwork, t, timer);
> > 
> > /* should have been called from irqsafe timer with irq already off */
> > __queue_work(dwork->cpu, dwork->wq, &dwork->work);
> > }
> > 
> > Then, wb_workfn() is after all scheduled even if we check for
> > WB_registered bit, isn't it?
> 
> It can be queued after WB_registered bit is cleared but it cannot be queued
> after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That function
> deletes the pending timer (the timer cannot be armed again because
> WB_registered is cleared) and queues what should be the last round of
> wb_workfn().

mod_delayed_work() deletes the pending timer but does not wait for an already
invoked timer handler to complete, because it uses del_timer() rather than
del_timer_sync(). Then, what happens if __queue_work() is executed almost
concurrently from two CPUs: one from mod_delayed_work(bdi_wq, &wb->dwork, 0) in
the wb_shutdown() path (which is called without spin_lock_bh(&wb->work_lock)),
and the other from the delayed_work_timer_fn() path (which is called without
checking the WB_registered bit under spin_lock_bh(&wb->work_lock))?

wb_wakeup_delayed() {
  spin_lock_bh(&wb->work_lock);
  if (test_bit(WB_registered, &wb->state)) // succeeds
    queue_delayed_work(bdi_wq, &wb->dwork, timeout) {
      queue_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, timeout) {
        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
                              work_data_bits(&wb->dwork.work))) { // succeeds
          __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, timeout) {
            add_timer(timer); // schedules delayed_work_timer_fn()
          }
        }
      }
    }
  spin_unlock_bh(&wb->work_lock);
}

delayed_work_timer_fn() {
  // del_timer() already returns false at this point because this timer
  // is already inside its handler. But suppose something here took long
  // enough for __queue_work() from the wb_shutdown() path to finish first?
  __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork.work) {
    insert_work(pwq, work, worklist, work_flags);
  }
}

wb_shutdown() {
  mod_delayed_work(bdi_wq, &wb->dwork, 0) {
    mod_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
      ret = try_to_grab_pending(&wb->dwork.work, true, &flags) {
        if (likely(del_timer(&wb->dwork.timer))) // fails because already inside delayed_work_timer_fn()
          return 1;
        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
                              work_data_bits(&wb->dwork.work))) // fails because already set by queue_delayed_work()
          return 0;
        // Returns 1 or -ENOENT after doing something?
      }
      if (ret >= 0)
        __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
          __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork.work) {
            insert_work(pwq, work, worklist, work_flags);
          }
        }
    }
  }
}