Re: [RFC] synchronous readpage for buffer_heads

2020-10-23 Thread Matthew Wilcox
On Fri, Oct 23, 2020 at 08:22:07AM +0200, Hannes Reinecke wrote:
> On 10/22/20 5:22 PM, Matthew Wilcox wrote:
> Hmm. You are aware, of course, that hch et al are working on replacing bhs
> with iomap, right?

$ git shortlog --author=Wilcox origin/master -- fs/iomap |head -1
Matthew Wilcox (Oracle) (17):

But actually, I don't see anyone working on a mass migration of
filesystems from either using BHs directly or using the mpage code to
using iomap.  I have a summer student for next summer who I'm going to
let loose on this problem, but I fear buffer_heads will be with us for
a long time to come.

I mean, who's going to convert reiserfs to iomap?
$ git log --no-merges --since=2015-01-01 origin/master fs/reiserfs |grep -c ^commit
130

Not exactly a thriving development community.  It doesn't even support
fallocate.

> So wouldn't it be more useful to concentrate on the iomap code, and ensure
> that _that_ is working correctly?

Did that one first, then did mpage_readpage(), now I've moved on to
block_read_full_page().  Now I'm going to go back and redo iomap
with everything I learned doing block_read_full_page().  It's going
to use blk_completion.


Re: [RFC] synchronous readpage for buffer_heads

2020-10-23 Thread Hannes Reinecke

On 10/22/20 5:22 PM, Matthew Wilcox wrote:

> I'm working on making readpage synchronous so that it can actually return
> errors instead of futilely setting PageError.  Something that's common
> between most of the block based filesystems is the need to submit N
> I/Os and wait for them to all complete (some filesystems don't support
> sub-page block size, so they don't have this problem).
> 
> I ended up coming up with a fairly nice data structure which I've called
> the blk_completion.  It waits for 'n' events to happen, then wakes the
> task that cares, unless the task has got bored and wandered off to do
> something else.
> 
> block_read_full_page() then uses this data structure to submit 'n' buffer
> heads and wait for them to all complete.  The fscrypt code doesn't work
> in this scheme, so I bailed on that for now.  I have ideas for fixing it,
> but they can wait.
> 
> Obviously this all needs documentation, but I'd like feedback on the
> idea before I do that.  I have given it some light testing, but there
> aren't too many filesystems left that use block_read_full_page() so I
> haven't done a proper xfstests run.

Hmm. You are aware, of course, that hch et al are working on replacing 
bhs with iomap, right?
So wouldn't it be more useful to concentrate on the iomap code, and 
ensure that _that_ is working correctly?


Cheers,

Hannes
--
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de                   +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


[RFC] synchronous readpage for buffer_heads

2020-10-22 Thread Matthew Wilcox
I'm working on making readpage synchronous so that it can actually return
errors instead of futilely setting PageError.  Something that's common
between most of the block based filesystems is the need to submit N
I/Os and wait for them to all complete (some filesystems don't support
sub-page block size, so they don't have this problem).

I ended up coming up with a fairly nice data structure which I've called
the blk_completion.  It waits for 'n' events to happen, then wakes the
task that cares, unless the task has got bored and wandered off to do
something else.

block_read_full_page() then uses this data structure to submit 'n' buffer
heads and wait for them to all complete.  The fscrypt code doesn't work
in this scheme, so I bailed on that for now.  I have ideas for fixing it,
but they can wait.

Obviously this all needs documentation, but I'd like feedback on the
idea before I do that.  I have given it some light testing, but there
aren't too many filesystems left that use block_read_full_page() so I
haven't done a proper xfstests run.

diff --git a/include/linux/bio.h b/include/linux/bio.h
index f254bc79bb3a..0bde05f5548c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -814,4 +814,15 @@ static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
 	bio->bi_opf |= REQ_NOWAIT;
 }
 
+struct blk_completion {
+	struct task_struct *cmpl_task;
+	spinlock_t cmpl_lock;
+	int cmpl_count;
+	blk_status_t cmpl_status;
+};
+
+void blk_completion_init(struct blk_completion *, int n);
+int blk_completion_sub(struct blk_completion *, blk_status_t status, int n);
+int blk_completion_wait_killable(struct blk_completion *);
+
 #endif /* __LINUX_BIO_H */
diff --git a/block/blk-core.c b/block/blk-core.c
index 10c08ac50697..2892246f2176 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1900,6 +1900,67 @@ void blk_io_schedule(void)
 }
 EXPORT_SYMBOL_GPL(blk_io_schedule);
 
+void blk_completion_init(struct blk_completion *cmpl, int n)
+{
+	spin_lock_init(&cmpl->cmpl_lock);
+	cmpl->cmpl_count = n;
+	cmpl->cmpl_task = current;
+	cmpl->cmpl_status = BLK_STS_OK;
+}
+
+int blk_completion_sub(struct blk_completion *cmpl, blk_status_t status, int n)
+{
+	int ret = 0;
+
+	spin_lock_bh(&cmpl->cmpl_lock);
+	if (cmpl->cmpl_status == BLK_STS_OK && status != BLK_STS_OK)
+		cmpl->cmpl_status = status;
+	cmpl->cmpl_count -= n;
+	BUG_ON(cmpl->cmpl_count < 0);
+	if (cmpl->cmpl_count == 0) {
+		if (cmpl->cmpl_task)
+			wake_up_process(cmpl->cmpl_task);
+		else
+			ret = -EINTR;
+	}
+	spin_unlock_bh(&cmpl->cmpl_lock);
+	if (ret < 0)
+		kfree(cmpl);
+	return ret;
+}
+
+int blk_completion_wait_killable(struct blk_completion *cmpl)
+{
+	int err = 0;
+
+	for (;;) {
+		set_current_state(TASK_KILLABLE);
+		spin_lock_bh(&cmpl->cmpl_lock);
+		if (cmpl->cmpl_count == 0)
+			break;
+		spin_unlock_bh(&cmpl->cmpl_lock);
+		blk_io_schedule();
+		if (fatal_signal_pending(current)) {
+			spin_lock_bh(&cmpl->cmpl_lock);
+			cmpl->cmpl_task = NULL;
+			if (cmpl->cmpl_count != 0) {
+				spin_unlock_bh(&cmpl->cmpl_lock);
+				cmpl = NULL;
+			}
+			err = -ERESTARTSYS;
+			break;
+		}
+	}
+	set_current_state(TASK_RUNNING);
+	if (cmpl) {
+		spin_unlock_bh(&cmpl->cmpl_lock);
+		err = blk_status_to_errno(cmpl->cmpl_status);
+		kfree(cmpl);
+	}
+
+	return err;
+}
+
 int __init blk_dev_init(void)
 {
BUILD_BUG_ON(REQ_OP_LAST >= (1 << REQ_OP_BITS));
diff --git a/fs/buffer.c b/fs/buffer.c
index 1d5337517dcd..ccb90081117c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2249,6 +2249,87 @@ int block_is_partially_uptodate(struct page *page, unsigned long from,
 }
 EXPORT_SYMBOL(block_is_partially_uptodate);
 
+static void readpage_end_bio(struct bio *bio)
+{
+	struct bio_vec *bvec;
+	struct page *page;
+	struct buffer_head *bh;
+	int i, nr = 0;
+
+	bio_for_each_bvec_all(bvec, bio, i) {
+		size_t offset = 0;
+		size_t max = bvec->bv_offset + bvec->bv_len;
+
+		page = bvec->bv_page;
+		bh = page_buffers(page);
+
+		for (offset = 0; offset < max; offset += bh->b_size,
+				bh = bh->b_this_page) {
+			if (offset < bvec->bv_offset)
+				continue;
+			BUG_ON(bh_offset(bh) != offset);
+			nr++;
+			if (unlikely(bio_flagged(bio, BIO_QUIET)))
+				set_bit(BH_Quiet,