Re: [Cluster-devel] [PATCH 7/9] gfs2: update ctime when quota is updated

2023-07-05 Thread Andreas Grünbacher
On Wed, Jul 5, 2023 at 11:51 PM Jeff Layton wrote:
>
> On Wed, 2023-07-05 at 22:25 +0200, Andreas Gruenbacher wrote:
> > On Mon, Jun 12, 2023 at 12:36 PM Jeff Layton  wrote:
> > > On Fri, 2023-06-09 at 18:44 +0200, Andreas Gruenbacher wrote:
> > > > Jeff,
> > > >
> > > > On Fri, Jun 9, 2023 at 2:50 PM Jeff Layton  wrote:
> > > > > Signed-off-by: Jeff Layton 
> > > > > ---
> > > > >  fs/gfs2/quota.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> > > > > index 1ed17226d9ed..6d283e071b90 100644
> > > > > --- a/fs/gfs2/quota.c
> > > > > +++ b/fs/gfs2/quota.c
> > > > > @@ -869,7 +869,7 @@ static int gfs2_adjust_quota(struct gfs2_inode *ip, loff_t loc,
> > > > > size = loc + sizeof(struct gfs2_quota);
> > > > > if (size > inode->i_size)
> > > > > i_size_write(inode, size);
> > > > > -   inode->i_mtime = inode->i_atime = current_time(inode);
> > > > > +   inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> > > >
> > > > I don't think we need to worry about the ctime of the quota inode as
> > > > that inode is internal to the filesystem only.
> > > >
> > >
> > > Thanks Andreas.  I'll plan to drop this patch from the series for now.
> > >
> > > Does updating the mtime and atime here serve any purpose, or should
> > > those also be removed? If you plan to keep the a/mtime updates then I'd
> > > still suggest updating the ctime for consistency's sake. It shouldn't
> > > cost anything extra to do so since you're dirtying the inode below
> > > anyway.
> >
> > Yes, good point actually, we should keep things consistent for simplicity.
> >
> > Would you add this back in if you do another posting?
> >
>
> I just re-posted the other patches in this as part of the ctime accessor
> conversion. If I post again though, I can resurrect the gfs2 patch. If
> not, we can do a follow-on fix later.

Sure, not a big deal.

> Since we're discussing it, it may be more correct to remove the atime
> update there. gfs2_adjust_quota sounds like a "modify" operation, not a
> "read", so I don't see a reason to update the atime.
>
> In general, the only time you only want to set the atime, ctime and
> mtime in lockstep is when the inode is brand new.

Yes, that makes sense, too.

Thanks,
Andreas
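The timestamp rule Jeff describes above (read-like operations touch atime, data modifications touch mtime and ctime, and only a brand-new inode sets all three in lockstep) can be sketched as follows. All names here are invented for illustration; this is not kernel API.

```c
#include <assert.h>
#include <time.h>

/* Illustrative operation classes; not kernel code. */
enum op_class { OP_READ, OP_DATA_WRITE, OP_META_CHANGE, OP_CREATE };

struct stamps { time_t atime, mtime, ctime; };

static void touch_timestamps(struct stamps *ts, enum op_class op, time_t now)
{
	switch (op) {
	case OP_READ:		/* read-like ops: atime only */
		ts->atime = now;
		break;
	case OP_DATA_WRITE:	/* data modification: mtime and ctime */
		ts->mtime = now;
		ts->ctime = now;
		break;
	case OP_META_CHANGE:	/* metadata change (chmod, quota adjust): ctime */
		ts->ctime = now;
		break;
	case OP_CREATE:		/* brand-new inode: all three in lockstep */
		ts->atime = ts->mtime = ts->ctime = now;
		break;
	}
}
```

Under this scheme, gfs2_adjust_quota would fall into the modify bucket, which is why dropping the atime update there makes sense.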



Re: [Cluster-devel] [PATCH v2 07/14] buffer: Convert block_page_mkwrite() to use a folio

2023-06-06 Thread Andreas Grünbacher
On Wed, Jun 7, 2023 at 12:48 AM Matthew Wilcox (Oracle) wrote:
> If any page in a folio is dirtied, dirty the entire folio.  Removes a
> number of hidden calls to compound_head() and references to page->mapping
> and page->index.
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/buffer.c | 27 +--
>  1 file changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d8c2c000676b..f34ed29b1085 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2564,38 +2564,37 @@ EXPORT_SYMBOL(block_commit_write);
>  int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
>  get_block_t get_block)
>  {
> -   struct page *page = vmf->page;
> +   struct folio *folio = page_folio(vmf->page);
> struct inode *inode = file_inode(vma->vm_file);
> unsigned long end;
> loff_t size;
> int ret;
>
> -   lock_page(page);
> +   folio_lock(folio);
> size = i_size_read(inode);
> -   if ((page->mapping != inode->i_mapping) ||
> -   (page_offset(page) > size)) {
> +   if ((folio->mapping != inode->i_mapping) ||
> +   (folio_pos(folio) > size)) {

This should probably also be 'folio_pos(folio) >= size', but this was
wrong before this patch already.
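The off-by-one Andreas points out can be shown with plain integer arithmetic (simplified stand-ins for folio_pos() and i_size; not the kernel API): a folio whose first byte sits at or beyond EOF contains no valid file data, so the truncation check wants ">=", not ">".

```c
#include <assert.h>
#include <stdbool.h>

/* folio_pos/i_size are plain integers here, standing in for the
 * kernel's folio_pos(folio) and i_size_read(inode). */
static bool folio_beyond_eof(long long folio_pos, long long i_size)
{
	/* ">" would wrongly let a folio starting exactly at i_size
	 * through, even though it holds no valid data. */
	return folio_pos >= i_size;
}
```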

> /* We overload EFAULT to mean page got truncated */
> ret = -EFAULT;
> goto out_unlock;
> }
>
> -   /* page is wholly or partially inside EOF */
> -   if (((page->index + 1) << PAGE_SHIFT) > size)
> -   end = size & ~PAGE_MASK;
> -   else
> -   end = PAGE_SIZE;
> +   end = folio_size(folio);
> +   /* folio is wholly or partially inside EOF */
> +   if (folio_pos(folio) + end > size)
> +   end = size - folio_pos(folio);
>
> -   ret = __block_write_begin(page, 0, end, get_block);
> +   ret = __block_write_begin_int(folio, 0, end, get_block, NULL);
> if (!ret)
> -   ret = block_commit_write(page, 0, end);
> +   ret = block_commit_write(&folio->page, 0, end);
>
> if (unlikely(ret < 0))
> goto out_unlock;
> -   set_page_dirty(page);
> -   wait_for_stable_page(page);
> +   folio_set_dirty(folio);
> +   folio_wait_stable(folio);
> return 0;
>  out_unlock:
> -   unlock_page(page);
> +   folio_unlock(folio);
> return ret;
>  }
>  EXPORT_SYMBOL(block_page_mkwrite);
> --
> 2.39.2
>

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v2 06/14] buffer: Make block_write_full_page() handle large folios correctly

2023-06-06 Thread Andreas Grünbacher
On Wed, Jun 7, 2023 at 12:41 AM Matthew Wilcox (Oracle) wrote:
> Keep the interface as struct page, but work entirely on the folio
> internally.  Removes several PAGE_SIZE assumptions and removes
> some references to page->index and page->mapping.
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> Tested-by: Bob Peterson 
> Reviewed-by: Bob Peterson 
> ---
>  fs/buffer.c | 22 ++
>  1 file changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 4d518df50fab..d8c2c000676b 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2678,33 +2678,31 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
> struct writeback_control *wbc)
>  {
> struct folio *folio = page_folio(page);
> -   struct inode * const inode = page->mapping->host;
> +   struct inode * const inode = folio->mapping->host;
> loff_t i_size = i_size_read(inode);
> -   const pgoff_t end_index = i_size >> PAGE_SHIFT;
> -   unsigned offset;
>
> -   /* Is the page fully inside i_size? */
> -   if (page->index < end_index)
> +   /* Is the folio fully inside i_size? */
> +   if (folio_pos(folio) + folio_size(folio) <= i_size)
> return __block_write_full_folio(inode, folio, get_block, wbc,
>end_buffer_async_write);
>
> -   /* Is the page fully outside i_size? (truncate in progress) */
> -   offset = i_size & (PAGE_SIZE-1);
> -   if (page->index >= end_index+1 || !offset) {
> +   /* Is the folio fully outside i_size? (truncate in progress) */
> +   if (folio_pos(folio) > i_size) {

The folio is also fully outside i_size if folio_pos(folio) == i_size.
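With Andreas's correction applied, block_write_full_page() distinguishes three cases, which can be sketched with plain integers (stand-ins for folio_pos(), folio_size(), and i_size_read(); not kernel code):

```c
#include <assert.h>

enum eof_class { FULLY_INSIDE, STRADDLES_EOF, FULLY_OUTSIDE };

/* pos = folio_pos(folio), len = folio_size(folio), i_size = file size */
static enum eof_class classify(long long pos, long long len, long long i_size)
{
	if (pos + len <= i_size)
		return FULLY_INSIDE;	/* write the whole folio */
	if (pos >= i_size)
		return FULLY_OUTSIDE;	/* truncate in progress: skip it */
	return STRADDLES_EOF;		/* zero the tail past EOF, then write */
}
```

Note the middle case uses ">=", so a folio starting exactly at i_size counts as fully outside.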

> folio_unlock(folio);
> return 0; /* don't care */
> }
>
> /*
> -* The page straddles i_size.  It must be zeroed out on each and every
> +* The folio straddles i_size.  It must be zeroed out on each and every
>  * writepage invocation because it may be mmapped.  "A file is mapped
>  * in multiples of the page size.  For a file that is not a multiple of
> -* the  page size, the remaining memory is zeroed when mapped, and
> +* the page size, the remaining memory is zeroed when mapped, and
>  * writes to that region are not written out to the file."
>  */
> -   zero_user_segment(page, offset, PAGE_SIZE);
> +   folio_zero_segment(folio, offset_in_folio(folio, i_size),
> +   folio_size(folio));
> return __block_write_full_folio(inode, folio, get_block, wbc,
> -   end_buffer_async_write);
> +   end_buffer_async_write);
>  }
>  EXPORT_SYMBOL(block_write_full_page);
>
> --
> 2.39.2
>

Thanks,
Andreas



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Andreas Grünbacher
On Fri, May 26, 2023 at 1:20 AM Kent Overstreet wrote:
> On Fri, May 26, 2023 at 12:25:31AM +0200, Andreas Grünbacher wrote:
> > On Tue, May 23, 2023 at 6:28 PM Christoph Hellwig wrote:
> > > On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a 
> > > > different
> > > > way (by prefaulting pages from the iter before grabbing the problematic
> > > > lock and then disabling page faults for the iomap_dio_rw() call). I 
> > > > guess
> > > > we should somehow unify these schemes so that we don't have two 
> > > > mechanisms
> > > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> > > >
> > > > Also good that you've written a fstest for this, that is definitely a useful
> > > > addition, although I suspect GFS2 guys added a test for this not so long
> > > > ago when testing their stuff. Maybe they have a pointer handy?
> > >
> > > generic/708 is the btrfs version of this.
> > >
> > > But I think all of the file systems that have this deadlock are actually
> > > fundamentally broken because they have a messed-up locking hierarchy
> > > where page faults take the same lock that is held over the direct I/O
> > > operation.  And the right thing is to fix this.  I have work in progress
> > > for btrfs, and something similar should apply to gfs2, with the added
> > > complication that it probably means a revision to their network
> > > protocol.
> >
> > We do disable page faults, and there can be deadlocks in page fault
> > handlers while no page faults are allowed.
> >
> > I'm roughly aware of the locking hierarchy that other filesystems use,
> > and that's something we want to avoid because of two reasons: (1) it
> > would be an incompatible change, and (2) we want to avoid cluster-wide
> > locking operations as much as possible because they are very slow.
> >
> > These kinds of locking conflicts are so rare in practice that the
> > theoretical inefficiency of having to retry the operation doesn't
> > matter.
>
> Would you be willing to expand on that? I'm wondering if this would
> simplify things for gfs2, but you mention locking hierarchy being an
> incompatible change - how does that work?

Oh, it's just that gfs2 uses one dlm lock per inode to control access
to that inode. In the code, this is called the "inode glock" ---
glocks being an abstraction above dlm locks --- but it boils down to
dlm locks in the end. An additional layer of locking will only work
correctly if all cluster nodes use the new locks consistently, so old
cluster nodes will become incompatible. Those kinds of changes are
hard.

But the additional lock taking would also hurt performance, forever,
and I'd really like to avoid taking that hit.

It may not be obvious to everyone, but allowing page faults during
reads and writes (i.e., while holding dlm locks) can lead to
distributed deadlocks. We cannot just pretend to be a local
filesystem.
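The retry scheme Andreas refers to, from commit b01b2d72da25 ("gfs2: Fix mmap + page fault deadlocks for direct I/O"), has roughly the following shape: prefault the user buffer, take the glock, run the I/O with page faults disabled, and retry if a fault would have been needed (because reclaim evicted the buffer in between). Everything below is a userspace stand-in with invented names, not the kernel code.

```c
#include <assert.h>

static int buf_resident;   /* is the user buffer currently faulted in? */
static int reclaim_once;   /* test hook: evict the buffer once, mid-I/O */
static int retries;

static int dio_faults_disabled(void)
{
	if (reclaim_once) {		/* memory reclaim races with the I/O */
		reclaim_once = 0;
		buf_resident = 0;
	}
	return buf_resident ? 0 : -14;	/* -EFAULT: fault needed but disabled */
}

static int dio_write(void)
{
	int ret;

	for (;;) {
		buf_resident = 1;	/* fault_in_iov_iter_readable() */
		/* take glock, pagefault_disable() ... */
		ret = dio_faults_disabled();
		/* ... pagefault_enable(), drop glock */
		if (ret != -14)
			return ret;
		retries++;		/* buffer was evicted: prefault and retry */
	}
}
```

As the thread notes, such retries are rare in practice, so the cost of the loop is mostly theoretical.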

Thanks,
Andreas

> > > I'm absolutely not in favour of adding workarounds for these kinds of locking
> > > problems to the core kernel.  I already feel bad for allowing the
> > > small workaround in iomap for btrfs, as just fixing the locking back
> > > then would have avoided massive ratholing.
> >
> > Please let me know when those btrfs changes are in a presentable shape ...
>
> I would also be curious to know what btrfs needs and what the approach
> is there.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Andreas Grünbacher
On Thu, May 25, 2023 at 10:56 AM Jan Kara wrote:
> On Tue 23-05-23 12:49:06, Kent Overstreet wrote:
> > > > No, that's definitely handled (and you can see it in the code I linked),
> > > > and I wrote a torture test for fstests as well.
> > >
> > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > > way (by prefaulting pages from the iter before grabbing the problematic
> > > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > > we should somehow unify these schemes so that we don't have two mechanisms
> > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> >
> > Oof, that sounds a bit sketchy. What happens if the dio call passes in
> > an address from the same address space?
>
> If we submit direct IO that uses mapped file F at offset O as a buffer for
> direct IO from file F, offset O, it will currently livelock in an
> indefinite retry loop. It should rather return error or fall back to
> buffered IO. But that should be fixable. Andreas?

Yes, I guess. Thanks for the heads-up.

Andreas

> But if the buffer and direct IO range does not overlap, it will just
> happily work - iomap_dio_rw() invalidates only the range direct IO is done
> to.
>
> > What happens if we race with the pages we faulted in being evicted?
>
> We fault them in again and retry.
>
> > > Also good that you've written a fstest for this, that is definitely a useful
> > > addition, although I suspect GFS2 guys added a test for this not so long
> > > ago when testing their stuff. Maybe they have a pointer handy?
> >
> > More tests more good.
> >
> > So if we want to lift this scheme to the VFS layer, we'd start by
> > replacing the lock you added (grepping for it, the name escapes me) with
> > a different type of lock - two_state_shared_lock in my code, it's like a
> > rw lock except writers don't exclude other writers. That way the DIO
> > path can use it without singlethreading writes to a single file.
>
> Yes, I've noticed that you are introducing in bcachefs a lock with very
> similar semantics to mapping->invalidate_lock, just with this special lock
> type. What I'm kind of worried about with two_state_shared_lock as
> implemented in bcachefs is the fairness. AFAICS so far if someone is e.g.
> heavily faulting pages on a file, direct IO to that file can be starved
> indefinitely. That is IMHO not a good thing and I would not like to use
> this type of lock in VFS until this problem is resolved. But it should be
> fixable e.g. by introducing some kind of deadline for a waiter after which
> it will block acquisitions of the other lock state.
>
> Honza
> --
> Jan Kara 
> SUSE Labs, CR
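The two_state_shared_lock semantics discussed above (holders of the same state share the lock, while the opposite state must wait for it to go idle) can be sketched with a mutex and condition variable. This is a minimal illustration, not the bcachefs implementation, and it deliberately inherits the fairness problem Jan describes: a steady stream of one state can starve the other indefinitely.

```c
#include <assert.h>
#include <pthread.h>

struct two_state_lock {
	pthread_mutex_t mtx;
	pthread_cond_t idle;
	int state;	/* 0 or 1; meaningful only while count > 0 */
	int count;	/* number of current holders */
};

static void ts_init(struct two_state_lock *l)
{
	pthread_mutex_init(&l->mtx, NULL);
	pthread_cond_init(&l->idle, NULL);
	l->state = 0;
	l->count = 0;
}

static void ts_lock(struct two_state_lock *l, int state)
{
	pthread_mutex_lock(&l->mtx);
	/* Wait only if the lock is held in the *other* state. */
	while (l->count > 0 && l->state != state)
		pthread_cond_wait(&l->idle, &l->mtx);
	l->state = state;
	l->count++;
	pthread_mutex_unlock(&l->mtx);
}

static void ts_unlock(struct two_state_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	if (--l->count == 0)
		pthread_cond_broadcast(&l->idle);
	pthread_mutex_unlock(&l->mtx);
}
```

A deadline for waiters, as Jan suggests, would mean blocking new same-state acquirers once an opposite-state waiter has waited too long; that is not implemented here.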



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Andreas Grünbacher
On Tue, May 23, 2023 at 6:28 PM Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> >
> > Also good that you've written a fstest for this, that is definitely a useful
> > addition, although I suspect GFS2 guys added a test for this not so long
> > ago when testing their stuff. Maybe they have a pointer handy?
>
> generic/708 is the btrfs version of this.
>
> But I think all of the file systems that have this deadlock are actually
> fundamentally broken because they have a messed-up locking hierarchy
> where page faults take the same lock that is held over the direct I/O
> operation.  And the right thing is to fix this.  I have work in progress
> for btrfs, and something similar should apply to gfs2, with the added
> complication that it probably means a revision to their network
> protocol.

We do disable page faults, and there can be deadlocks in page fault
handlers while no page faults are allowed.

I'm roughly aware of the locking hierarchy that other filesystems use,
and that's something we want to avoid because of two reasons: (1) it
would be an incompatible change, and (2) we want to avoid cluster-wide
locking operations as much as possible because they are very slow.

These kinds of locking conflicts are so rare in practice that the
theoretical inefficiency of having to retry the operation doesn't
matter.

> I'm absolutely not in favour of adding workarounds for these kinds of locking
> problems to the core kernel.  I already feel bad for allowing the
> small workaround in iomap for btrfs, as just fixing the locking back
> then would have avoided massive ratholing.

Please let me know when those btrfs changes are in a presentable shape ...

Thanks,
Andreas



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Andreas Grünbacher
On Tue, May 23, 2023 at 3:37 PM Jan Kara wrote:
> On Wed 10-05-23 02:18:45, Kent Overstreet wrote:
> > On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> > > On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > > > From: Kent Overstreet 
> > > >
> > > > This is used by bcachefs to fix a page cache coherency issue with
> > > > O_DIRECT writes.
> > > >
> > > > Also relevant: mapping->invalidate_lock, see below.
> > > >
> > > > O_DIRECT writes (and other filesystem operations that modify file data
> > > > while bypassing the page cache) need to shoot down ranges of the page
> > > > cache - and additionally, need locking to prevent those pages from
> > > > pulled back in.
> > > >
> > > > But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> > > > and the page fault handler will need to take that same lock - this is a
> > > > classic recursive deadlock if userspace has mmaped the file they're DIO
> > > > writing to and uses those pages for the buffer to write from, and it's a
> > > > lock ordering deadlock in general.
> > > >
> > > > Thus we need a way to signal from the dio code to the page fault handler
> > > > when we already are holding the pagecache add lock on an address space -
> > > > this patch just adds a member to task_struct for this purpose. For now
> > > > only bcachefs is implementing this locking, though it may be moved out
> > > > of bcachefs and made available to other filesystems in the future.
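The task_struct field described in the commit message above works roughly like this: the DIO path records which address_space it holds the pagecache-add lock on, and the fault handler checks for that mapping instead of deadlocking on itself. The sketch below is a userspace analogue with a thread-local variable standing in for the task_struct member; the types and helpers are illustrative, not the bcachefs code.

```c
#include <assert.h>
#include <stddef.h>

struct address_space { int dummy; };	/* illustrative stand-in */

/* analogue of task_struct->faults_disabled_mapping */
static __thread struct address_space *faults_disabled_mapping;

/* Returns 0 on success, -1 when faulting would self-deadlock
 * (the real code would then fall back to a non-faulting path). */
static int fault_in(struct address_space *mapping)
{
	if (faults_disabled_mapping == mapping)
		return -1;	/* DIO already holds this mapping's lock */
	/* ... normal path: take the pagecache-add lock, map the page ... */
	return 0;
}

static int dio_write(struct address_space *mapping,
		     struct address_space *user_buf_mapping)
{
	int ret;

	/* take pagecache-add lock on `mapping` ... */
	faults_disabled_mapping = mapping;
	ret = fault_in(user_buf_mapping);	/* e.g. via get_user_pages() */
	faults_disabled_mapping = NULL;
	/* ... drop the lock */
	return ret;
}
```

The recursive (AA) case is the one this field detects directly; the AB/BA case Jan raised is what the later discussion in this thread is about.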
> > >
> > > It would be nice to have at least a link to the code that's actually using
> > > the field you are adding.
> >
> > Bit of a trick to link to a _later_ patch in the series from a commit
> > message, but...
> >
> > https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
> > https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454
>
> Thanks and I'm sorry for the delay.
>
> > > Also I think we were already through this discussion [1] and we ended up
> > > agreeing that your scheme actually solves only the AA deadlock but a
> > > malicious userspace can easily create AB BA deadlock by running direct IO
> > > to file A using mapped file B as a buffer *and* direct IO to file B using
> > > mapped file A as a buffer.
> >
> > No, that's definitely handled (and you can see it in the code I linked),
> > and I wrote a torture test for fstests as well.
>
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
>
> Also good that you've written a fstest for this, that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

Ah yes, that's xfstests commit d3cbdabf ("generic: Test page faults
during read and write").

Thanks,
Andreas

> Honza
> --
> Jan Kara 
> SUSE Labs, CR



Re: [Cluster-devel] [GIT PULL] gfs2 fix for v6.3-rc4

2023-03-23 Thread Andreas Grünbacher
On Thu, Mar 23, 2023 at 10:54 PM Linus Torvalds wrote:
> On Thu, Mar 23, 2023 at 11:45 AM Andreas Gruenbacher wrote:
> >
> > From: Linus Torvalds 
>
> Wat?

Hmm, that's weird, you ended up in the From: header by accident. Sorry for that.

> >   git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git gfs2-v6.3-rc3-fix
>
> -ENOSUCHTAG
>
> > for you to fetch changes up to 260595b439776c473cc248f0de63fe78d964d849:
>
> .. and no such commit available in any other tag or branch either.
>
> Did you forget to push out?

Yes, I did forget to push the tag out, and I missed the warning that "git
request-pull" spat out. Sorry again.

I've pushed the tag out now; should I resend the pull request?

Thanks,
Andreas



Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler

2023-01-18 Thread Andreas Grünbacher
On Wed, Jan 18, 2023 at 8:04 PM Darrick J. Wong wrote:
>
> On Tue, Jan 17, 2023 at 11:21:38PM -0800, Christoph Hellwig wrote:
> > On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> > > I don't have any objections to pulling everything except patches 8 and
> > > 10 for testing this week.
> >
> > That would be great.  I now have a series to return the ERR_PTR
> > from __filemap_get_folio which will cause a minor conflict, but
> > I think that's easy enough for Linus to handle.
>
> Ok, done.
>
> > >
> > > 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
> > > don't think it does, but OTOH zone pointer management might complicate
> > > that.
> >
> > Adding Damien.
> >
> > > 2. How about porting the writeback iomap validation to use this
> > > mechanism?  (I suspect Dave might already be working on this...)
> >
> > What is "this mechanism"?  Do you mean the here removed ->iomap_valid
> > ?   writeback calls into ->map_blocks for every block while under the
> > folio lock, so the validation can (and for XFS currently is) done
> > in that.  Moving it out into a separate method with extra indirect
> > functiona call overhead and interactions between the methods seems
> > like a retrograde step to me.
>
> Sorry, I should've been more specific -- can xfs writeback use the
> validity cookie in struct iomap and thereby get rid of struct
> xfs_writepage_ctx entirely?

Already asked and answered in the same thread:

https://lore.kernel.org/linux-fsdevel/20230109225453.gq1971...@dread.disaster.area/

> > > 2. Do we need to revalidate mappings for directio writes?  I think the
> > > answer is no (for xfs) because the ->iomap_begin call will allocate
> > > whatever blocks are needed and truncate/punch/reflink block on the
> > > iolock while the directio writes are pending, so you'll never end up
> > > with a stale mapping.
> >
> > Yes.
>
> Er... yes as in "Yes, we *do* need to revalidate directio writes", or
> "Yes, your reasoning is correct"?
>
> --D



Re: [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper

2023-01-10 Thread Andreas Grünbacher
On Tue, Jan 10, 2023 at 9:52 AM Christoph Hellwig wrote:
> On Mon, Jan 09, 2023 at 01:46:42PM +0100, Andreas Gruenbacher wrote:
> > We can handle that by adding a new IOMAP_NOCREATE iterator flag and
> > checking for that in iomap_get_folio().  Your patch then turns into
> > the below.
>
> Exactly.  And as I already pointed out in reply to Dave's original
> patch what we really should be doing is returning an ERR_PTR from
> __filemap_get_folio instead of reverse-engineering the expected
> error code.
>
> The only big question is if we should apply Dave's patch first as a bug
> fix before this series, and I suspect we should do that.

Sounds fine. I assume Dave is going to send an update.

Thanks,
Andreas



Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler

2023-01-09 Thread Andreas Grünbacher
On Mon, Jan 9, 2023 at 11:58 PM Dave Chinner wrote:
> On Mon, Jan 09, 2023 at 07:45:27PM +0100, Andreas Gruenbacher wrote:
> > On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner  wrote:
> > > On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> > > > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > > > handler and validating the mapping there.
> > > >
> > > > Signed-off-by: Andreas Gruenbacher 
> > >
> > > I think this is wrong.
> > >
> > > The ->iomap_valid() function handles a fundamental architectural
> > > issue with cached iomaps: the iomap can become stale at any time
> > > whilst it is in use by the iomap core code.
> > >
> > > The current problem it solves in the iomap_write_begin() path has to
> > > do with writeback and memory reclaim races over unwritten extents,
> > > but the general case is that we must be able to check the iomap
> > > at any point in time to assess it's validity.
> > >
> > > Indeed, we also have this same "iomap valid check" functionality in the
> > > writeback code as cached iomaps can become stale due to racing
> > > writeback, truncation, etc. But you wouldn't know it by looking at the iomap
> > > writeback code - this is currently hidden by XFS by embedding
> > > the checks into the iomap writeback ->map_blocks function.
> > >
> > > That is, the first thing that xfs_map_blocks() does is check if the
> > > cached iomap is valid, and if it is valid it returns immediately and
> > > the iomap writeback code uses it without question.
> > >
> > > The reason that this is embedded like this is that the iomap did not
> > > have a validity cookie field in it, and so the validity information
> > > was wrapped around the outside of the iomap_writepage_ctx and the
> > > filesystem has to decode it from that private wrapping structure.
> > >
> > > However, the validity information in the structure wrapper is
> > > identical to the iomap validity cookie,
> >
> > Then could that part of the xfs code be converted to use
> > iomap->validity_cookie so that struct iomap_writepage_ctx can be
> > eliminated?
>
> Yes, that is the plan.
>
> >
> > > and so the direction I've
> > > been working towards is to replace this implicit, hidden cached
> > > iomap validity check with an explicit ->iomap_valid call and then
> > > only call ->map_blocks if the validity check fails (or is not
> > > implemented).
> > >
> > > I want to use the same code for all the iomap validity checks in all
> > > the iomap core code - this is an iomap issue, the conditions where
> > > we need to check for iomap validity are different for depending on
> > > the iomap context being run, and the checks are not necessarily
> > > dependent on first having locked a folio.
> > >
> > > Yes, the validity cookie needs to be decoded by the filesystem, but
> > > that does not dictate where the validity checking needs to be done
> > > by the iomap core.
> > >
> > > Hence I think removing ->iomap_valid is a big step backwards for the
> > > iomap core code - the iomap core needs to be able to formally verify
> > > the iomap is valid at any point in time, not just at the point in
> > > time a folio in the page cache has been locked...
> >
> > We don't need to validate an iomap "at any time". It's two specific
> > places in the code in which we need to check, and we're not going to
> > end up with ten more such places tomorrow.
>
> Not immediately, but that doesn't change the fact this is not a
> filesystem specific issue - it's an inherent characteristic of
> cached iomaps and unsynchronised extent state changes that occur
> outside exclusive inode->i_rwsem IO context (e.g. in writeback and
> IO completion contexts).
>
> Racing mmap + buffered writes can expose these state changes as the
> iomap bufferred write IO path is not serialised against the iomap
> mmap IO path except via folio locks. Hence a mmap page fault can
> invalidate a cached buffered write iomap by causing a hole ->
> unwritten, hole -> delalloc or hole -> written conversion in the
> middle of the buffered write range. The buffered write still has a
> hole mapping cached for that entire range, and it is now incorrect.
>
> If the mmap write happens to change extent state at the trailing
> edge of a partial buffered write, data corruption will occur if we
> race just right with writeback and memory reclaim. I'm pretty sure
> that this corruption can be reproduced on gfs2 if we try hard enough
> - generic/346 triggers the mmap/write race condition, all that is
> needed from that point is for writeback and reclaiming pages at
> exactly the right time...
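The validity-cookie scheme that both sides of this discussion rely on can be reduced to a small sketch: sample a per-inode sequence number when the iomap is created, bump it on any extent-state change, and recheck under the folio lock before using the cached mapping. Plain integers stand in for the kernel structures here; the function names are invented for illustration.

```c
#include <assert.h>

struct demo_inode { unsigned long if_seq; };	/* extent-state sequence */
struct demo_iomap { unsigned long validity_cookie; };

/* ->iomap_begin time: cache the current sequence in the iomap */
static void get_mapping(struct demo_inode *ip, struct demo_iomap *iomap)
{
	iomap->validity_cookie = ip->if_seq;
}

/* any extent-state change (writeback, a racing mmap fault, ...) */
static void change_extent_state(struct demo_inode *ip)
{
	ip->if_seq++;
}

/* under the folio lock: is the cached mapping still usable? */
static int iomap_valid(struct demo_inode *ip, struct demo_iomap *iomap)
{
	return iomap->validity_cookie == ip->if_seq;
}
```

When validation fails, the iomap core marks the iteration stale (IOMAP_F_STALE) and remaps, rather than acting on the outdated mapping.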
>
> > I'd prefer to keep those
> > filesystem internals in the filesystem specific code instead of
> > exposing them to the iomap layer. But that's just me ...
>
> My point is that there is nothing XFS specific about these stale
> cached iomap race conditions, nor is it specifically related to
> folio locking. The folio locking inversions w.r.t. iomap caching and
> the interacti

Re: [Cluster-devel] [PATCH v5 7/9] iomap/xfs: Eliminate the iomap_valid handler

2023-01-04 Thread Andreas Grünbacher
On Wed, Jan 4, 2023 at 7:00 PM Darrick J. Wong wrote:
> On Sat, Dec 31, 2022 at 04:09:17PM +0100, Andreas Gruenbacher wrote:
> > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > handler and validating the mapping there.
> >
> > Signed-off-by: Andreas Gruenbacher 
> > ---
> >  fs/iomap/buffered-io.c | 25 +
> >  fs/xfs/xfs_iomap.c | 37 ++---
> >  include/linux/iomap.h  | 22 +-
> >  3 files changed, 36 insertions(+), 48 deletions(-)
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 4f363d42dbaf..df6fca11f18c 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -630,7 +630,7 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
> >   const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> >   const struct iomap *srcmap = iomap_iter_srcmap(iter);
> >   struct folio *folio;
> > - int status = 0;
> > + int status;
> >
> >   BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
> >   if (srcmap != &iter->iomap)
> > @@ -646,27 +646,12 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
> >   folio = page_ops->get_folio(iter, pos, len);
> >   else
> >   folio = iomap_get_folio(iter, pos);
> > - if (IS_ERR(folio))
> > - return PTR_ERR(folio);
> > -
> > - /*
> > -  * Now we have a locked folio, before we do anything with it we need to
> > -  * check that the iomap we have cached is not stale. The inode extent
> > -  * mapping can change due to concurrent IO in flight (e.g.
> > -  * IOMAP_UNWRITTEN state can change and memory reclaim could have
> > -  * reclaimed a previously partially written page at this index after IO
> > -  * completion before this write reaches this file offset) and hence we
> > -  * could do the wrong thing here (zero a page range incorrectly or fail
> > -  * to zero) and corrupt data.
> > -  */
> > - if (page_ops && page_ops->iomap_valid) {
> > - bool iomap_valid = page_ops->iomap_valid(iter->inode,
> > - &iter->iomap);
> > - if (!iomap_valid) {
> > + if (IS_ERR(folio)) {
> > + if (folio == ERR_PTR(-ESTALE)) {
> >   iter->iomap.flags |= IOMAP_F_STALE;
> > - status = 0;
> > - goto out_unlock;
> > + return 0;
> >   }
> > + return PTR_ERR(folio);
>
> I wonder if this should be reworked a bit to reduce indenting:
>
> if (PTR_ERR(folio) == -ESTALE) {
> iter->iomap.flags |= IOMAP_F_STALE;
> return 0;
> }
> if (IS_ERR(folio))
> return PTR_ERR(folio);
>
> But I don't have any strong opinions about that.

This is a relatively hot path; the compiler would have to convert the
flattened version back to the nested version for the best possible
result to avoid a redundant check. Not sure it would always do that.
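The two shapes being compared can be tried out in userspace by reimplementing the kernel's pointer-error encoding; the macros below mirror the kernel's, but are local reimplementations for the example. The flattened form Darrick suggests is semantically identical to the nested one in the patch, since an -ESTALE pointer also satisfies IS_ERR().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace reimplementations of the kernel's ERR_PTR helpers. */
#define MAX_ERRNO	4095
#define ERR_PTR(err)	((void *)(intptr_t)(err))
#define PTR_ERR(ptr)	((long)(intptr_t)(ptr))
#define IS_ERR(ptr)	((uintptr_t)(ptr) >= (uintptr_t)-MAX_ERRNO)

/* Returns 0 for the stale case (setting *stale), a negative errno for
 * other errors, and 1 when a real folio pointer came back. */
static int write_begin_status(void *folio, int *stale)
{
	/* Darrick's flattened form: */
	if (PTR_ERR(folio) == -ESTALE) {
		*stale = 1;		/* i.e. set IOMAP_F_STALE */
		return 0;
	}
	if (IS_ERR(folio))
		return PTR_ERR(folio);
	return 1;			/* got a locked folio */
}
```

Whether the compiler emits one comparison or two for either spelling is exactly the codegen question Andreas raises; the observable behavior is the same.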

> >   }
> >
> >   if (pos + len > folio_pos(folio) + folio_size(folio))
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 669c1bc5c3a7..d0bf99539180 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -62,29 +62,44 @@ xfs_iomap_inode_sequence(
> >   return cookie | READ_ONCE(ip->i_df.if_seq);
> >  }
> >
> > -/*
> > - * Check that the iomap passed to us is still valid for the given offset and
> > - * length.
> > - */
> > -static bool
> > -xfs_iomap_valid(
> > - struct inode*inode,
> > - const struct iomap  *iomap)
> > +static struct folio *
> > +xfs_get_folio(
> > + struct iomap_iter   *iter,
> > + loff_t  pos,
> > + unsignedlen)
> >  {
> > + struct inode*inode = iter->inode;
> > + struct iomap*iomap = &iter->iomap;
> >   struct xfs_inode*ip = XFS_I(inode);
> > + struct folio*folio;
> >
> > + folio = iomap_get_folio(iter, pos);
> > + if (IS_ERR(folio))
> > + return folio;
> > +
> > + /*
> > +  * Now that we have a locked folio, we need to check that the iomap we
> > +  * have cached is not stale.  The inode extent mapping can change due 
> > to
> > +  * concurrent IO in flight (e.g., IOMAP_UNWRITTEN state can change and
> > +  * memory reclaim could have reclaimed a previously partially written
> > +  * page at this index after IO completion before this write reaches
> > +  * this file offset) and hence we could do the wrong thing here (zero 
> > a
> > +  * page range incorrectly or fail to zero) and corrupt data.
> > +  */
> >   if (iomap->validity_cookie !=
> >   xfs_iomap_inode_sequence(ip, iomap->flags)) {

Re: [Cluster-devel] [PATCH v5 3/9] iomap: Rename page_done handler to put_folio

2023-01-04 Thread Andreas Grünbacher
On Wed, Jan 4, 2023 at 6:47 PM Darrick J. Wong  wrote:
>
> On Sat, Dec 31, 2022 at 04:09:13PM +0100, Andreas Gruenbacher wrote:
> > The ->page_done() handler in struct iomap_page_ops is now somewhat
> > misnamed in that it mainly deals with unlocking and putting a folio, so
> > rename it to ->put_folio().
> >
> > Signed-off-by: Andreas Gruenbacher 
> > ---
> >  fs/gfs2/bmap.c |  4 ++--
> >  fs/iomap/buffered-io.c |  4 ++--
> >  include/linux/iomap.h  | 10 +-
> >  3 files changed, 9 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> > index 46206286ad42..0c041459677b 100644
> > --- a/fs/gfs2/bmap.c
> > +++ b/fs/gfs2/bmap.c
> > @@ -967,7 +967,7 @@ static int gfs2_iomap_page_prepare(struct inode *inode, 
> > loff_t pos,
> >   return gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
> >  }
> >
> > -static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
> > +static void gfs2_iomap_put_folio(struct inode *inode, loff_t pos,
> >unsigned copied, struct folio *folio)
> >  {
> >   struct gfs2_trans *tr = current->journal_info;
> > @@ -994,7 +994,7 @@ static void gfs2_iomap_page_done(struct inode *inode, 
> > loff_t pos,
> >
> >  static const struct iomap_page_ops gfs2_iomap_page_ops = {
> >   .page_prepare = gfs2_iomap_page_prepare,
> > - .page_done = gfs2_iomap_page_done,
> > + .put_folio = gfs2_iomap_put_folio,
> >  };
> >
> >  static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index e13d5694e299..2a9bab4f3c79 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -580,8 +580,8 @@ static void iomap_put_folio(struct iomap_iter *iter, 
> > loff_t pos, size_t ret,
> >  {
> >   const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> >
> > - if (page_ops && page_ops->page_done) {
> > - page_ops->page_done(iter->inode, pos, ret, folio);
> > + if (page_ops && page_ops->put_folio) {
> > + page_ops->put_folio(iter->inode, pos, ret, folio);
> >   } else if (folio) {
> >   folio_unlock(folio);
> >   folio_put(folio);
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 743e2a909162..10ec36f373f4 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -126,18 +126,18 @@ static inline bool iomap_inline_data_valid(const 
> > struct iomap *iomap)
> >
> >  /*
> >   * When a filesystem sets page_ops in an iomap mapping it returns, 
> > page_prepare
> > - * and page_done will be called for each page written to.  This only 
> > applies to
> > + * and put_folio will be called for each page written to.  This only 
> > applies to
>
> "...for each folio written to."

Ah, yes.

> With that fixed,
> Reviewed-by: Darrick J. Wong 
>
> --D
>
>
> >   * buffered writes as unbuffered writes will not typically have pages

And here, it should be "folios" as well I'd say.

Thanks,
Andreas

> >   * associated with them.
> >   *
> > - * When page_prepare succeeds, page_done will always be called to do any
> > - * cleanup work necessary.  In that page_done call, @folio will be NULL if 
> > the
> > - * associated folio could not be obtained.  When folio is not NULL, 
> > page_done
> > + * When page_prepare succeeds, put_folio will always be called to do any
> > + * cleanup work necessary.  In that put_folio call, @folio will be NULL if 
> > the
> > + * associated folio could not be obtained.  When folio is not NULL, 
> > put_folio
> >   * is responsible for unlocking and putting the folio.
> >   */
> >  struct iomap_page_ops {
> >   int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len);
> > - void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
> > + void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
> >   struct folio *folio);
> >
> >   /*
> > --
> > 2.38.1
> >



Re: [Cluster-devel] [RFC v3 1/7] fs: Add folio_may_straddle_isize helper

2022-12-23 Thread Andreas Grünbacher
On Fri, Dec 23, 2022 at 4:06 PM Christoph Hellwig  wrote:
> On Fri, Dec 16, 2022 at 04:06:20PM +0100, Andreas Gruenbacher wrote:
> > Add a folio_may_straddle_isize() helper as a replacement for
> > pagecache_isize_extended() when we have a locked folio.
>
> I find the naming very confusing.  Any good reason to not follow
> the naming of pagecache_isize_extended and call it
> folio_isize_extended?

A good reason for a different name is that
folio_may_straddle_isize() requires a locked folio, while
pagecache_isize_extended() will fail if the folio is still locked. So
this doesn't follow the usual "replace 'page' with 'folio'" pattern.

> > Use the new helper in generic_write_end(), iomap_write_end(),
> > ext4_write_end(), and ext4_journalled_write_end().
>
> Please split this into a patch per caller in addition to the one
> adding the helper, and write commit logs explaining the rationale
> for the helper.  The obvious ones I'm trying to guess are that
> the new helper avoids a page cache radix tree lookup and a lock
> page/folio cycle, but I'd rather hear that from the horse's mouth
> in the commit log.

Yes, that's what the horse says.

> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2164,16 +2164,15 @@ int generic_write_end(struct file *file, struct 
> > address_space *mapping,
> >* But it's important to update i_size while still holding page lock:
> >* page writeout could otherwise come in and zero beyond i_size.
> >*/
> > - if (pos + copied > inode->i_size) {
> > + if (pos + copied > old_size) {
>
> This is an unrelated and undocumented (but useful) change.  Please split
> it out as well.
>
> > + * This function must be called while we still hold i_rwsem - this not only
> > + * makes sure i_size is stable but also that userspace cannot observe the 
> > new
> > + * i_size value before we are prepared to handle mmap writes there.
>
> Please add a lockdep_assert_held_write to enforce that.
>
> > +void folio_may_straddle_isize(struct inode *inode, struct folio *folio,
> > +   loff_t old_size, loff_t start)
> > +{
> > + unsigned int blocksize = i_blocksize(inode);
> > +
> > + if (round_up(old_size, blocksize) >= round_down(start, blocksize))
> > + return;
> > +
> > + /*
> > +  * See clear_page_dirty_for_io() for details why folio_set_dirty()
> > +  * is needed.
> > +  */
> > + if (folio_mkclean(folio))
> > + folio_set_dirty(folio);
>
> Should pagecache_isize_extended be rewritten to use this helper,
> i.e. turn this into a factoring out of a helper?

I'm not really sure about that. The boundary conditions in the two
functions are not identical. I think the logic in
folio_may_straddle_isize() is sufficient for the
extending-write-under-folio-lock case, but I'd still need confirmation
for that. Whether the same logic would also be enough in
pagecache_isize_extended() is less clear to me.

> > +EXPORT_SYMBOL(folio_may_straddle_isize);
>
> Please make this an EXPORT_SYMBOL_GPL just like folio_mkclean.

Thanks,
Andreas



Re: [Cluster-devel] [RFC v3 4/7] iomap: Add iomap_folio_prepare helper

2022-12-23 Thread Andreas Grünbacher
On Fri, Dec 23, 2022 at 4:22 PM Christoph Hellwig  wrote:
> > +struct folio *iomap_folio_prepare(struct iomap_iter *iter, loff_t pos)
> > +{
> > + unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | 
> > FGP_NOFS;
> > +
> > + if (iter->flags & IOMAP_NOWAIT)
> > + fgp |= FGP_NOWAIT;
> > +
> > + return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> > + fgp, mapping_gfp_mask(iter->inode->i_mapping));
> > +}
> > +EXPORT_SYMBOL(iomap_folio_prepare);
>
> I'd name this __iomap_get_folio to match __filemap_get_folio.

I was looking at it from the filesystem point of view: this helper is
meant to be used in ->folio_prepare() handlers, so
iomap_folio_prepare() seemed to be a better name than
__iomap_get_folio().

> And all iomap exports are EXPORT_SYMBOL_GPL.

Sure.

Thanks,
Andreas



Re: [Cluster-devel] [RFC v3 2/7] iomap: Add iomap_folio_done helper

2022-12-23 Thread Andreas Grünbacher
On Fri, Dec 23, 2022 at 4:12 PM Christoph Hellwig  wrote:
> On Fri, Dec 16, 2022 at 04:06:21PM +0100, Andreas Gruenbacher wrote:
> > +static void iomap_folio_done(struct iomap_iter *iter, loff_t pos, size_t 
> > ret,
> > + struct folio *folio)
> > +{
> > + const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> > +
> > + if (folio)
> > + folio_unlock(folio);
> > + if (page_ops && page_ops->page_done)
> > + page_ops->page_done(iter->inode, pos, ret, &folio->page);
> > + if (folio)
> > + folio_put(folio);
> > +}
>
> How is the folio dereference going to work if folio is NULL?

'&folio->page' is effectively a type cast, not a dereference. I
realize iomap_folio_done() as introduced here is not pretty, but it's
only an intermediate step and the ugliness goes away later in this
series.

> That being said, I really wonder if the current API is the right way to
> go.  Can't we just have a ->get_folio method with the same signature as
> __filemap_get_folio, and then do the __filemap_get_folio from the file
> system and avoid the page/folio == NULL clean path entirely?  Then on
> the done side move the unlock and put into the done method as well.

Yes, this is what happens later in this series (as you've seen by now).

> >   if (!folio) {
> >   status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
> > - goto out_no_page;
> > + iomap_folio_done(iter, pos, 0, NULL);
> > + return status;
> >   }
> >
> >   /*
> > @@ -656,13 +670,9 @@ static int iomap_write_begin(struct iomap_iter *iter, 
> > loff_t pos,
> >   return 0;
> >
> >  out_unlock:
> > - folio_unlock(folio);
> > - folio_put(folio);
> > + iomap_folio_done(iter, pos, 0, folio);
> >   iomap_write_failed(iter->inode, pos, len);
> >
> > -out_no_page:
> > - if (page_ops && page_ops->page_done)
> > - page_ops->page_done(iter->inode, pos, 0, NULL);
> >   return status;
>
> But for the current version I don't really understand why the error
> unwinding changes here.

Currently, we have this order of operations in iomap_write_begin():

  folio_unlock() // folio_put() // iomap_write_failed() // ->page_done()

and this order in iomap_write_end():

  folio_unlock() // ->page_done() // folio_put() // iomap_write_failed()

The unwinding in iomap_write_begin() works because this is the trivial
case in which nothing happens to the page. We might just as well use
the same order of operations there as in iomap_write_end() though, and
when you switch to that, this is what you get.

Thank you for the review.

Andreas



Re: [Cluster-devel] [PATCH -next 4/5] gfs2: fix possible null-ptr-deref when parsing param

2022-10-24 Thread Andreas Grünbacher
On Sun, Oct 23, 2022 at 6:46 PM Hawkins Jiawei  wrote:
> According to commit "vfs: parse: deal with zero length string value",
> the kernel will set param->string to a null pointer in vfs_parse_fs_string()
> if the fs string has zero length.
>
> Yet the problem is that gfs2_parse_param() will dereference
> param->string without checking whether it is a null pointer, which may
> trigger a null-ptr-deref bug.
>
> This patch solves it by adding a sanity check on param->string
> in gfs2_parse_param().

Yes, but then it dereferences param->string in the error message. That
won't help.

> Reported-by: syzbot+da97a57c5b742d05d...@syzkaller.appspotmail.com
> Tested-by: syzbot+da97a57c5b742d05d...@syzkaller.appspotmail.com
> Cc: agrue...@redhat.com
> Cc: cluster-devel@redhat.com
> Cc: linux-ker...@vger.kernel.org
> Cc: rpete...@redhat.com
> Cc: syzkaller-b...@googlegroups.com
> Signed-off-by: Hawkins Jiawei 
> ---
>  fs/gfs2/ops_fstype.c | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index c0cf1d2d0ef5..934746f18c25 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -1446,12 +1446,18 @@ static int gfs2_parse_param(struct fs_context *fc, 
> struct fs_parameter *param)
>
> switch (o) {
> case Opt_lockproto:
> +   if (!param->string)
> +   goto bad_val;
> strscpy(args->ar_lockproto, param->string, GFS2_LOCKNAME_LEN);
> break;
> case Opt_locktable:
> +   if (!param->string)
> +   goto bad_val;
> strscpy(args->ar_locktable, param->string, GFS2_LOCKNAME_LEN);
> break;
> case Opt_hostdata:
> +   if (!param->string)
> +   goto bad_val;
> strscpy(args->ar_hostdata, param->string, GFS2_LOCKNAME_LEN);
> break;
> case Opt_spectator:
> @@ -1535,6 +1541,10 @@ static int gfs2_parse_param(struct fs_context *fc, 
> struct fs_parameter *param)
> return invalfc(fc, "invalid mount option: %s", param->key);
> }
> return 0;
> +
> +bad_val:
> +   return invalfc(fc, "Bad value '%s' for mount option '%s'\n",
> +  param->string, param->key);
>  }
>
>  static int gfs2_reconfigure(struct fs_context *fc)
> --
> 2.25.1
>

Thanks,
Andreas



Re: [Cluster-devel] remove iomap_writepage v2

2022-08-10 Thread Andreas Grünbacher
On Wed, Aug 10, 2022 at 10:57 PM Matthew Wilcox  wrote:
> On Mon, Aug 01, 2022 at 11:31:50AM -0400, Johannes Weiner wrote:
> > XFS hasn't had a ->writepage call for a while. After LSF I internally
> > tested dropping btrfs' callback, and the results looked good: no OOM
> > kills with dirty/writeback pages remaining, performance parity. Then I
> > went on vacation and Christoph beat me to the patch :)
>
> To avoid duplicating work with you or Christoph ... it seems like the
> plan is to kill ->writepage entirely soon, so there's no point in me
> doing a sweep of all the filesystems to convert ->writepage to
> ->write_folio, correct?
>
> I assume the plan for filesystems which have a writepage but don't have
> a ->writepages (9p, adfs, affs, bfs, ecryptfs, gfs2, hostfs, jfs, minix,
> nilfs2, ntfs, ocfs2, reiserfs, sysv, ubifs, udf, ufs, vboxsf) is to give
> them a writepages, modelled on iomap_writepages().  Seems that adding
> a block_writepages() might be a useful thing for me to do?

Hmm, gfs2 does have gfs2_writepages() and gfs2_jdata_writepages()
functions, so it should probably be fine.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH 2/4] gfs2: remove ->writepage

2022-07-11 Thread Andreas Grünbacher
On Mon, Jul 11, 2022 at 6:16 AM Christoph Hellwig  wrote:
> ->writepage is only used for single page writeback from memory reclaim,
> and not called at all for cgroup writeback.  Follow the lead of XFS
> and remove ->writepage and rely entirely on ->writepages.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/gfs2/aops.c | 26 --
>  1 file changed, 26 deletions(-)
>
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index 106e90a365838..0240a1a717f56 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -81,31 +81,6 @@ static int gfs2_get_block_noalloc(struct inode *inode, 
> sector_t lblock,
> return 0;
>  }
>
> -/**
> - * gfs2_writepage - Write page for writeback mappings
> - * @page: The page
> - * @wbc: The writeback control
> - */
> -static int gfs2_writepage(struct page *page, struct writeback_control *wbc)
> -{
> -   struct inode *inode = page->mapping->host;
> -   struct gfs2_inode *ip = GFS2_I(inode);
> -   struct gfs2_sbd *sdp = GFS2_SB(inode);
> -   struct iomap_writepage_ctx wpc = { };
> -
> -   if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
> -   goto out;
> -   if (current->journal_info)
> -   goto redirty;
> -   return iomap_writepage(page, wbc, &wpc, &gfs2_writeback_ops);
> -
> -redirty:
> -   redirty_page_for_writepage(wbc, page);
> -out:
> -   unlock_page(page);
> -   return 0;
> -}
> -
>  /**
>   * gfs2_write_jdata_page - gfs2 jdata-specific version of 
> block_write_full_page
>   * @page: The page to write
> @@ -765,7 +740,6 @@ bool gfs2_release_folio(struct folio *folio, gfp_t 
> gfp_mask)
>  }
>
>  static const struct address_space_operations gfs2_aops = {
> -   .writepage = gfs2_writepage,
> .writepages = gfs2_writepages,
> .read_folio = gfs2_read_folio,
> .readahead = gfs2_readahead,
> --
> 2.30.2
>

This is looking fine, and it has survived a moderate amount of testing already.

Tested-by: Andreas Gruenbacher 
Reviewed-by: Andreas Gruenbacher 

It should be possible to remove the .writepage operation in
gfs2_jdata_aops as well, but I must be overlooking something because
that actually breaks things.

Thanks,
Andreas



Re: [Cluster-devel] [5.15 REGRESSION] iomap: Fix inline extent handling in iomap_readpage

2021-11-11 Thread Andreas Grünbacher
On Thu, Nov 11, 2021 at 8:26 AM Christoph Hellwig  wrote:
> The iomap mapping sizes are read-only to iomap for a good reason.  You
> can't just break that design.

Right. We can stop iomap_iter by returning 0 now though; see v2.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

2021-10-28 Thread Andreas Grünbacher
On Thu, Oct 28, 2021 at 11:21 PM Catalin Marinas  wrote:
> One last try on this path before I switch to the other options.
>
> On Wed, Oct 27, 2021 at 02:14:48PM -0700, Linus Torvalds wrote:
> > On Wed, Oct 27, 2021 at 12:13 PM Catalin Marinas
> >  wrote:
> > > As an alternative, you mentioned earlier that a per-thread fault status
> > > was not feasible on x86 due to races. Was this only for the hw poison
> > > case? I think the uaccess is slightly different.
> >
> > It's not x86-specific, it's very generic.
> >
> > If we set some flag in the per-thread status, we'll need to be careful
> > about not overwriting it if we then have a subsequent NMI that _also_
> > takes a (completely unrelated) page fault - before we then read the
> > per-thread flag.
> >
> > Think 'perf' and fetching backtraces etc.
> >
> > Note that the NMI page fault can easily also be a pointer coloring
> > fault on arm64, for exactly the same reason that whatever original
> > copy_from_user() code was. So this is not a "oh, pointer coloring
> > faults are different". They have the same re-entrancy issue.
> >
> > And both the "pagefault_disable" and "fault happens in interrupt
> > context" cases are also the exact same 'faulthandler_disabled()'
> > thing. So even at fault time they look very similar.
>
> They do look fairly similar but we should have the information in the
> fault handler to distinguish: not a page fault (pte permission or p*d
> translation), in_task(), user address, fixup handler. But I agree the
> logic looks fragile.
>
> I think for nested contexts we can save the uaccess fault state on
> exception entry, restore it on return. Or (needs some thinking on
> atomicity) save it in a local variable. The high-level API would look
> something like:
>
> unsigned long uaccess_flags;/* we could use TIF_ flags */
>
> uaccess_flags = begin_retriable_uaccess();
> copied = copy_page_from_iter_atomic(...);
> retry = end_retriable_uaccess(uaccess_flags);
> ...
>
> if (!retry)
> break;
>
> I think we'd need a TIF flag to mark the retriable region and another to
> track whether a non-recoverable fault occurred. It needs prototyping.
>
> Anyway, if you don't like this approach, I'll look at error codes being
> returned but rather than changing all copy_from_user() etc., introduce a
> new API that returns different error codes depending on the fault
> (e.g -EFAULT vs -EACCES). We already have copy_from_user_nofault(), we'd
> need something for the iov_iter stuff to use in the fs code.

We won't need any of that on the filesystem read and write paths. The
two cases there are buffered and direct I/O:

* In the buffered I/O case, the copying happens with page faults
disabled, at a byte granularity. If that returns a short result, we
need to enable page faults, check if the exact address that failed
still fails (in which case we have a sub-page fault), fault in the
pages, disable page faults again, and repeat. No probing for sub-page
faults beyond the first byte of the fault-in address is needed.
Functions fault_in_{readable,writeable} implicitly have this behavior;
for fault_in_safe_writeable() the choice we have is to either add
probing of the first byte for sub-page faults to this function or
force callers to do that probing separately. At this point, I'd vote
for the former.

* In the direct I/O case, the copying happens while we're holding page
references, so the only page faults that can occur during copying are
sub-page faults. When iomap_dio_rw or its legacy counterpart is called
with page faults disabled, we need to make sure that the caller can
distinguish between page faults triggered during
bio_iov_iter_get_pages() and during the copying, but that's a separate
problem. (At the moment, when iomap_dio_rw fails with -EFAULT, the
caller *cannot* distinguish between a bio_iov_iter_get_pages failure
and a failure during synchronous copying, but that could be fixed by
returning unique error codes from iomap_dio_rw.)

So as far as I can see, the only problematic case we're left with is
copying bigger than byte-size chunks with page faults disabled when we
don't know whether the underlying pages are resident or not. My guess
would be that in this case, if the copying fails, it would be
perfectly acceptable to explicitly probe the entire chunk for sub-page
faults.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks

2021-09-03 Thread Andreas Grünbacher
Sorry for the HTML post: I'm travelling for the next two weeks and I'll
mostly be in read-only mode. Bob can help out during that time if
necessary, though.

On Fri, Sep 3, 2021 at 6:20 PM Linus Torvalds  wrote:

> On Wed, Sep 1, 2021 at 12:53 PM Andreas Gruenbacher 
> wrote:
> >
> > So there's a minor merge conflict between Christoph's iomap_iter
> > conversion and this patch queue now, and I should probably clarify the
> > description of "iomap: Add done_before argument to iomap_dio_rw" that
> > Darrick ran into. Then there are the user copy issues that Al has
> > pointed out. Fixing those will create superficial conflicts with this
> > patch queue, but probably nothing serious.
> >
> > So how should I proceed: do you expect a v8 of this patch queue on top
> > of the current mainline?
>
> So if you rebase for fixes, it's going to be a "next merge window" thing
> again.
>
> Personally, I'm ok with the series as is, and the conflict isn't an
> issue. So I'd take it as is, and then people can fix up niggling
> issues later.
>

That sounds fine to me. We can clarify things as Darrick has suggested in a
separate commit.

Thanks,
Andreas


Re: [Cluster-devel] [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

2021-08-27 Thread Andreas Grünbacher
On Fri, Aug 27, 2021 at 11:33 PM Darrick J. Wong  wrote:
> On Fri, Aug 27, 2021 at 10:15:11PM +0200, Andreas Gruenbacher wrote:
> > On Fri, Aug 27, 2021 at 8:30 PM Darrick J. Wong  wrote:
> > > On Fri, Aug 27, 2021 at 06:49:23PM +0200, Andreas Gruenbacher wrote:
> > > > Add a done_before argument to iomap_dio_rw that indicates how much of
> > > > the request has already been transferred.  When the request succeeds, we
> > > > report that done_before additional bytes were transferred.  This is
> > > > useful for finishing a request asynchronously when part of the request
> > > > has already been completed synchronously.
> > > >
> > > > We'll use that to allow iomap_dio_rw to be used with page faults
> > > > disabled: when a page fault occurs while submitting a request, we
> > > > synchronously complete the part of the request that has already been
> > > > submitted.  The caller can then take care of the page fault and call
> > > > iomap_dio_rw again for the rest of the request, passing in the number of
> > > > bytes already transferred.
> > > >
> > > > Signed-off-by: Andreas Gruenbacher 
> > > > ---
> > > >  fs/btrfs/file.c   |  5 +++--
> > > >  fs/ext4/file.c|  5 +++--
> > > >  fs/gfs2/file.c|  4 ++--
> > > >  fs/iomap/direct-io.c  | 11 ---
> > > >  fs/xfs/xfs_file.c |  6 +++---
> > > >  fs/zonefs/super.c |  4 ++--
> > > >  include/linux/iomap.h |  4 ++--
> > > >  7 files changed, 23 insertions(+), 16 deletions(-)
> > > >
> > >
> > > 
> > >
> > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > index ba88fe51b77a..dcf9a2b4381f 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -31,6 +31,7 @@ struct iomap_dio {
> > > >   atomic_tref;
> > > >   unsignedflags;
> > > >   int error;
> > > > + size_t  done_before;
> > > >   boolwait_for_completion;
> > > >
> > > >   union {
> > > > @@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > > >   if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> > > >   ret = generic_write_sync(iocb, ret);
> > > >
> > > > + if (ret > 0)
> > > > + ret += dio->done_before;
> > >
> > > Pardon my ignorance since this is the first time I've had a crack at
> > > this patchset, but why is it necessary to carry the "bytes copied"
> > > count from the /previous/ iomap_dio_rw call all the way through to dio
> > > completion of the current call?
> >
> > Consider the following situation:
> >
> >  * A user submits an asynchronous read request.
> >
> >  * The first page of the buffer is in memory, but the following
> >pages are not. This isn't uncommon for consecutive reads
> >into freshly allocated memory.
> >
> >  * iomap_dio_rw writes into the first page. Then it
> >hits the next page which is missing, so it returns a partial
> >result, synchronously.
> >
> >  * We then fault in the remaining pages and call iomap_dio_rw
> >for the rest of the request.
> >
> >  * The rest of the request completes asynchronously.
> >
> > Does that answer your question?
>
> No, because you totally ignored the second question:
>
> If the directio operation succeeds even partially and the PARTIAL flag
> is set, won't that push the iov iter ahead by however many bytes
> completed?

Yes, exactly as it should.

> We already finished the IO for the first page, so the second attempt
> should pick up where it left off, i.e. the second page.

Yes, so what's the question?

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion

2021-08-20 Thread Andreas Grünbacher
On Fri, Aug 20, 2021 at 3:49 PM Steven Whitehouse  wrote:
> We always used to manage to avoid holding fs locks when copying to/from 
> userspace
> to avoid these complications.

I realize the intent, but that goal has never actually been achieved.
Direct I/O has *always* been calling get_user_pages() while holding
the inode glock in deferred mode.

The situation is slightly different for buffered I/O, but we have the
same problem there at least since switching to iomap. (And it's a
fundamental property of iomap that we want to hold the inode glock
across multi-page mappings, not an implementation deficit.)

Here's a slightly outdated version of a test case that makes all
versions of gfs2 prior to this patch queue unhappy:
https://lore.kernel.org/fstests/20210531152604.240462-1-agrue...@redhat.com/

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v3 7/7] gfs2: Fix mmap + page fault deadlocks for direct I/O

2021-07-26 Thread Andreas Grünbacher
On Mon, Jul 26, 2021 at 7:10 PM Jan Kara  wrote:

> On Fri 23-07-21 22:58:40, Andreas Gruenbacher wrote:
> > Also disable page faults during direct I/O requests and implement the
> same kind
> > of retry logic as in the buffered I/O case.
> >
> > Direct I/O requests differ from buffered I/O requests in that they use
> > bio_iov_iter_get_pages for grabbing page references and faulting in pages
> > instead of triggering real page faults.  Those manual page faults can be
> > disabled with the iocb->noio flag.
> >
> > Signed-off-by: Andreas Gruenbacher 
> > ---
> >  fs/gfs2/file.c | 34 +-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> > index f66ac7f56f6d..7986f3be69d2 100644
> > --- a/fs/gfs2/file.c
> > +++ b/fs/gfs2/file.c
> > @@ -782,21 +782,41 @@ static ssize_t gfs2_file_direct_read(struct kiocb
> *iocb, struct iov_iter *to,
> >   struct file *file = iocb->ki_filp;
> >   struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
> >   size_t count = iov_iter_count(to);
> > + size_t written = 0;
> >   ssize_t ret;
> >
> > + /*
> > +  * In this function, we disable page faults when we're holding the
> > +  * inode glock while doing I/O.  If a page fault occurs, we drop
> the
> > +  * inode glock, fault in the pages manually, and then we retry.
> Other
> > +  * than in gfs2_file_read_iter, iomap_dio_rw can trigger implicit
> as
> > +  * well as manual page faults, and we need to disable both kinds
> > +  * separately.
> > +  */
> > +
> >   if (!count)
> >   return 0; /* skip atime */
> >
> >   gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
> > +retry:
> >   ret = gfs2_glock_nq(gh);
> >   if (ret)
> >   goto out_uninit;
> >
> > + pagefault_disable();
>
> Is there any use in pagefault_disable() here? iomap_dio_rw() should not
> trigger any page faults anyway, should it?
>

It can trigger physical page faults when reading from holes.

Andreas


> Honza
> --
> Jan Kara 
> SUSE Labs, CR
>


Re: [Cluster-devel] [PATCH v3 5/7] iomap: Support restarting direct I/O requests after user copy failures

2021-07-26 Thread Andreas Grünbacher
On Mon, Jul 26, 2021 at 7:21 PM Jan Kara  wrote:

> On Fri 23-07-21 22:58:38, Andreas Gruenbacher wrote:
> > In __iomap_dio_rw, when iomap_apply returns an -EFAULT error, complete
> the
> > request synchronously and reset the iterator to the start position.  This
> > allows callers to deal with the failure and retry the operation.
> >
> > In gfs2, we need to disable page faults while we're holding glocks to
> prevent
> > deadlocks.  This patch is the minimum solution I could find to make
> > iomap_dio_rw work with page faults disabled.  It's still expensive
> because any
> > I/O that was carried out before hitting -EFAULT needs to be retried.
> >
> > A possible improvement would be to add an IOMAP_DIO_FAULT_RETRY or
> similar flag
> > that would allow iomap_dio_rw to return a short result when hitting
> -EFAULT.
> > Callers could then retry only the rest of the request after dealing with
> the
> > page fault.
> >
> > Asynchronous requests turn into synchronous requests up to the point of
> the
> > page fault in any case, but they could be retried asynchronously after
> dealing
> > with the page fault.  To make that work, the completion notification
> would have
> > to include the bytes read or written before the page fault(s) as well,
> and we'd
> > need an additional iomap_dio_rw argument for that.
> >
> > Signed-off-by: Andreas Gruenbacher 
> > ---
> >  fs/iomap/direct-io.c | 9 +
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index cc0b4bc8861b..b0a494211bb4 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -561,6 +561,15 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter
> *iter,
> >   ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
> >   iomap_dio_actor);
> >   if (ret <= 0) {
> > + if (ret == -EFAULT) {
> > + /*
> > +  * To allow retrying the request, fail
> > +  * synchronously and reset the iterator.
> > +  */
> > + wait_for_completion = true;
> > + iov_iter_revert(dio->submit.iter,
> dio->size);
> > + }
> > +
>
> Hum, OK, but this means that if userspace submits large enough write, GFS2
> will livelock trying to complete it? While other filesystems can just
> submit multiple smaller bios constructed in iomap_apply() (paging in
> different parts of the buffer) and thus complete the write?
>

No. First, this affects reads but not writes. We cannot just blindly repeat
writes; when a page fault occurs in the middle of a write, the result will
be a short write. For reads, the plan is to add a flag to allow
iomap_dio_rw to return a partial result when a page fault occurs.
(Currently, it fails the entire request.) Then we can handle the page fault
and complete the rest of the request.

The changes needed for that are simple on the iomap side, but we need to go
through some gymnastics for handling the page fault without giving up the
glock in the non-contended case. There will still be the potential for
losing the lock and having to re-acquire it, in which case we'll actually
have to repeat the entire read.
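[Editor's note: the cost difference described above — retrying the entire
request versus keeping partial progress across a page fault — can be sketched
with a small toy simulation. This is illustrative only, not kernel code; the
function and parameter names are invented for the sketch.]

```python
def simulate_dio(pages_resident, keep_partial):
    """Toy model of direct I/O that aborts with -EFAULT at the first
    non-resident user page.

    pages_resident: list of bools; True means the page is already faulted in.
    keep_partial:   True models the proposed partial-result behaviour,
                    False models the current revert-and-retry-everything one.
    Returns (pages_delivered, pages_transferred_in_total).
    """
    resident = list(pages_resident)
    start = 0
    transferred = 0
    while True:
        pos = start
        # the dio engine copies pages until it hits a non-resident one
        while pos < len(resident) and resident[pos]:
            pos += 1
            transferred += 1
        if pos == len(resident):
            return pos, transferred
        # -EFAULT: fault the page in, then retry
        resident[pos] = True
        start = pos if keep_partial else 0  # start = 0 models iov_iter_revert()

# With one fault in the middle, the current scheme re-transfers everything
# done before the fault; the partial-result scheme transfers each page once.
print(simulate_dio([True, True, False, True], keep_partial=False))  # (4, 6)
print(simulate_dio([True, True, False, True], keep_partial=True))   # (4, 4)
```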

Thanks,
Andreas

> Honza
>
> >   /* magic error code to fall back to buffered I/O */
> >   if (ret == -ENOTBLK) {
> >   wait_for_completion = true;
> > --
> > 2.26.3
> >
> --
> Jan Kara 
> SUSE Labs, CR
>


Re: [Cluster-devel] [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper

2021-07-24 Thread Andreas Grünbacher
Am Sa., 24. Juli 2021 um 03:53 Uhr schrieb Al Viro :
> On Fri, Jul 23, 2021 at 10:58:34PM +0200, Andreas Gruenbacher wrote:
> > Introduce a new fault_in_iov_iter helper for manually faulting in an 
> > iterator.
> > Other than fault_in_pages_writeable(), this function is non-destructive.
> >
> > We'll use fault_in_iov_iter in gfs2 once we've determined that the iterator
> > passed to .read_iter or .write_iter isn't in memory.
>
> Hmm...  I suspect that this is going to be much heavier for read access
> than the existing variant.  Do we ever want it for anything other than
> writes?

I don't know if it actually is slower when pages need to be faulted
in, but I'm fine turning it into a write-only function.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper

2021-07-24 Thread Andreas Grünbacher
Am Sa., 24. Juli 2021 um 01:41 Uhr schrieb Linus Torvalds
:
> On Fri, Jul 23, 2021 at 1:58 PM Andreas Gruenbacher  
> wrote:
> >
> > Introduce a new fault_in_iov_iter helper for manually faulting in an 
> > iterator.
> > Other than fault_in_pages_writeable(), this function is non-destructive.
>
> Again, as I pointed out in the previous version, "Other than" is not
> sensible language.

Thanks for pointing this out a second time. I have no idea how I
screwed up fixing that.

Andreas



Re: [Cluster-devel] [PATCH] gfs2: Fix mmap + page fault deadlocks

2021-07-01 Thread Andreas Grünbacher
Am Fr., 2. Juli 2021 um 02:48 Uhr schrieb Linus Torvalds
:
> On Thu, Jul 1, 2021 at 5:30 PM Linus Torvalds  
> wrote:
> > Of course, if you ask for more data than the file has, that's another
> > thing, but who really does that with direct-IO? And if they do, why
> > should we care about their silly behavior?
>
> Now, if the issue is that people do IO for bigger areas than you have
> memory for, then I think that's a chunking issue. I don't think the
> ITER_IOVEC necessarily needs to be converted to an ITER_BVEC in one
> single go. That would indeed be painful if somebody tries to do some
> huge direct-IO when they just don't have the memory for it.
>
> But the fact is, direct-IO has been an incredible pain-point for
> decades, because it's (a) unusual, (b) buggy and (c) has very little
> overall design and common code.
>
> The three issues are likely intimately tied together.
>
> The iomap code at least has tried to make for much more common code,
> but I really think some direct-IO people should seriously reconsider
> how they are doing things when there are fundamental deadlocks in the
> design.
>
> And I do think that a ITER_IOVEC -> ITER_BVEC conversion function
> might be a really really good idea to solve this problem.

I've tried to explain above how keeping the user-space pages
referenced will just lead to different kinds of deadlocks. That is the
problem with this approach.

> There's even
> a very natural chunking algorithm: try to do as much as possible with
> get_user_pages_fast() - so that if you already *have* the memory, you
> can do the full IO (or at least a big part of it).
>
> And if get_user_pages_fast() only gives you a small area - or nothing
> at all - you chunk it down aggressively, and realize that "oh, doing
> direct-IO when user space is paged out might not be the most optimal
> case".
>
>   Linus

Thanks,
Andreas



Re: [Cluster-devel] [PATCH 006/141] gfs2: Fix fall-through warnings for Clang

2021-04-20 Thread Andreas Grünbacher
Am Di., 20. Apr. 2021 um 22:29 Uhr schrieb Gustavo A. R. Silva
:
> Friendly ping: who can take this, please?

Oops, that got lost. I've added it to for-next now.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH -next] gfs2: use kmem_cache_free() instead of kfree()

2021-04-09 Thread Andreas Grünbacher
Hi,

Am Fr., 9. Apr. 2021 um 10:20 Uhr schrieb Wei Yongjun :
> memory allocated by kmem_cache_alloc() should be freed using
> kmem_cache_free(), not kfree().

thanks for the patch, that's true. This patch has turned out to have
other problems as well, so we've pulled it and Bob is currently
investigating.

Andreas



Re: [Cluster-devel] [PATCH v2] fs: amend SLAB_RECLAIM_ACCOUNT on gfs2 related slab cache

2021-01-05 Thread Andreas Grünbacher
Bob,

Am Di., 5. Jan. 2021 um 14:28 Uhr schrieb Bob Peterson :
> - Original Message -
> > From: Zhaoyang Huang 
> >
> > As gfs2_quotad_cachep and gfs2_glock_cachep have registered
> > the shrinker, amending SLAB_RECLAIM_ACCOUNT when creating
> > them, which make the slab acount to be presiced.
> >
> > Signed-off-by: Zhaoyang Huang 
> > ---
> > v2: add gfs2_glock_cachep for same operation
> > ---
> Hi,
>
> Thanks. Your patch is now pushed to the linux-gfs2/for-next branch.
> I cleaned up the description a bit. For example, I changed "fs:" to
> "gfs2:" to conform to other gfs2 patches.

so the patch description now reads:

"As gfs2_quotad_cachep and gfs2_glock_cachep have registered
shrinkers, amending SLAB_RECLAIM_ACCOUNT when creating them,
which improves slab accounting."

I'm not sure what that description is based on; the definition of
SLAB_RECLAIM_ACCOUNT in include/linux/slab.h looks as follows, which
indicates that the purpose isn't really accounting but object
grouping:

/* The following flags affect the page allocator grouping pages by mobility */
/* Objects are reclaimable */
#define SLAB_RECLAIM_ACCOUNT((slab_flags_t __force)0x0002U)
#define SLAB_TEMPORARY  SLAB_RECLAIM_ACCOUNT/* Objects are
short-lived */

Furthermore, would it make sense to use SLAB_RECLAIM_ACCOUNT or
SLAB_TEMPORARY for gfs2_bufdata and gfs2_trans as well?

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v11 16/25] fs: Convert mpage_readpages to mpage_readahead

2020-06-16 Thread Andreas Grünbacher
Am Mi., 17. Juni 2020 um 02:33 Uhr schrieb Matthew Wilcox :
>
> On Wed, Jun 17, 2020 at 12:36:13AM +0200, Andreas Gruenbacher wrote:
> > Am Mi., 15. Apr. 2020 um 23:39 Uhr schrieb Matthew Wilcox 
> > :
> > > From: "Matthew Wilcox (Oracle)" 
> > >
> > > Implement the new readahead aop and convert all callers (block_dev,
> > > exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
> > > reiserfs & udf).  The callers are all trivial except for GFS2 & OCFS2.
> >
> > This patch leads to an ABBA deadlock in xfstest generic/095 on gfs2.
> >
> > Our lock hierarchy is such that the inode cluster lock ("inode glock")
> > for an inode needs to be taken before any page locks in that inode's
> > address space.
>
> How does that work for ...
>
> writepage:  yes, unlocks (see below)
> readpage:   yes, unlocks
> invalidatepage: yes
> releasepage:yes
> freepage:   yes
> isolate_page:   yes
> migratepage:yes (both)
> putback_page:   yes
> launder_page:   yes
> is_partially_uptodate:  yes
> error_remove_page:  yes
>
> Is there a reason that you don't take the glock in the higher level
> ops which are called before readhead gets called?  I'm looking at XFS,
> and it takes the xfs_ilock SHARED in xfs_file_buffered_aio_read()
> (called from xfs_file_read_iter).

Right, the approach from the following thread might fix this:

https://lore.kernel.org/linux-fsdevel/20191122235324.17245-1-agrue...@redhat.com/T/#t

Andreas



Re: [Cluster-devel] [PATCH] fs/gfs2: remove IS_DINODE

2020-01-21 Thread Andreas Grünbacher
Alex,

Am Di., 21. Jan. 2020 um 09:50 Uhr schrieb Alex Shi
:
> After commit 1579343a73e3 ("GFS2: Remove dirent_first() function"), this
> macro isn't used any more. so remove it.
>
> Signed-off-by: Alex Shi 
> Cc: Bob Peterson 
> Cc: Andreas Gruenbacher 
> Cc: cluster-devel@redhat.com
> Cc: linux-ker...@vger.kernel.org
> ---
>  fs/gfs2/dir.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
> index eb9c0578978f..636a8d0f3dab 100644
> --- a/fs/gfs2/dir.c
> +++ b/fs/gfs2/dir.c
> @@ -74,7 +74,6 @@
>  #include "util.h"
>
>  #define IS_LEAF 1 /* Hashed (leaf) directory */
> -#define IS_DINODE   2 /* Linear (stuffed dinode block) directory */

The same is true for the IS_LEAF macro. I'm adjusting the patch accordingly.

The other patch removing the LBIT macros looks good as well, so I'm
applying both.

Thanks,
Andreas




Re: [Cluster-devel] [PATCH] iomap: Export iomap_page_create and iomap_set_range_uptodate

2019-12-10 Thread Andreas Grünbacher
Am Di., 10. Dez. 2019 um 22:25 Uhr schrieb Darrick J. Wong
:
> On Tue, Dec 10, 2019 at 09:39:31PM +0100, Andreas Grünbacher wrote:
> > Am Di., 10. Dez. 2019 um 21:33 Uhr schrieb Darrick J. Wong
> > :
> > > On Tue, Dec 10, 2019 at 11:29:16AM +0100, Andreas Gruenbacher wrote:
> > > > These two functions are needed by filesystems for converting inline
> > > > ("stuffed") inodes into non-inline inodes.
> > > >
> > > > Signed-off-by: Andreas Gruenbacher 
> > >
> > > Looks fine to me... this is a 5.6 change, correct?
> >
> > Yes, so there's still plenty of time to get things in place until
> > then. I'd like to hear from Christoph if he has any objections. In any
> > case, this patch isn't going to break anything.
>
> By the way, the other symbols in fs/iomap/ are all EXPORT_SYMBOL_GPL.
> Does gfs2/RH/anyone have a particular requirement for EXPORT_SYMBOL, or
> could we make the new exports _GPL to match the rest?

I don't mind EXPORT_SYMBOL_GPL.

Thanks,
Andreas




Re: [Cluster-devel] [PATCH] iomap: Export iomap_page_create and iomap_set_range_uptodate

2019-12-10 Thread Andreas Grünbacher
Am Di., 10. Dez. 2019 um 21:33 Uhr schrieb Darrick J. Wong
:
> On Tue, Dec 10, 2019 at 11:29:16AM +0100, Andreas Gruenbacher wrote:
> > These two functions are needed by filesystems for converting inline
> > ("stuffed") inodes into non-inline inodes.
> >
> > Signed-off-by: Andreas Gruenbacher 
>
> Looks fine to me... this is a 5.6 change, correct?

Yes, so there's still plenty of time to get things in place until
then. I'd like to hear from Christoph if he has any objections. In any
case, this patch isn't going to break anything.

Thanks,
Andreas




Re: [Cluster-devel] [RFC PATCH 3/3] gfs2: Rework read and page fault locking

2019-11-26 Thread Andreas Grünbacher
Am Mo., 25. Nov. 2019 um 10:16 Uhr schrieb Kirill A. Shutemov
:
> On Sat, Nov 23, 2019 at 12:53:24AM +0100, Andreas Gruenbacher wrote:
> > @@ -778,15 +804,51 @@ static ssize_t gfs2_file_direct_write(struct kiocb 
> > *iocb, struct iov_iter *from)
> >
> >  static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >  {
> > + struct gfs2_inode *ip;
> > + struct gfs2_holder gh;
> > + size_t written = 0;
>
> 'written' in a read routine?

It's a bit weird, but it's the same as in generic_file_buffered_read.

> --
>  Kirill A. Shutemov

Thanks,
Andreas




Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

2019-11-22 Thread Andreas Grünbacher
Hi,

Am Do., 31. Okt. 2019 um 12:43 Uhr schrieb Steven Whitehouse
:
> Andreas, Bob, have I missed anything here?

I've looked into this a bit, and it seems that there's a reasonable
way to get rid of the lock taking in ->readpage and ->readpages
without a lot of code duplication. My proposal for that consists of
multiple patches, so I've posted it separately:

https://lore.kernel.org/linux-fsdevel/20191122235324.17245-1-agrue...@redhat.com/T/#t

Thanks,
Andreas




Re: [PATCH] gfs2: Delete an unnecessary check before brelse()

2019-09-03 Thread Andreas Grünbacher
Am Di., 3. Sept. 2019 um 15:21 Uhr schrieb Markus Elfring
:
> From: Markus Elfring 
> Date: Tue, 3 Sep 2019 15:10:05 +0200
>
> The brelse() function tests whether its argument is NULL
> and then returns immediately.
> Thus the test around the call is not needed.
>
> This issue was detected by using the Coccinelle software.

Thanks. The same applies to brelse() in gfs2_dir_no_add (which Coccinelle
apparently missed), so let me fix that as well.

Andreas


Re: [Cluster-devel] [PATCH v2 2/2] iomap: partially revert 4721a601099 (simulated directio short read on EFAULT)

2019-08-29 Thread Andreas Grünbacher
Hi Darrick,

Am Do., 29. Aug. 2019 um 05:12 Uhr schrieb Darrick J. Wong
:
> Hm, so I made an xfstest out of the program you sent me, and indeed
> reverting that chunk makes the failure go away, but that got me
> wondering -- that iomap kludge was a workaround for the splice code
> telling iomap to try to stuff  bytes into a pipe that only has X
> bytes of free buffer space.  We fixed splice_direct_to_actor to clamp
> the length parameter to the available pipe space, but we never did the
> same to do_splice:
>
> /* Don't try to read more the pipe has space for. */
> read_len = min_t(size_t, len,
>  (pipe->buffers - pipe->nrbufs) << PAGE_SHIFT);
> ret = do_splice_to(in, &pos, pipe, read_len, flags);
>
> Applying similar logic to the two (opipe != NULL) cases of do_splice()
> seem to make the EAGAIN problem go away too.  So why don't we teach
> do_splice to only ask for as many bytes as the pipe has space here too?
>
> Does the following patch fix it for you?

Yes, that works, thank you.

> From: Darrick J. Wong 
> Subject: [PATCH] splice: only read in as much information as there is pipe 
> buffer space
>
> Andreas Gruenbacher reports that on the two filesystems that support
> iomap directio, it's possible for splice() to return -EAGAIN (instead of
> a short splice) if the pipe being written to has less space available in
> its pipe buffers than the length supplied by the calling process.
>
> Months ago we fixed splice_direct_to_actor to clamp the length of the
> read request to the size of the splice pipe.  Do the same to do_splice.

Can you add a reference to that commit here (17614445576b6)?

> Signed-off-by: Darrick J. Wong 
> ---
>  fs/splice.c |   12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/fs/splice.c b/fs/splice.c
> index 98412721f056..50335515d7c1 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -1101,6 +1101,7 @@ static long do_splice(struct file *in, loff_t __user 
> *off_in,
> struct pipe_inode_info *ipipe;
> struct pipe_inode_info *opipe;
> loff_t offset;
> +   unsigned int pipe_pages;
> long ret;
>
> ipipe = get_pipe_info(in);
> @@ -1123,6 +1124,10 @@ static long do_splice(struct file *in, loff_t __user 
> *off_in,
> if ((in->f_flags | out->f_flags) & O_NONBLOCK)
> flags |= SPLICE_F_NONBLOCK;
>
> +   /* Don't try to read more the pipe has space for. */
> +   pipe_pages = opipe->buffers - opipe->nrbufs;
> +   len = min_t(size_t, len, pipe_pages << PAGE_SHIFT);

This should probably be min(len, (size_t)pipe_pages << PAGE_SHIFT).
Same for the second min_t here and the one added by commit
17614445576b6.

> +
> return splice_pipe_to_pipe(ipipe, opipe, len, flags);
> }
>
> @@ -1180,8 +1185,13 @@ static long do_splice(struct file *in, loff_t __user 
> *off_in,
>
> pipe_lock(opipe);
> ret = wait_for_space(opipe, flags);
> -   if (!ret)
> +   if (!ret) {
> +   /* Don't try to read more the pipe has space for. */
> +   pipe_pages = opipe->buffers - opipe->nrbufs;
> +   len = min_t(size_t, len, pipe_pages << PAGE_SHIFT);
> +
> ret = do_splice_to(in, &offset, opipe, len, flags);
> +   }
> pipe_unlock(opipe);
> if (ret > 0)
> wakeup_pipe_readers(opipe);

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v2 2/2] iomap: partially revert 4721a601099 (simulated directio short read on EFAULT)

2019-08-28 Thread Andreas Grünbacher
Am Mi., 28. Aug. 2019 um 16:23 Uhr schrieb Darrick J. Wong
:
> On Wed, Aug 21, 2019 at 10:23:49PM +0200, Andreas Grünbacher wrote:
> > Hi Darrick,
> >
> > Am So., 2. Dez. 2018 um 19:13 Uhr schrieb Darrick J. Wong
> > :
> > > From: Darrick J. Wong 
> > >
> > > In commit 4721a601099, we tried to fix a problem wherein directio reads
> > > into a splice pipe will bounce EFAULT/EAGAIN all the way out to
> > > userspace by simulating a zero-byte short read.  This happens because
> > > some directio read implementations (xfs) will call
> > > bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
> > > reads, but as soon as we run out of pipe buffers that _get_pages call
> > > returns EFAULT, which the splice code translates to EAGAIN and bounces
> > > out to userspace.
> > >
> > > In that commit, the iomap code catches the EFAULT and simulates a
> > > zero-byte read, but that causes assertion errors on regular splice reads
> > > because xfs doesn't allow short directio reads.  This causes infinite
> > > splice() loops and assertion failures on generic/095 on overlayfs
> > > because xfs only permit total success or total failure of a directio
> > > operation.  The underlying issue in the pipe splice code has now been
> > > fixed by changing the pipe splice loop to avoid avoid reading more data
> > > than there is space in the pipe.
> > >
> > > Therefore, it's no longer necessary to simulate the short directio, so
> > > remove the hack from iomap.
> > >
> > > Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when 
> > > pipes fill")
> > > Reported-by: Amir Goldstein 
> > > Reviewed-by: Christoph Hellwig 
> > > Signed-off-by: Darrick J. Wong 
> > > ---
> > > v2: split into two patches per hch request
> > > ---
> > >  fs/iomap.c |9 -
> > >  1 file changed, 9 deletions(-)
> > >
> > > diff --git a/fs/iomap.c b/fs/iomap.c
> > > index 3ffb776fbebe..d6bc98ae8d35 100644
> > > --- a/fs/iomap.c
> > > +++ b/fs/iomap.c
> > > @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter 
> > > *iter,
> > > dio->wait_for_completion = true;
> > > ret = 0;
> > > }
> > > -
> > > -   /*
> > > -* Splicing to pipes can fail on a full pipe. We 
> > > have to
> > > -* swallow this to make it look like a short IO
> > > -* otherwise the higher splice layers will 
> > > completely
> > > -* mishandle the error and stop moving data.
> > > -*/
> > > -   if (ret == -EFAULT)
> > > -   ret = 0;
> > > break;
> > > }
> > > pos += ret;
> >
> > I'm afraid this breaks the following test case on xfs and gfs2, the
> > two current users of iomap_dio_rw.
>
> Hmm, I had kinda wondered if regular pipes still needed this help.
> Evidently we don't have a lot of splice tests in fstests. :(

So what do you suggest as a fix?

> > Here, the splice system call fails with errno = EAGAIN when trying to
> > "move data" from a file opened with O_DIRECT into a pipe.
> >
> > The test case can be run with option -d to not use O_DIRECT, which
> > makes the test succeed.
> >
> > The -r option switches from reading from the pipe sequentially to
> > reading concurrently with the splice, which doesn't change the
> > behavior.
> >
> > Any thoughts?
>
> This would be great as an xfstest! :)

Or perhaps something generalized from it.

> Do you have one ready to go, or should I just make one from the source
> code?

The bug originally triggered in our internal cluster test system, and
I've recreated the included test case from the strace. That's all
I have for now; feel free to take it, of course.

It could be that the same condition can be triggered with one of the
existing utilities (fio/fsstress/...).

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v2 2/2] iomap: partially revert 4721a601099 (simulated directio short read on EFAULT)

2019-08-21 Thread Andreas Grünbacher
Hi Darrick,

Am So., 2. Dez. 2018 um 19:13 Uhr schrieb Darrick J. Wong
:
> From: Darrick J. Wong 
>
> In commit 4721a601099, we tried to fix a problem wherein directio reads
> into a splice pipe will bounce EFAULT/EAGAIN all the way out to
> userspace by simulating a zero-byte short read.  This happens because
> some directio read implementations (xfs) will call
> bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
> reads, but as soon as we run out of pipe buffers that _get_pages call
> returns EFAULT, which the splice code translates to EAGAIN and bounces
> out to userspace.
>
> In that commit, the iomap code catches the EFAULT and simulates a
> zero-byte read, but that causes assertion errors on regular splice reads
> because xfs doesn't allow short directio reads.  This causes infinite
> splice() loops and assertion failures on generic/095 on overlayfs
> because xfs only permit total success or total failure of a directio
> operation.  The underlying issue in the pipe splice code has now been
> fixed by changing the pipe splice loop to avoid avoid reading more data
> than there is space in the pipe.
>
> Therefore, it's no longer necessary to simulate the short directio, so
> remove the hack from iomap.
>
> Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when 
> pipes fill")
> Reported-by: Amir Goldstein 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Darrick J. Wong 
> ---
> v2: split into two patches per hch request
> ---
>  fs/iomap.c |9 -
>  1 file changed, 9 deletions(-)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 3ffb776fbebe..d6bc98ae8d35 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> dio->wait_for_completion = true;
> ret = 0;
> }
> -
> -   /*
> -* Splicing to pipes can fail on a full pipe. We have 
> to
> -* swallow this to make it look like a short IO
> -* otherwise the higher splice layers will completely
> -* mishandle the error and stop moving data.
> -*/
> -   if (ret == -EFAULT)
> -   ret = 0;
> break;
> }
> pos += ret;

I'm afraid this breaks the following test case on xfs and gfs2, the
two current users of iomap_dio_rw.

Here, the splice system call fails with errno = EAGAIN when trying to
"move data" from a file opened with O_DIRECT into a pipe.

The test case can be run with option -d to not use O_DIRECT, which
makes the test succeed.

The -r option switches from reading from the pipe sequentially to
reading concurrently with the splice, which doesn't change the
behavior.

Any thoughts?

Thanks,
Andreas

=== 8< ===
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 

#define SECTOR_SIZE 512
#define BUFFER_SIZE (150 * SECTOR_SIZE)

void read_from_pipe(int fd, const char *filename, size_t size)
{
char buffer[SECTOR_SIZE];
size_t sz;
ssize_t ret;

while (size) {
sz = size;
if (sz > sizeof buffer)
sz = sizeof buffer;
ret = read(fd, buffer, sz);
if (ret < 0)
err(1, "read: %s", filename);
if (ret == 0) {
fprintf(stderr, "read: %s: unexpected EOF\n", filename);
exit(1);
}
size -= ret;
}
}

void do_splice1(int fd, const char *filename, size_t size)
{
bool retried = false;
int pipefd[2];

if (pipe(pipefd) == -1)
err(1, "pipe");
while (size) {
ssize_t spliced;

spliced = splice(fd, NULL, pipefd[1], NULL, size, SPLICE_F_MOVE);
if (spliced == -1) {
if (errno == EAGAIN && !retried) {
retried = true;
fprintf(stderr, "retrying splice\n");
sleep(1);
continue;
}
err(1, "splice");
}
read_from_pipe(pipefd[0], filename, spliced);
size -= spliced;
}
close(pipefd[0]);
close(pipefd[1]);
}

void do_splice2(int fd, const char *filename, size_t size)
{
bool retried = false;
int pipefd[2];
int pid;

if (pipe(pipefd) == -1)
err(1, "pipe");

pid = fork();
if (pid == 0) {
close(pipefd[1]);
read_from_pipe(pipefd[0], filename, size);
exit(0);
} else {
close(pipefd[0]);
while (size) {
ssize_t spliced;

spliced = splice(fd, NULL, pipefd[1], NULL, size, SPLICE_F_MOVE);
if (spliced == -1) {
if (errno == EAGAIN && !retried) {
retried = true;
fprintf(

Re: [PATCH v5 03/18] gfs2: add compat_ioctl support

2019-08-18 Thread Andreas Grünbacher
Am So., 18. Aug. 2019 um 21:32 Uhr schrieb Arnd Bergmann :
> On Fri, Aug 16, 2019 at 7:32 PM Andreas Gruenbacher  
> wrote:
> > On Wed, Aug 14, 2019 at 10:45 PM Arnd Bergmann  wrote:
> > > +   /* These are just misnamed, they actually get/put from/to user an 
> > > int */
> > > +   switch(cmd) {
> > > +   case FS_IOC32_GETFLAGS:
> > > +   cmd = FS_IOC_GETFLAGS;
> > > +   break;
> > > +   case FS_IOC32_SETFLAGS:
> > > +   cmd = FS_IOC_SETFLAGS;
> > > +   break;
> >
> > I'd like the code to be more explicit here:
> >
> > case FITRIM:
> > case FS_IOC_GETFSLABEL:
> >   break;
> > default:
> >   return -ENOIOCTLCMD;
>
> I've looked at it again: if we do this, the function actually becomes
> longer than the native gfs2_ioctl(). Should we just make a full copy then?

I don't think the length of gfs2_compat_ioctl is really an issue as
long as the function is that simple.

> static long gfs2_compat_ioctl(struct file *filp, unsigned int cmd,
> unsigned long arg)
> {
> switch(cmd) {
> case FS_IOC32_GETFLAGS:
> return gfs2_get_flags(filp, (u32 __user *)arg);
> case FS_IOC32_SETFLAGS:
> return gfs2_set_flags(filp, (u32 __user *)arg);
> case FITRIM:
> return gfs2_fitrim(filp, (void __user *)arg);
> case FS_IOC_GETFSLABEL:
> return gfs2_getlabel(filp, (char __user *)arg);
> }
>
> return -ENOTTY;
> }

Don't we still need the compat_ptr conversion? That seems to be the
main point of having a compat_ioctl operation.

> > Should we feed this through the gfs2 tree?
>
> A later patch that removes the FITRIM handling from fs/compat_ioctl.c
> depends on it, so I'd like to keep everything together.

Ok, fine for me.

Thanks,
Andreas


Re: [PATCH] sub sd_rgrps When clear rgrp

2019-06-27 Thread Andreas Grünbacher
Hi Yang Bin,

Am Do., 27. Juni 2019 um 05:08 Uhr schrieb Yang Bin :
> From: " Yang Bin "
>
> When clear rgrp,sub sd_rgrps after erased from rindex_tree

this patch isn't incorrect, but all it does is ensure that sd_rgrps
reaches zero before we destroy a struct gfs2_sbd. However, sd_rgrps is
only ever used when a filesystem is active, and while that is the
case, it can never decrease. It will grow when the filesystem is grown,
though.

So what are you trying to achieve with this patch?

> Signed-off-by: Yang Bin 
> ---
>  fs/gfs2/rgrp.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 15d6e32..a4b2e83
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -730,6 +730,7 @@ void gfs2_clear_rgrpd(struct gfs2_sbd *sdp)
> gl = rgd->rd_gl;
>
> rb_erase(n, &sdp->sd_rindex_tree);
> +   sdp->sd_rgrps--;
>
> if (gl) {
> glock_clear_object(gl, rgd);
> --
> 1.8.3.1
>

Thanks,
Andreas


[Cluster-devel] GFS2: Pull Request (2nd attempt)

2019-05-08 Thread Andreas Grünbacher
Hi Linus,

please consider pulling the following changes for the GFS2 file
system, now without resolving the merge conflict Stephen Rothwell has
pointed out between commit

  2b070cfe582b ("block: remove the i argument to bio_for_each_segment_all")

which is already merged, interacting with commit

  e21e191994af ("gfs2: read journal in large chunks")

which is part of this pull request. Stephen's suggested resolution of
the conflict can be found at
https://lore.kernel.org/lkml/20190506150707.618f0...@canb.auug.org.au/.

Thanks,
Andreas

The following changes since commit b4b52b881cf08e13d110eac811d4becc0775abbf:

  Merge tag 'Wimplicit-fallthrough-5.2-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
(2019-05-07 12:48:10 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git
tags/gfs2-for-5.2

for you to fetch changes up to f4686c26ecc34e8e458b8235f0af5198c9b13bfd:

  gfs2: read journal in large chunks (2019-05-07 23:39:15 +0200)


We've got the following patches ready for this merge window:

"gfs2: Fix loop in gfs2_rbm_find (v2)"

  A rework of a fix we ended up reverting in 5.0 because of an iozone
  performance regression.

"gfs2: read journal in large chunks" and
"gfs2: fix race between gfs2_freeze_func and unmount"

  An improved version of a commit we also ended up reverting in 5.0
  because of a regression in xfstest generic/311.  It turns out that the
  journal changes were mostly innocent and that unfreeze didn't wait for
  the freeze to complete, which caused the filesystem to be unmounted
  before it was actually idle.

"gfs2: Fix occasional glock use-after-free"
"gfs2: Fix iomap write page reclaim deadlock"
"gfs2: Fix lru_count going negative"

  Fixes for various problems reported and partially fixed by Citrix
  engineers.  Thank you very much.

"gfs2: clean_journal improperly set sd_log_flush_head"

  Another fix from Bob.

A few other minor cleanups.


Abhi Das (2):
  gfs2: fix race between gfs2_freeze_func and unmount
  gfs2: read journal in large chunks

Andreas Gruenbacher (7):
  gfs2: Fix loop in gfs2_rbm_find (v2)
  gfs2: Fix occasional glock use-after-free
  gfs2: Remove misleading comments in gfs2_evict_inode
  gfs2: Remove unnecessary extern declarations
  gfs2: Rename sd_log_le_{revoke,ordered}
  gfs2: Rename gfs2_trans_{add_unrevoke => remove_revoke}
  gfs2: Fix iomap write page reclaim deadlock

Bob Peterson (2):
  gfs2: clean_journal improperly set sd_log_flush_head
  gfs2: Replace gl_revokes with a GLF flag

Ross Lagerwall (1):
  gfs2: Fix lru_count going negative

 fs/gfs2/aops.c   |  14 ++-
 fs/gfs2/bmap.c   | 118 ++-
 fs/gfs2/bmap.h   |   1 +
 fs/gfs2/dir.c|   2 +-
 fs/gfs2/glock.c  |  25 +++--
 fs/gfs2/glops.c  |   3 +-
 fs/gfs2/incore.h |   9 +-
 fs/gfs2/log.c|  47 ++
 fs/gfs2/log.h|   5 +-
 fs/gfs2/lops.c   | 261 ++-
 fs/gfs2/lops.h   |  11 +--
 fs/gfs2/main.c   |   1 -
 fs/gfs2/ops_fstype.c |   7 +-
 fs/gfs2/recovery.c   | 135 ++
 fs/gfs2/recovery.h   |   4 +-
 fs/gfs2/rgrp.c   |  56 +--
 fs/gfs2/super.c  |  20 ++--
 fs/gfs2/trans.c  |   4 +-
 fs/gfs2/trans.h  |   2 +-
 fs/gfs2/xattr.c  |   6 +-
 20 files changed, 438 insertions(+), 293 deletions(-)



Re: [Cluster-devel] [PATCH v7 5/5] gfs2: Fix iomap write page reclaim deadlock

2019-04-30 Thread Andreas Grünbacher
Am Di., 30. Apr. 2019 um 17:48 Uhr schrieb Darrick J. Wong
:
> Ok, I'll take the first four patches through the iomap branch and cc you
> on the pull request.

Ok great, thanks.

Andreas



Re: [Cluster-devel] [PATCH 5/6] iomap: generic inline data handling

2018-06-07 Thread Andreas Grünbacher
2018-06-06 12:40 GMT+02:00 Christoph Hellwig :
> From: Andreas Gruenbacher 
>
> Add generic inline data handling by adding a pointer to the inline data
> region to struct iomap.  When handling a buffered IOMAP_INLINE write,
> iomap_write_begin will copy the current inline data from the inline data
> region into the page cache, and iomap_write_end will copy the changes in
> the page cache back to the inline data region.
>
> This doesn't cover inline data reads and direct I/O yet because so far,
> we have no users.
>
> Signed-off-by: Andreas Gruenbacher 
> [hch: small cleanups to better fit in with other iomap work]
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap.c| 62 
> +--
>  include/linux/iomap.h |  1 +
>  2 files changed, 56 insertions(+), 7 deletions(-)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index f428baa32185..f2a491b98b7c 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -116,6 +116,26 @@ iomap_write_failed(struct inode *inode, loff_t pos, 
> unsigned len)
> truncate_pagecache_range(inode, max(pos, i_size), pos + len);
>  }
>
> +static void
> +iomap_read_inline_data(struct inode *inode, struct page *page,
> +   struct iomap *iomap)
> +{
> +   size_t size = i_size_read(inode);
> +   void *addr;
> +
> +   if (PageUptodate(page))
> +   return;
> +
> +   BUG_ON(page->index);
> +   BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +
> +   addr = kmap_atomic(page);
> +   memcpy(addr, iomap->inline_data, size);
> +   memset(addr + size, 0, PAGE_SIZE - size);
> +   kunmap_atomic(addr);
> +   SetPageUptodate(page);
> +}

Can you please move iomap_read_inline_data above iomap_write_failed in
fs/iomap.c so that it will end up above the readpage code? We'll need
iomap_read_inline_data in the readpage code.

>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned 
> flags,
> struct page **pagep, struct iomap *iomap)
> @@ -133,7 +153,11 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
> if (!page)
> return -ENOMEM;
>
> -   status = __block_write_begin_int(page, pos, len, NULL, iomap);
> +   if (iomap->type == IOMAP_INLINE)
> +   iomap_read_inline_data(inode, page, iomap);
> +   else
> +   status = __block_write_begin_int(page, pos, len, NULL, iomap);
> +
> if (unlikely(status)) {
> unlock_page(page);
> put_page(page);
> @@ -146,14 +170,37 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
> return status;
>  }
>
> +static int
> +iomap_write_end_inline(struct inode *inode, struct page *page,
> +   struct iomap *iomap, loff_t pos, unsigned copied)
> +{
> +   void *addr;
> +
> +   WARN_ON_ONCE(!PageUptodate(page));
> +   BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +
> +   addr = kmap_atomic(page);
> +   memcpy(iomap->inline_data + pos, addr + pos, copied);
> +   kunmap_atomic(addr);
> +
> +   mark_inode_dirty(inode);
> +   __generic_write_end(inode, pos, copied, page);
> +   return copied;
> +}
> +
>  static int
>  iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
> -   unsigned copied, struct page *page)
> +   unsigned copied, struct page *page, struct iomap *iomap)
>  {
> int ret;
>
> -   ret = generic_write_end(NULL, inode->i_mapping, pos, len,
> -   copied, page, NULL);
> +   if (iomap->type == IOMAP_INLINE) {
> +   ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
> +   } else {
> +   ret = generic_write_end(NULL, inode->i_mapping, pos, len,
> +   copied, page, NULL);
> +   }
> +
> if (ret < len)
> iomap_write_failed(inode, pos, len);
> return ret;
> @@ -208,7 +255,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
> length, void *data,
>
> flush_dcache_page(page);
>
> -   status = iomap_write_end(inode, pos, bytes, copied, page);
> +   status = iomap_write_end(inode, pos, bytes, copied, page,
> +   iomap);
> if (unlikely(status < 0))
> break;
> copied = status;
> @@ -302,7 +350,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t 
> length, void *data,
>
> WARN_ON_ONCE(!PageUptodate(page));
>
> -   status = iomap_write_end(inode, pos, bytes, bytes, page);
> +   status = iomap_write_end(inode, pos, bytes, bytes, page, 
> iomap);
> if (unlikely(status <= 0)) {
> if (WARN_ON_ONCE(status == 0))
> return -EIO;
> @@ -354,7

Re: [Cluster-devel] iomap preparations for GFS2

2018-06-06 Thread Andreas Grünbacher
2018-06-06 13:38 GMT+02:00 Andreas Gruenbacher :
> On 6 June 2018 at 12:40, Christoph Hellwig  wrote:
>> Hi all,
>>
>> this is a slight rework of the patches from Andreas to prepare for gfs2
>> using the iomap code.  I'd like to start with an immutable branch for
>> iomap patches in either the XFS tree or a tree of mine own with something
>> like this so that we can have a nice common base for both the major
>> iomap changes for XFS and the GFS2 work.

When do you want to push this upstream? We've got the gfs2 patches
that depend on it, and we cannot make progress with those before this
set is pushed.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH 1/6] fs: factor out a __generic_write_end helper

2018-06-06 Thread Andreas Grünbacher
2018-06-06 12:59 GMT+02:00 Christoph Hellwig :
> On Wed, Jun 06, 2018 at 12:40:28PM +0200, Christoph Hellwig wrote:
>> Bits of the buffer.c based write_end implementations that don't know
>> about buffer_heads and can be reused by other implementations.
>>
>> Signed-off-by: Christoph Hellwig 
>
> This actually already had two reviews:
>
> Reviewed-by: Brian Foster 
> Reviewed-by: Darrick J. Wong 
>
> That got lost through the detour via gfs2.

And one by me which has been dropped this time around, but I suppose
it doesn't matter.

Andreas



Re: [Cluster-devel] [PATCH v8 06/10] iomap: Add page_write_end iomap hook

2018-06-05 Thread Andreas Grünbacher
2018-06-05 14:29 GMT+02:00 David Sterba :
> On Tue, Jun 05, 2018 at 02:17:38PM +0200, Andreas Grünbacher wrote:
>> 2018-06-05 14:07 GMT+02:00 David Sterba :
>> > On Mon, Jun 04, 2018 at 09:31:19PM +0200, Andreas Gruenbacher wrote:
>> >> --- a/fs/iomap.c
>> >> +++ b/fs/iomap.c
>> >> @@ -181,16 +181,22 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
>> >> unsigned len, unsigned flags,
>> >>
>> >>  static int
>> >>  iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
>> >> - unsigned copied, struct page *page, struct iomap *iomap)
>> >> + unsigned copied, struct page *page, struct iomap *iomap,
>> >> + const struct iomap_ops *ops)
>> >>  {
>> >> + typeof(ops->page_write_end) page_write_end = ops->page_write_end;
>> >
>> > Is the reason to use typeof to avoid repeating the type of
>> > page_write_end?
>>
>> Yes, the type is void (*)(struct inode *, loff_t, unsigned, struct
>> page *, struct iomap *), which is a bit bulky.
>
> Agreed, so why don't you simply use ops->page_write_end ?

Alright.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v8 03/10] iomap: Complete partial direct I/O writes synchronously

2018-06-05 Thread Andreas Grünbacher
2018-06-05 14:10 GMT+02:00 David Sterba :
> On Mon, Jun 04, 2018 at 09:31:16PM +0200, Andreas Gruenbacher wrote:
>> @@ -1062,8 +1063,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>   if (ret < 0)
>>   iomap_dio_set_error(dio, ret);
>>

+   /*
+* Make sure changes to dio are visible globally before the atomic
+* counter decrement.
+*/

>> + smp_mb__before_atomic();
>>   if (!atomic_dec_and_test(&dio->ref)) {
>
> The barrier should be documented. I tried to do a quick look around the
> code if it's clear why it's there but it's not. Thanks.
>
>> - if (!is_sync_kiocb(iocb))
>> + if (!dio->wait_for_completion)
>>   return -EIOCBQUEUED;
>>
>>   for (;;) {

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v8 06/10] iomap: Add page_write_end iomap hook

2018-06-05 Thread Andreas Grünbacher
2018-06-05 14:07 GMT+02:00 David Sterba :
> On Mon, Jun 04, 2018 at 09:31:19PM +0200, Andreas Gruenbacher wrote:
>> --- a/fs/iomap.c
>> +++ b/fs/iomap.c
>> @@ -181,16 +181,22 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
>> unsigned len, unsigned flags,
>>
>>  static int
>>  iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
>> - unsigned copied, struct page *page, struct iomap *iomap)
>> + unsigned copied, struct page *page, struct iomap *iomap,
>> + const struct iomap_ops *ops)
>>  {
>> + typeof(ops->page_write_end) page_write_end = ops->page_write_end;
>
> Is the reason to use typeof to avoid repeating the type of
> page_write_end?

Yes, the type is void (*)(struct inode *, loff_t, unsigned, struct
page *, struct iomap *), which is a bit bulky.

> As it's only for a temporary variable with 2 uses,
> ops->page_write_end does not hurt readability nor is too much typing.
> I would not recommend using typeof outside of the justified contexts
> like macros or without a good reason.

Andreas



Re: [Cluster-devel] [PATCH v8 06/10] iomap: Add page_write_end iomap hook

2018-06-05 Thread Andreas Grünbacher
2018-06-04 21:31 GMT+02:00 Andreas Gruenbacher :
> Add a page_write_end hook called when done writing to a page, for
> filesystems that implement data journaling: in that case, pages are
> written to the journal before being written back to their proper on-disk
> locations.

> The new hook is bypassed for IOMAP_INLINE mappings.

Oops, no longer true of course.

> Signed-off-by: Andreas Gruenbacher 
> ---
>  fs/iomap.c| 58 ---
>  include/linux/iomap.h |  8 ++
>  2 files changed, 52 insertions(+), 14 deletions(-)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 48cd67227811..3b34c957d2fd 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -181,16 +181,22 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
>
>  static int
>  iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
> -   unsigned copied, struct page *page, struct iomap *iomap)
> +   unsigned copied, struct page *page, struct iomap *iomap,
> +   const struct iomap_ops *ops)
>  {
> +   typeof(ops->page_write_end) page_write_end = ops->page_write_end;
> int ret;
>
> if (iomap->type == IOMAP_INLINE) {
> iomap_write_inline_data(inode, page, iomap, pos, copied);
> __generic_write_end(inode, pos, copied, page);
> +   if (page_write_end)
> +   page_write_end(inode, pos, copied, page, iomap);
> return copied;
> }
>
> +   if (page_write_end)
> +   page_write_end(inode, pos, copied, page, iomap);
> ret = generic_write_end(NULL, inode->i_mapping, pos, len,
> copied, page, NULL);
> if (ret < len)
> @@ -198,11 +204,17 @@ iomap_write_end(struct inode *inode, loff_t pos, 
> unsigned len,
> return ret;
>  }
>
> +struct iomap_write_args {
> +   const struct iomap_ops *ops;
> +   struct iov_iter *iter;
> +};
> +
>  static loff_t
>  iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> struct iomap *iomap)
>  {
> -   struct iov_iter *i = data;
> +   struct iomap_write_args *args = data;
> +   struct iov_iter *i = args->iter;
> long status = 0;
> ssize_t written = 0;
> unsigned int flags = AOP_FLAG_NOFS;
> @@ -248,7 +260,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
> length, void *data,
> flush_dcache_page(page);
>
> status = iomap_write_end(inode, pos, bytes, copied, page,
> -   iomap);
> +   iomap, args->ops);
> if (unlikely(status < 0))
> break;
> copied = status;
> @@ -285,10 +297,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct 
> iov_iter *iter,
>  {
> struct inode *inode = iocb->ki_filp->f_mapping->host;
> loff_t pos = iocb->ki_pos, ret = 0, written = 0;
> +   struct iomap_write_args args = {
> +   .ops = ops,
> +   .iter = iter,
> +   };
>
> while (iov_iter_count(iter)) {
> ret = iomap_apply(inode, pos, iov_iter_count(iter),
> -   IOMAP_WRITE, ops, iter, iomap_write_actor);
> +   IOMAP_WRITE, ops, &args, iomap_write_actor);
> if (ret <= 0)
> break;
> pos += ret;
> @@ -319,6 +335,7 @@ static loff_t
>  iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> struct iomap *iomap)
>  {
> +   const struct iomap_ops *ops = data;
> long status = 0;
> ssize_t written = 0;
>
> @@ -342,7 +359,8 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t 
> length, void *data,
>
> WARN_ON_ONCE(!PageUptodate(page));
>
> -   status = iomap_write_end(inode, pos, bytes, bytes, page, 
> iomap);
> +   status = iomap_write_end(inode, pos, bytes, bytes, page, 
> iomap,
> +   ops);
> if (unlikely(status <= 0)) {
> if (WARN_ON_ONCE(status == 0))
> return -EIO;
> @@ -368,8 +386,8 @@ iomap_file_dirty(struct inode *inode, loff_t pos, loff_t 
> len,
> loff_t ret;
>
> while (len) {
> -   ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops, NULL,
> -   iomap_dirty_actor);
> +   ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops,
> +   (void *)ops, iomap_dirty_actor);
> if (ret <= 0)
> return ret;
> pos += ret;
> @@ -381,7 +399,8 @@ iomap_file_dirty(struct inode *inode, loff_t pos, loff_t 
> len,
>  EXPORT_SYMBOL_GPL(iomap_file_dirty);
>
>  static int iomap_zero(struct inode *inode, loff_t pos, unsigned of

Re: [Cluster-devel] [PATCH v6 4/9] iomap: Generic inline data handling

2018-06-04 Thread Andreas Grünbacher
2018-06-04 14:12 GMT+02:00 Christoph Hellwig :
> On Mon, Jun 04, 2018 at 02:02:10PM +0200, Andreas Grünbacher wrote:
>> For gfs2, passing a private pointer from iomap_begin to iomap_end (for
>> the buffer head holding the inode) would help as well, but it's not
>> critical.
>
> A private pointer is fine with me.

Okay, that'll be a patch for later though.

Andreas



Re: [Cluster-devel] [PATCH v7 12/12] iomap: Put struct iomap_ops into struct iomap

2018-06-04 Thread Andreas Grünbacher
2018-06-04 14:52 GMT+02:00 Christoph Hellwig :
> On Mon, Jun 04, 2018 at 02:37:29PM +0200, Andreas Gruenbacher wrote:
>> We need access to the struct iomap_ops in iomap_write_end to call the
>> (optional) page_write_end hook, so instead of passing the operators to
>> iomap_write_end differently depending on the code path, add an ops field
>> to struct iomap.
>
> I don't really like this.  We already pass the iomap_ops to the
> function we call from iomap.c, and in fact ->iomap_begin is called
> through the ops.

There's no duplicate ops passing going on if that's what you mean.

> But what we could do is to move the page_write_end callback out of
> iomap_ops and just attached it to the iomap, especially given that
> it isn't really as generic as the other ops.

Given that struct iomap is allocated inside iomap_apply(), if
page_write_end is moved to struct iomap, how would the filesystem set
it?

> Can't say I like the implications of this callback in general, but
> the use case is real, so..

Yes, it's not very pretty.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v7 07/12] iomap: Add page_write_end iomap hook

2018-06-04 Thread Andreas Grünbacher
2018-06-04 14:50 GMT+02:00 Christoph Hellwig :
> On Mon, Jun 04, 2018 at 02:37:24PM +0200, Andreas Gruenbacher wrote:
>> Add a page_write_end hook called when done writing to a page, for
>> filesystems that implement data journaling: in that case, pages are
>> written to the journal before being written back to their proper on-disk
>> locations.  The new hook is bypassed for IOMAP_INLINE mappings.
>
> I'd rather not bypass it in common code.  Can you pass the iomap
> to the callback and then do the bypass in gfs2 to keep it generic?

As you wish. You realize we're in extreme bike shed coloring land though, right?

Andreas



Re: [Cluster-devel] [PATCH v7 05/12] fs: allow to always dirty inode in __generic_write_end

2018-06-04 Thread Andreas Grünbacher
2018-06-04 14:48 GMT+02:00 Christoph Hellwig :
> On Mon, Jun 04, 2018 at 02:37:22PM +0200, Andreas Gruenbacher wrote:
>> @@ -2106,7 +2105,7 @@ int __generic_write_end(struct inode *inode, loff_t 
>> pos, unsigned copied,
>>* ordering of page lock and transaction start for journaling
>>* filesystems.
>>*/
>> - if (i_size_changed)
>> + if (dirty_inode)
>>   mark_inode_dirty(inode);
>>   return copied;
>
> Calling mark_inode_dirty on an already dirty inode is cheap, so how
> about just calling it directly in your caller that always wants to
> set the inode dirty?

The dirty_inode hook is where gfs2 converts the in-core inode into the
on-disk format (the equivalent of xfs_inode_to_disk), so it's not
quite that cheap. I realize this could be done differently in gfs2,
but that's not where we stand today, and I really can't fix ten things
all at the same time.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v6 5/9] iomap: Add write_end iomap operation

2018-06-04 Thread Andreas Grünbacher
2018-06-02 19:06 GMT+02:00 Christoph Hellwig :
> On Sat, Jun 02, 2018 at 11:57:13AM +0200, Andreas Gruenbacher wrote:
>> Add a write_end operation to struct iomap_ops to provide a way of
>> overriding the default behavior of iomap_write_end.  This will be used
>> for implementing data journaling in gfs2: in the data journaling case,
>> pages are written into the journal before being written back to their
>> proper on-disk locations.
>
> Please names this page_write_end and make it an optional callout
> just for the additional functionality, that is keep the call to
> iomap_write_end hardcoded in iomap.c, just call the new method before
> it.

I'll send an updated patch queue shortly.

>> +
>> +struct iomap_write_args {
>> + const struct iomap_ops *ops;
>> + struct iov_iter *iter;
>> +};
>
> Also I wonder if we should just pass the iomap_ops diretly to the actor
> callback.

It would make more sense to put the iomap_ops pointer into struct
iomap itself. I'll put a patch doing that at the end of my patch queue
so you can have a look.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v6 4/9] iomap: Generic inline data handling

2018-06-04 Thread Andreas Grünbacher
2018-06-02 19:04 GMT+02:00 Christoph Hellwig :
> On Sat, Jun 02, 2018 at 11:57:12AM +0200, Andreas Gruenbacher wrote:
>> Add generic inline data handling by adding a pointer to the inline data
>> region to struct iomap.  When handling a buffered IOMAP_INLINE write,
>> iomap_write_begin will copy the current inline data from the inline data
>> region into the page cache, and iomap_write_end will copy the changes in
>> the page cache back to the inline data region.
>
> This approach looks good.  A few comments below:
>
>>
>> This doesn't cover inline data reads and direct I/O yet because so far,
>> we have no users.
>
> I'm fine with that as a first step, but gfs2 should be able to do
> proper direct I/O to inline data and use my new iomap_readpage(s)
> easily at least for blocksize == PAGE_SIZE, so this should be added
> soon.

Yes, there are still a few loose ends in gfs2's iomap support, but what
we have so far is already quite useful.

>> -int generic_write_end(struct file *file, struct address_space *mapping,
>> +void __generic_write_end(struct file *file, struct address_space *mapping,
>>   loff_t pos, unsigned len, unsigned copied,
>> - struct page *page, void *fsdata)
>> + struct page *page, void *fsdata, bool dirty_inode)
>
> This is going to clash with
>
> http://git.infradead.org/users/hch/xfs.git/commitdiff/2733909d6b40046ce9c7302c2e742c5e993a0108
>
> It should also be a separate prep patch with a good explanation

I'll cherry-pick that commit for now.

>>  iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
>> - unsigned copied, struct page *page)
>> + unsigned copied, struct page *page, struct iomap *iomap)
>
> Note that I have another patch adding this parameter.  I think we'll need
> a common iomap base tree for the gfs2 and xfs changes for the next
> merge window.  I'd be happy to set one up.

No objections there, I don't think we'll end up with major conflicts though.

>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 918f14075702..c61113c71a60 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -47,6 +47,7 @@ struct iomap {
>>   u64 length; /* length of mapping, bytes */
>>   u16 type;   /* type of mapping */
>>   u16 flags;  /* flags for mapping */
>> + void*inline_data;  /* inline data buffer */
>>   struct block_device *bdev;  /* block device for I/O */
>>   struct dax_device   *dax_dev; /* dax_dev for dax operations */
>>  };
>
> Eventually we need to find a way to union the inline_data, bdev and
> dax_dev fields, but that can be left to later.

For gfs2, passing a private pointer from iomap_begin to iomap_end (for
the buffer head holding the inode) would help as well, but it's not
critical.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v4 06/11] iomap: Add write_{begin, end} iomap operations

2018-05-25 Thread Andreas Grünbacher
2018-05-18 18:04 GMT+02:00 Christoph Hellwig :
> On Tue, May 15, 2018 at 10:16:52AM +0200, Andreas Gruenbacher wrote:
>> > Yes.  And from a quick look we can easily handle "stuffed" aka inline
>> > data with the existing architecture.  We just need a new IOMAP_INLINE
>> > flag for the extent type. Details will need to be worked out how to
>> > pass that data, either a struct page or virtual address should probably
>> > do it.
>>
>> This is about data journaling, it has nothing to do with "stuffed"
>> files per se. With journaled data writes, we need to do some special
>> processing for each page so that the page ends up in the journal
>> before it gets written back to its proper on-disk location. It just
>> happens that since we need these per-page operations anyway, the
>> "unstuffing" and "re-stuffing" nicely fits in those operations as
>> well.
>
> Your series makes two uses of the callbacks - one use case is to
> deal with inline data.

Yes, it's not perfect.

> The right way to deal with that is to make
> inline data an explicit iomap type (done in my next posting of the
> buffer head removal series), and then have iomap_begin find the data,
> kmap it and return it in the iomap, with iomap_end doing any fixups.

I see what you mean. How would iomap_write_actor deal with
IOMAP_INLINE -- just skip iomap_write_begin and iomap_write_end and
grab the page from the iomap? I'm a bit worried about reintroducing
the deadlock that iomap_write_actor so carefully tries to avoid.

> For the data journaing I agree that we need a post-write per page
> callout, but I think this should be just a simple optional callout
> just for data journaling, which doesn't override the remaining
> write_end handling.
>
>> Been there. An earlier version of the patches did have a separate
>> stuffed write path and the unstuffing and re-stuffing didn't happen in
>> those per-page operations. The result was much messier overall.
>
> Pointer, please.

It's in this patch:

  https://patchwork.kernel.org/patch/10191007/

from this posting:

https://www.spinics.net/lists/linux-fsdevel/msg121318.html

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v4 11/11] iomap: Complete partial direct I/O writes synchronously

2018-05-18 Thread Andreas Grünbacher
2018-05-18 17:56 GMT+02:00 Christoph Hellwig :
> On Wed, May 16, 2018 at 10:27:18PM +0200, Andreas Gruenbacher wrote:
>> I/O completion triggers when dio->ref reaches zero, so
>> iomap_dio_bio_end_io will never evaluate submit.waiter before the
>> atomic_dec_and_test in iomap_dio_rw. We probably need an
>> smp_mb__before_atomic() before that atomic_dec_and_test to make sure
>> that iomap_dio_bio_end_io sees the right value of submit.waiter
>> though. Any concerns beyond that?
>
> I suspect in the end union usage might still be fine, but it still
> looks rather dangerous to scramble over the union that late.  It will
> also do things like overriding the last_queue pointer which we need
> initialized early for block polling.
>
> I suspect something like the variant below will work much better.
> This moves the wait_for_completion flag into struct iomap_dio
> (in an existing packing hole) and uses that in the completion
> handler.  It also always initializes the submit fields to prepare
> for this early, and makes all fallbacks synchronous (if we ever
> get a read fallback that would be the right thing to do).

Yes, that should work.

> diff --git a/fs/iomap.c b/fs/iomap.c
> index e4617ae1f8d6..54b28a58b7e3 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1327,6 +1327,7 @@ struct iomap_dio {
> atomic_tref;
> unsignedflags;
> int error;
> +   boolwait_for_completion;
>
> union {
> /* used during submission and for synchronous completion: */
> @@ -1430,7 +1431,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
> iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
>
> if (atomic_dec_and_test(&dio->ref)) {
> -   if (is_sync_kiocb(dio->iocb)) {
> +   if (dio->wait_for_completion) {
> struct task_struct *waiter = dio->submit.waiter;
>
> WRITE_ONCE(dio->submit.waiter, NULL);
> @@ -1646,13 +1647,12 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter 
> *iter,
> dio->end_io = end_io;
> dio->error = 0;
> dio->flags = 0;
> +   dio->wait_for_completion = is_sync_kiocb(iocb);
>
> dio->submit.iter = iter;
> -   if (is_sync_kiocb(iocb)) {
> -   dio->submit.waiter = current;
> -   dio->submit.cookie = BLK_QC_T_NONE;
> -   dio->submit.last_queue = NULL;
> -   }
> +   dio->submit.waiter = current;
> +   dio->submit.cookie = BLK_QC_T_NONE;
> +   dio->submit.last_queue = NULL;
>
> if (iov_iter_rw(iter) == READ) {
> if (pos >= dio->i_size)
> @@ -1702,7 +1702,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> dio_warn_stale_pagecache(iocb->ki_filp);
> ret = 0;
>
> -   if (iov_iter_rw(iter) == WRITE && !is_sync_kiocb(iocb) &&
> +   if (iov_iter_rw(iter) == WRITE && !dio->wait_for_completion &&
> !inode->i_sb->s_dio_done_wq) {
> ret = sb_init_dio_done_wq(inode->i_sb);
> if (ret < 0)
> @@ -1717,8 +1717,10 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> iomap_dio_actor);
> if (ret <= 0) {
> /* magic error code to fall back to buffered I/O */
> -   if (ret == -ENOTBLK)
> +   if (ret == -ENOTBLK) {
> +   dio->wait_for_completion = true;
> ret = 0;
> +   }
> break;
> }
> pos += ret;
> @@ -1739,7 +1741,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> dio->flags &= ~IOMAP_DIO_NEED_SYNC;

We still want to make sure the right value of dio->wait_for_completion
is seen by iomap_dio_bio_end_io:

smp_mb__before_atomic();

> if (!atomic_dec_and_test(&dio->ref)) {
> -   if (!is_sync_kiocb(iocb))
> +   if (!dio->wait_for_completion)
> return -EIOCBQUEUED;
>
> for (;;) {

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v3 5/8] gfs2: Implement iomap buffered write support

2018-04-06 Thread Andreas Grünbacher
2018-04-06 16:44 GMT+02:00 Bob Peterson :
> Hi,
>
> - Original Message -
> (snip)
>> +static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos, loff_t
>> length,
>> +   unsigned flags, struct iomap *iomap)
> (snip)
>> +{
>> + struct metapath mp = { .mp_aheight = 1, };
>> + struct gfs2_inode *ip = GFS2_I(inode);
>> + struct gfs2_sbd *sdp = GFS2_SB(inode);
>> + unsigned int data_blocks = 0, ind_blocks = 0, rblocks;
>> + bool unstuff, alloc_required;
>> + int ret;
>> +
>> + ret = gfs2_write_lock(inode);
>> + if (ret)
>> + return ret;
>> +
>> + unstuff = gfs2_is_stuffed(ip) &&
>> +   pos + length > gfs2_max_stuffed_size(ip);
>> +
>> + ret = gfs2_iomap_get(inode, pos, length, flags, iomap, &mp);
>> + if (ret)
>> + goto out_release;
>> +
>> + alloc_required = unstuff || iomap->type != IOMAP_MAPPED;
>> +
>> + if (alloc_required || gfs2_is_jdata(ip))
>> + gfs2_write_calc_reserv(ip, iomap->length, &data_blocks, 
>> &ind_blocks);
>
> It would be ideal if the iomap_get could tell us how many blocks are mapped
> and how many are unmapped, if it wouldn't cause unnecessary IO.
> If we could use the number of unmapped blocks rather than iomap->length, it 
> could
> potentially save us from reserving unnecessary journal space, for example, on
> rewrites (iow where the metadata for the writes already exists).
> (snip)

Yes, I'll change gfs2_iomap_get to return the number of blocks
gfs2_iomap_alloc should be able to allocate. It's a simple variant of
hole_size; the maximum will be an entire indirect block's worth of data.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-04-04 Thread Andreas Grünbacher
Herbert Xu wrote on Wed, Apr 4, 2018 at 17:51:

> On Wed, Apr 04, 2018 at 11:46:28AM -0400, Bob Peterson wrote:
> >
> > The patches look good. The big question is whether to add them to this
> > merge window while it's still open. Opinions?
>
> We're still hashing out the rhashtable interface so I don't think now is
> the time to rush things.


Fair enough. No matter how rhashtable_walk_peek changes, we'll still need
these two patches to fix the glock dump though.

Thanks,
Andreas


Re: [Cluster-devel] [PATCH] gfs2: select CONFIG_LIBCRC32C

2018-02-02 Thread Andreas Grünbacher
Hi Arnd,

2018-02-02 16:43 GMT+01:00 Arnd Bergmann :
> The new crc32c logic in gfs2_log_header_v2 causes a link
> error without libcrc32c:
>
> ERROR: "crc32c" [fs/gfs2/gfs2.ko] undefined!
>
> While the original patch selected CONFIG_CRYPTO_CRC32C to deal
> with this issue, it turned out to be the wrong symbol.
>
> Fixes: c1696fb85d33 ("GFS2: Introduce new gfs2_log_header_v2")
> Signed-off-by: Arnd Bergmann 
> ---
>  fs/gfs2/Kconfig | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/gfs2/Kconfig b/fs/gfs2/Kconfig
> index c0225d4b5435..3ed2b088dcfd 100644
> --- a/fs/gfs2/Kconfig
> +++ b/fs/gfs2/Kconfig
> @@ -3,8 +3,7 @@ config GFS2_FS
> depends on (64BIT || LBDAF)
> select FS_POSIX_ACL
> select CRC32
> -   select CRYPTO
> -   select CRYPTO_CRC32C
> +   select LIBCRC32C
> select QUOTACTL
> select FS_IOMAP
> help
> --
> 2.9.0
>

thanks for the fix, it's identical to what Bob has just asked Linus to
merge (https://lkml.org/lkml/2018/2/2/361).

Andreas



Re: [Cluster-devel] [PATCH 09/10] gfs2: Implement buffered iomap write support (1)

2018-01-11 Thread Andreas Grünbacher
2018-01-11 22:15 GMT+01:00 Andreas Gruenbacher :
> With the traditional page-based writes, blocks are allocated separately
> for each page written to.  With iomap writes, we can allocate a lot more
> blocks at once, with a fraction of the allocation overhead for each
> page.

Oops, please ignore this patch.

Thanks,
Andreas



Re: [Cluster-devel] gfs2 on-disk headers in user space

2015-09-11 Thread Andreas Grünbacher
2015-09-11 13:28 GMT+02:00 Andrew Price :
> That makes sense to me. I assume we would remove the #include
>  from gfs2_ondisk.h and replace it with our own user space
> types header. I'm worried about clashes using the kernel type names though,
> so we might need to process the header somehow to change them.

We could substitute the __beX types with uintX_t on import; also we
could warn if the system header exists but in a different version than
what's "cached" in gfs2-utils.

Andreas



Re: [Cluster-devel] [PATCH 2/2] gfs2_edit: Include dirent.de_rahead in directory listings

2015-09-10 Thread Andreas Grünbacher
Thanks!

Andreas



Re: [Cluster-devel] [PATCH] gfs2: Incomplete "sbstat" statistics?

2015-08-26 Thread Andreas Grünbacher
Bob,

2015-08-26 19:45 GMT+02:00 Bob Peterson :
> NAK. I'm pretty sure this indexing was intentional. The "0th" entry is just
> to print the entry number.

ah, I see now that the gfs2_gltype array contains a special first
entry that makes things balance out again. My mistake.

Thanks,
Andreas



Re: [Cluster-devel] [RFC 00/11] Inode security label invalidation

2015-08-24 Thread Andreas Grünbacher
Stephen,

2015-08-24 19:42 GMT+02:00 Stephen Smalley :
>> BACKGROUND
>>
>> Selinux currently assumes that, after initialization, inode->i_security 
>> always
>> represents the current security label of the inode.  This assumption works 
>> for
>> local file systems; any change of the label must go through setxattr (or
>> removexattr) which updates inode->i_security.
>>
>> On an nfs mount, other nodes can change the security label; there is no
>> immediate notification mechanism.  Other nodes will eventually notice a label
>> change because the label is transmitted as part of the reply of operations 
>> like
>> open. (A timeout for cached labels would help too; I'm not sure if the code
>> implements that.)
>>
>> Other file systems have different consistency models. For example, gfs2 
>> inodes
>> go "invalid" when a node drops the inode's glocks. When such an invalid inode
>> is accessed again, all the metadata must be read from disk again, including 
>> the
>> security label.
>>
>> For that case, the file system has no way of updating the security label 
>> before
>> selinux next uses it.  Things also don't fix themselves over time; when 
>> selinux
>> rejects access, the file system never notices.
>
> The current NFSv4 model is to call security_inode_notifysecctx() to
> notify the security module of the new label.  Does that not work for
> gfs2 or others?

Unfortunately not. In fact, NFS also cannot work fully correctly with
this infrastructure alone because it cannot invalidate cached security
labels once they expire.

> It is up to the filesystem client side code to actually
> detect the change and fetch the new value, then push it to the security
> module via the security_inode_notifysecctx() hook.

That is unfortunately not how non-local file systems work. Clients
don't normally get notified of changes performed by other nodes.
Instead, they do one of the following:

 * On NFS, inode attributes are cached opportunistically for a
   duration determined by the client. When they are needed
   and it turns out that the cached values have expired, they
   are re-fetched from the server. (See cache_validity and
   revalidate_inode in fs/nfs/; RFC 7530, Section 1.5: Client
   Caching and Delegation and Section 10: Client-Side
   Caching.)

   This is not true for security labels though; cached security labels
   are currently never expired.

 * GFS2 and some other file systems grab locks on inodes before
   caching data or metadata. When the locks on an inode are
   released, the cached data and metadata become invalid
   and must no longer be used.

File system clients can't actively poll their servers about metadata
changes; that would be way too costly. We really need a way to
invalidate security labels, and selinux needs to reload invalid
security labels when they are needed.

Thanks,
Andreas