Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

2019-11-27 Thread Linus Torvalds
On Wed, Nov 27, 2019 at 7:42 AM Steven Whitehouse  wrote:
>
> I'll leave the finer details to Andreas here, since it is his patch, and
> hopefully we can figure out a good path forward.

As mentioned, I don't _hate_ that patch (ok, I seem to have typoed it
and said that I don't "gate" it ;), so if that's what you guys really
want to do, I'm ok with it. But..

I do think you already get the data you need with the current code,
from the "short read" thing. So just changing the current generic read
function to check against the size first:

  --- a/mm/filemap.c
  +++ b/mm/filemap.c
  @@ -2021,9 +2021,9 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
          unsigned int prev_offset;
          int error = 0;

  -       if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
  +       if (unlikely(*ppos >= inode->i_size))
                  return 0;
  -       iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
  +       iov_iter_truncate(iter, inode->i_size);

          index = *ppos >> PAGE_SHIFT;
          prev_index = ra->prev_pos >> PAGE_SHIFT;

and you're done. Nice and clean.

Then in gfs2 you just notice the short read, and check at that point.
Sure, you'll also cut read-ahead to the old size boundary, but does
anybody _seriously_ believe that read-ahead matters when you hit the
"some other node write more data, we're reading past the old end"
case? I don't think that's the case.
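
A minimal sketch of that retry, for illustration only (the function
name is made up, and the commented-out glock calls are placeholders
for however gfs2 would revalidate i_size, so this is an assumption
about the shape of the code, not actual gfs2 code):

  static ssize_t gfs2_read_iter_sketch(struct kiocb *iocb, struct iov_iter *to)
  {
          ssize_t ret, ret2;

          /* Fast path: serve from the page cache, trusting cached i_size. */
          ret = generic_file_read_iter(iocb, to);
          if (ret < 0 || !iov_iter_count(to))
                  return ret;

          /*
           * Short (or empty) read: the cached i_size may be stale because
           * another node extended the file.  Take the glock, which
           * revalidates the inode size, then retry the remainder.
           */
          /* gfs2_glock_nq(&gh); -- placeholder: acquire shared glock */
          ret2 = generic_file_read_iter(iocb, to);
          /* gfs2_glock_dq(&gh); -- placeholder: drop the glock */
          if (ret2 > 0)
                  ret = (ret > 0 ? ret : 0) + ret2;
          else if (ret <= 0)
                  ret = ret2;
          return ret;
  }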

But I _can_ live with the patch that adds the extra "cached only" bit.
It just honestly feels pointless.

   Linus




Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

2019-11-27 Thread Andreas Gruenbacher
On Wed, Nov 27, 2019 at 4:42 PM Steven Whitehouse  wrote:
> Hi,
>
> On 25/11/2019 17:05, Linus Torvalds wrote:
> > On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse  wrote:
> >> Linus, is that roughly what you were thinking of?
> > So the concept looks ok, but I don't really like the new flags as they
> > seem to be gfs2-specific rather than generic.
> >
> > That said, I don't _gate_ them either, since they aren't in any
> > critical code sequence, and it's not like they do anything really odd.
> >
> > I still think the whole gfs2 approach is broken. You're magically ok
> > with using stale data from the cache just because it's cached, even if
> > another client might have truncated the file or something.
>
> If another node tries to truncate the file, that will require an
> exclusive glock, and in turn that means all the other nodes will
> have to drop their glock(s), shared or exclusive. That process
> invalidates the page cache on those nodes, such that any further
> requests on those nodes will find the cache empty and have to call into
> the filesystem.
>
> If a page is truncated on another node, the truncate will take place
> once the local node gives up its glock, after any copying (e.g. for
> read) has completed. The local node will then have to reread any
> data relating to new pages or return an error in case the next page to
> be read has vanished due to the truncate. It is a pretty small window,
> and the advantage is that in cases where the page is in cache, we can
> directly use the cached page without having to call into the filesystem
> at all. So it is page atomic in that sense.
>
> The overall aim here is to avoid taking (potentially slow) cluster locks
> when at all possible, yet at the same time deliver close to local fs
> semantics whenever we can. You can think of GFS2's glock concept (at
> least as far as the inodes we are discussing here) as providing a
> combination of (page) cache control and cluster (dlm) locking.
>
> >
> > So you're ok with saying "the file used to be X bytes in size, so
> > we'll just give you this data because we trust that the X is correct".
> >
> > But you're not ok to say "oh, the file used to be X bytes in size, but
> > we don't want to give you a short read because it might not be correct
> > any more".
> >
> > See the disconnect? You trust the size in one situation, but not in another 
> > one.
>
> Well, we are not trusting the size at all... the original algorithm
> worked entirely off "is this page in cache and uptodate?", precisely
> because we know the size in the inode might be out of date if we are
> not currently holding a glock in either shared or
> exclusive mode. We also know that if there is a page in cache and
> uptodate then we must be holding the glock too.
>
>
> >
> > I also don't really see that you *need* the new flag at all. Since
> > you're going to do a speculative read and then a real read anyway, and
> > since the only thing that you seem to care about is the file size
> > (because the *data* you will trust if it is cached), then why don't
> > you just use the *existing* generic read, and *IFF* you get a
> > truncated return value, then you go and double-check that the file
> > hasn't changed in size?
> >
> > See what I'm saying? I think gfs2 is being very inconsistent in when
> > it trusts the file size, and I don't see that you even need the new
> > behavior that patch gives, because you might as well just use the
> > existing code (just move the i_size check earlier, and then teach gfs2
> > to double-check the "I didn't get as much as I expected" case).

We can identify short reads, but we won't get information about
readahead back from generic_file_read_iter or filemap_fault. We could
try to work around this with filesystem specific flags for ->readpage
and ->readpages, but that would break down with multiple concurrent
readers in addition to being a real mess. I'm currently out of better
ideas that avoid duplicating the generic code.

> >   Linus
>
> I'll leave the finer details to Andreas here, since it is his patch, and
> hopefully we can figure out a good path forward. We are perhaps also a
> bit reluctant to go off and (nearly) duplicate code that is already in
> the core vfs library functions, since that often leads to things getting
> out of sync (our implementation of ->writepages is one case where that
> happened in the past) and missing important bug fixes/features in some
> cases. Hopefully though we can iterate on this a bit and come up with
> something which will resolve all the issues,
>
> Steve.

Thanks,
Andreas




Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

2019-11-27 Thread Steven Whitehouse

Hi,

On 25/11/2019 17:05, Linus Torvalds wrote:
> On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse  wrote:
>> Linus, is that roughly what you were thinking of?
>
> So the concept looks ok, but I don't really like the new flags as they
> seem to be gfs2-specific rather than generic.
>
> That said, I don't _gate_ them either, since they aren't in any
> critical code sequence, and it's not like they do anything really odd.
>
> I still think the whole gfs2 approach is broken. You're magically ok
> with using stale data from the cache just because it's cached, even if
> another client might have truncated the file or something.


If another node tries to truncate the file, that will require an
exclusive glock, and in turn that means all the other nodes will
have to drop their glock(s), shared or exclusive. That process
invalidates the page cache on those nodes, such that any further 
requests on those nodes will find the cache empty and have to call into 
the filesystem.
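
To make that sequence concrete, here is a rough sketch of the
invalidation step, loosely modelled on gfs2's inode glock operations
(the function name is illustrative; the real code lives in gfs2's
glops and does more than this):

  static void demote_inode_glock_sketch(struct inode *inode)
  {
          /* Write back anything dirty while we still hold the lock... */
          filemap_write_and_wait(inode->i_mapping);

          /*
           * ...then drop every cached page, so the next access on this
           * node misses the cache and has to call back into the
           * filesystem, which reacquires the glock.
           */
          truncate_inode_pages(inode->i_mapping, 0);
  }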


If a page is truncated on another node, the truncate will take place
once the local node gives up its glock, after any copying (e.g. for
read) has completed. The local node will then have to reread any
data relating to new pages or return an error in case the next page to 
be read has vanished due to the truncate. It is a pretty small window, 
and the advantage is that in cases where the page is in cache, we can 
directly use the cached page without having to call into the filesystem 
at all. So it is page atomic in that sense.


The overall aim here is to avoid taking (potentially slow) cluster locks 
when at all possible, yet at the same time deliver close to local fs 
semantics whenever we can. You can think of GFS2's glock concept (at 
least as far as the inodes we are discussing here) as providing a 
combination of (page) cache control and cluster (dlm) locking.




> So you're ok with saying "the file used to be X bytes in size, so
> we'll just give you this data because we trust that the X is correct".
>
> But you're not ok to say "oh, the file used to be X bytes in size, but
> we don't want to give you a short read because it might not be correct
> any more".
>
> See the disconnect? You trust the size in one situation, but not in
> another one.


Well, we are not trusting the size at all... the original algorithm
worked entirely off "is this page in cache and uptodate?", precisely
because we know the size in the inode might be out of date if we are
not currently holding a glock in either shared or
exclusive mode. We also know that if there is a page in cache and 
uptodate then we must be holding the glock too.
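
As a rough illustration of that invariant, a "cached pages only" probe
could look like the sketch below. The actual patch threads a flag
through the generic read path instead, so this standalone helper is
hypothetical (and uses the pre-folio page APIs of the time):

  static size_t read_cached_page_sketch(struct address_space *mapping,
                                        loff_t pos, char *buf, size_t len)
  {
          struct page *page = find_get_page(mapping, pos >> PAGE_SHIFT);
          size_t offset = pos & (PAGE_SIZE - 1);
          size_t n = min_t(size_t, len, PAGE_SIZE - offset);
          void *kaddr;

          if (!page)
                  return 0;       /* not cached: caller must take the glock */
          if (!PageUptodate(page)) {
                  put_page(page); /* cached, but not valid data */
                  return 0;
          }
          kaddr = kmap(page);
          memcpy(buf, kaddr + offset, n);
          kunmap(page);
          put_page(page);
          return n;
  }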





> I also don't really see that you *need* the new flag at all. Since
> you're going to do a speculative read and then a real read anyway, and
> since the only thing that you seem to care about is the file size
> (because the *data* you will trust if it is cached), then why don't
> you just use the *existing* generic read, and *IFF* you get a
> truncated return value, then you go and double-check that the file
> hasn't changed in size?
>
> See what I'm saying? I think gfs2 is being very inconsistent in when
> it trusts the file size, and I don't see that you even need the new
> behavior that patch gives, because you might as well just use the
> existing code (just move the i_size check earlier, and then teach gfs2
> to double-check the "I didn't get as much as I expected" case).
>
>   Linus


I'll leave the finer details to Andreas here, since it is his patch, and 
hopefully we can figure out a good path forward. We are perhaps also a 
bit reluctant to go off and (nearly) duplicate code that is already in 
the core vfs library functions, since that often leads to things getting 
out of sync (our implementation of ->writepages is one case where that 
happened in the past) and missing important bug fixes/features in some 
cases. Hopefully though we can iterate on this a bit and come up with 
something which will resolve all the issues,


Steve.