Re: Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache

2021-03-09 Thread David Howells
Dave Chinner  wrote:

> > > There was a discussion about fsyncing a range of files on LSFMM [1].
> > > In the last comment on the article dchinner argues why we already have
> > > that API (and now also with io_uring()), but AFAIK, we do not have a
> > > useful wait_for_sync() API. And it doesn't need to be exposed to
> > > userspace at all.
> > > 
> > > [1] https://lwn.net/Articles/789024/
> > 
> > This sounds like an interesting idea.  Actually, what I probably want is a
> > notification to say that a particular object has been completely sync'd to
> > disk, metadata and all.
> 
> This isn't hard to do yourself in the kernel. All it takes is a
> workqueue to run vfs_fsync() calls asynchronously and for the work
> to queue a local notification/wakeup when the fsync completes...
> 
> That's all aio_fsync() does - the notification it queues on
> completion is the AIO completion event for userspace - so I think
> you could do this in about 50 lines of code if you really needed
> it...

I was thinking more in terms of passively finding out when metadata has been
flushed to disk rather than actively forcing it.  Obviously I can manually
flush from a worker thread, but that ties up a thread per file I want to
flush (unless I want to do a higher-level sync).

Btw, looking at aio_fsync(), is there any reason it copies the current creds
rather than just taking a ref on them?  (Granted, this may not be a question
for you)

> > However, there are some performance problems arising in my fscache-iter
> > branch:
> > 
> >  (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
> >  setxattr).
> 
> Async pipelines using unbound workqueues are your friend.

Maybe.  I could just throw everything into a workqueue and let the workqueue
deal with it.  There still have to be synchronisation points, though - I can't
schedule a cache-write from a server-read to the cache following a 3rd-party
induced invalidation until after the invalidation has happened - and that
holds up userspace from writing to the cache.  But maybe it will work.

Btw, how expensive is it to throw an operation off to a workqueue versus doing
it in thread?  Particularly if it's a synchronous op that the thread is going
to have to wait for (e.g. write_begin()).

> >  (2) It's retaining a lot of open file structs on cache files.  Cachefiles
> >  opens the file when it's first asked to access it and retains that till
> >  the cookie is relinquished or the cache withdrawn (the file* doesn't
> >  contribute to ENFILE/EMFILE but it still eats memory).
> 
> Sounds similar to the problem that the NFSd open file cache solves.
> (fs/nfsd/filecache.c)

Looks similar-ish to what I was thinking of: having a queue of
currently-not-in-use cookies to go through and commit and close.

> >  but if the file is truncated
> >  larger, that excess data now becomes part of the file.
> 
> Keep the actual file size in your tracking xattr.

I do that, but it doesn't help entirely.  If someone truncates the file larger
and then writes non-contiguously, the problem occurs.

I've tried truncating the file down and then truncating it up, but that
requires two synchronous ops - though the latter is relatively cheap.  I've
also tried fallocate() to clear the block.  What I've found is that the next
DIO write then has to sync because these operations may read data into the
pagecache of the backing file.
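
For reference, punching out the overhang rather than truncating twice might
look something like this sketch (the helper name is made up; as noted above,
it doesn't avoid the sync before the next DIO write if pages were pulled into
the backing file's pagecache):

    #include <linux/falloc.h>
    #include <linux/fs.h>

    /* Sketch: discard backing-file content beyond the real (netfs) file size
     * by punching a hole instead of truncating down and back up again.
     */
    static int cache_trim_excess(struct file *backing, loff_t real_size)
    {
            loff_t backing_size = i_size_read(file_inode(backing));

            if (backing_size <= real_size)
                    return 0;

            /* FALLOC_FL_PUNCH_HOLE must be used with FALLOC_FL_KEEP_SIZE */
            return vfs_fallocate(backing,
                                 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                                 real_size, backing_size - real_size);
    }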

Apart from clearing the tail of a page on writing, it might be better for me
to read the data into a spare page, clear the tail and write it back.

> >  Possibly it's sufficient to just clear the excess page space before
> >  writing, but that doesn't necessarily stop a writable mmap from
> >  scribbling on it.
> 
> We can't stop mmap from scribbling in it. All filesystems have this
> problem, so to prevent data leaks we have to zero the post-eof tail
> region on every write of the EOF block, anyway.

I meant an mmap scribbling on it after it's been cleared - but I guess taking
away the PTE-writeable flag and making page_mkwrite() wait should solve that.

> >  (4) Committing outstanding cache metadata at cache withdrawal or netfs
> >  unmount.  I've previously mentioned this: it ends up with a whole
> >  slew of synchronous metadata changes being committed to the cache in
> >  one go (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile)
> >  - and this can take quite a long time.  The cache needs to be more
> >  proactive in getting stuff committed as it goes along.
> 
> Workqueues give you an easy mechanism for async dispatch and
> concurrency for synchronous operations. This is a largely solved
> problem...

Yes and no.  Yes, I can fan out the number of threads doing the committing,
but there's still a limit on the I/O bandwidth - and a lot of the operations
still have to hit the disk in the right order.  It still stuffs up the user
experience if the cache eats up the entirety of the disk I/O for a 

Re: fscache: Redesigning the on-disk cache

2021-03-09 Thread David Howells
Dave Chinner  wrote:

> > > With ->fiemap() you can at least make the distinction between a non
> > > existing and an UNWRITTEN extent.
> > 
> > I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> > assertion is that the cache can't rely on the backing filesystem's metadata
> > because these can arbitrarily insert or remove blocks of zeros to bridge or
> > split extents.
> 
> Well, that's not the big problem. The issue that makes FIEMAP
> unusable for determining if there is user data present in a file is
> that on-disk extent maps aren't exactly coherent with in-memory user
> data state.
> 
> That is, we can have a hole on disk with delalloc user data in
> memory.  There's user data in the file, just not on disk. Same goes
> for unwritten extents - there can be dirty data in memory over an
> unwritten extent, and it won't get converted to written until the
> data is written back and the filesystem runs a conversion
> transaction.
> 
> So, yeah, if you use FIEMAP to determine where data lies in a file
> that is being actively modified, you're going to get corrupt data
> sooner rather than later.  SEEK_HOLE/DATA are coherent with in
> memory user data, so don't have this problem.

I thought you and/or Christoph said it *was* a problem to use the backing
filesystem's metadata to track presence of data in the cache because the
filesystem (or its tools) can arbitrarily insert blocks of zeros to
bridge/break up extents.

If that is the case, then that is a big problem, and SEEK_HOLE/DATA won't
suffice.

If it's not a problem - maybe if I can set a mark on a file to tell the
filesystem and tools not to do that - then that would obviate the need for me
to store my own maps.

David



Re: Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache

2021-03-08 Thread Matthew Wilcox
On Tue, Mar 09, 2021 at 09:32:47AM +1100, Dave Chinner wrote:
> On Mon, Mar 08, 2021 at 11:28:41AM +, David Howells wrote:
> >  Possibly it's sufficient to just clear the excess page space before
> >  writing, but that doesn't necessarily stop a writable mmap from
> >  scribbling on it.
> 
> We can't stop mmap from scribbling in it. All filesystems have this
> problem, so to prevent data leaks we have to zero the post-eof tail
> region on every write of the EOF block, anyway.

That's certainly one approach.  Another would be to zero it during the I/O
completion handler.  It depends whether you can trust the last writer or
not (eg what do we do with an isofs file that happens to contain garbage
after the last byte in the file?)


Re: Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache

2021-03-08 Thread Dave Chinner
On Mon, Mar 08, 2021 at 11:28:41AM +, David Howells wrote:
> Amir Goldstein  wrote:
> 
> > > But after I've written and sync'd the data, I set the xattr to mark the
> > > file not open.  At the moment I'm doing this too lazily, only doing it
> > > when a netfs file gets evicted or when the cache gets withdrawn, but I
> > > really need to add a queue of objects to be sealed as they're closed.  The
> > > balance is working out how often to do the sealing as something like a
> > > shell script can do a lot of consecutive open/write/close ops.
> > 
> > You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
> > For example, xfs keeps the "LSN" on each inode, so once the transaction
> > with some LSN has been committed, all the relevant inodes, if not dirty, can
> > be declared as synced, without having to call fsync() on any file and
> > without having to force transaction commit or any IO at all.
> > 
> > Since fscache takes care of submitting the IO, and it shouldn't care about
> > any specific time that the data/metadata hits the disk(?), you can make use
> > of the existing periodic writeback and rolling transaction commit and only
> > ever need to wait for that to happen before marking cache files "closed".
> > 
> > There was a discussion about fsyncing a range of files on LSFMM [1].
> > In the last comment on the article dchinner argues why we already have that
> > API (and now also with io_uring()), but AFAIK, we do not have a useful
> > wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
> > 
> > [1] https://lwn.net/Articles/789024/
> 
> This sounds like an interesting idea.  Actually, what I probably want is a
> notification to say that a particular object has been completely sync'd to
> disk, metadata and all.

This isn't hard to do yourself in the kernel. All it takes is a
workqueue to run vfs_fsync() calls asynchronously and for the work
to queue a local notification/wakeup when the fsync completes...

That's all aio_fsync() does - the notification it queues on
completion is the AIO completion event for userspace - so I think
you could do this in about 50 lines of code if you really needed
it...
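
For illustration, a minimal sketch of that pattern might look like the
following (all names here are illustrative, not existing cachefiles
structures, and error handling is elided):

    #include <linux/fs.h>
    #include <linux/workqueue.h>
    #include <linux/completion.h>

    /* Sketch of "async fsync + local notification" via a workqueue. */
    struct cache_fsync_work {
            struct work_struct      work;
            struct file             *file;
            int                     error;
            struct completion       done;   /* or raise whatever notification you like */
    };

    static void cache_fsync_worker(struct work_struct *work)
    {
            struct cache_fsync_work *w =
                    container_of(work, struct cache_fsync_work, work);

            w->error = vfs_fsync(w->file, 0);       /* flush data and metadata */
            complete(&w->done);                     /* the "it's on disk now" event */
    }

    static void cache_fsync_async(struct workqueue_struct *wq,
                                  struct cache_fsync_work *w, struct file *file)
    {
            w->file = file;
            w->error = 0;
            init_completion(&w->done);
            INIT_WORK(&w->work, cache_fsync_worker);
            queue_work(wq, &w->work);       /* wq from alloc_workqueue(..., WQ_UNBOUND, 0) */
    }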

> However, there are some performance problems arising in my fscache-iter
> branch:
> 
>  (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
>  setxattr).

Async pipelines using unbound workqueues are your friend.
> 
>  (2) It's retaining a lot of open file structs on cache files.  Cachefiles
>  opens the file when it's first asked to access it and retains that till
>  the cookie is relinquished or the cache withdrawn (the file* doesn't
>  contribute to ENFILE/EMFILE but it still eats memory).

Sounds similar to the problem that the NFSd open file cache solves.
(fs/nfsd/filecache.c)

>  (3) Trimming excess data from the end of the cache file.  The problem with
>  using DIO to write to the cache is that the write has to be rounded up to
>  a multiple of the backing fs DIO blocksize,

Actually, a multiple of the logical sector size of the backing
device behind the backing filesystem.
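
As a hedged sketch, the rounding Dave describes might be computed like this
(s_bdev can be NULL on non-block filesystems, so the fallback here is purely
illustrative):

    #include <linux/blkdev.h>
    #include <linux/fs.h>

    /* Sketch: round a cache DIO write length up to the logical block size of
     * the device under the backing filesystem.
     */
    static size_t cache_dio_round_up(struct file *backing, size_t len)
    {
            struct super_block *sb = file_inode(backing)->i_sb;
            unsigned int lbs = sb->s_bdev ?
                    bdev_logical_block_size(sb->s_bdev) : PAGE_SIZE;

            return round_up(len, lbs);
    }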

>  but if the file is truncated
>  larger, that excess data now becomes part of the file.

Keep the actual file size in your tracking xattr.

>  Possibly it's sufficient to just clear the excess page space before
>  writing, but that doesn't necessarily stop a writable mmap from
>  scribbling on it.

We can't stop mmap from scribbling in it. All filesystems have this
problem, so to prevent data leaks we have to zero the post-eof tail
region on every write of the EOF block, anyway.

>  (4) Committing outstanding cache metadata at cache withdrawal or netfs
>  unmount.  I've previously mentioned this: it ends up with a whole slew of
>  synchronous metadata changes being committed to the cache in one go
>  (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile) - and this
>  can take quite a long time.  The cache needs to be more proactive in
>  getting stuff committed as it goes along.

Workqueues give you an easy mechanism for async dispatch and
concurrency for synchronous operations. This is a largely solved
problem...

>  (5) Attaching to an object requires a pathwalk to it (normally only two
>  steps) and then reading various xattrs on it - all synchronous, but can
>  be punted to a background threadpool.

a.k.a. punting to a workqueue :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: fscache: Redesigning the on-disk cache

2021-03-08 Thread Dave Chinner
On Mon, Mar 08, 2021 at 09:13:55AM +, David Howells wrote:
> Amir Goldstein  wrote:
> 
> > >  (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> > >   file for every whole operation (which may combine reads and writes).
> > 
> > I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
> > to keep track of present data, it's hard to distinguish between an
> > invalid cached range and a valid "cached hole".
> 
> I wasn't exactly intending to permit caching over NFS.  That leads to fun
> making sure that the superblock you're caching isn't the one that has the
> cache in it.
> 
> However, we will need to handle hole-punching being done on a cached netfs,
> even if that's just to completely invalidate the cache for that file.
> 
> > With ->fiemap() you can at least make the distinction between a non existing
> > and an UNWRITTEN extent.
> 
> I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> assertion is that the cache can't rely on the backing filesystem's metadata
> because these can arbitrarily insert or remove blocks of zeros to bridge or
> split extents.

Well, that's not the big problem. The issue that makes FIEMAP
unusable for determining if there is user data present in a file is
that on-disk extent maps aren't exactly coherent with in-memory user
data state.

That is, we can have a hole on disk with delalloc user data in
memory.  There's user data in the file, just not on disk. Same goes
for unwritten extents - there can be dirty data in memory over an
unwritten extent, and it won't get converted to written until the
data is written back and the filesystem runs a conversion
transaction.

So, yeah, if you use FIEMAP to determine where data lies in a file
that is being actively modified, you're going to get corrupt data
sooner rather than later.  SEEK_HOLE/DATA are coherent with in
memory user data, so don't have this problem.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: fscache: Redesigning the on-disk cache

2021-03-08 Thread David Howells
J. Bruce Fields  wrote:

> On Mon, Mar 08, 2021 at 09:13:55AM +, David Howells wrote:
> > Amir Goldstein  wrote:
> > > With ->fiemap() you can at least make the distinction between a non existing
> > > and an UNWRITTEN extent.
> > 
> > I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> > assertion is that the cache can't rely on the backing filesystem's metadata
> > because these can arbitrarily insert or remove blocks of zeros to bridge or
> > split extents.
> 
> Could you instead make some sort of explicit contract with the
> filesystem?  Maybe you'd flag it at mkfs time and query for it before
> allowing a filesystem to be used for fscache.  You don't need every
> filesystem to support fscache, right, just one acceptable one?

I've asked about that, but the filesystem maintainers are reluctant to do
that.

Something might be possible in ext4 using direct access to jbd2, though I
don't know exactly what facilities that offers.

David



Re: fscache: Redesigning the on-disk cache

2021-03-08 Thread J. Bruce Fields
On Mon, Mar 08, 2021 at 09:13:55AM +, David Howells wrote:
> Amir Goldstein  wrote:
> > With ->fiemap() you can at least make the distinction between a non existing
> > and an UNWRITTEN extent.
> 
> I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> assertion is that the cache can't rely on the backing filesystem's metadata
> because these can arbitrarily insert or remove blocks of zeros to bridge or
> split extents.

Could you instead make some sort of explicit contract with the
filesystem?  Maybe you'd flag it at mkfs time and query for it before
allowing a filesystem to be used for fscache.  You don't need every
filesystem to support fscache, right, just one acceptable one?

--b.


Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache

2021-03-08 Thread David Howells
Amir Goldstein  wrote:

> > But after I've written and sync'd the data, I set the xattr to mark the
> > file not open.  At the moment I'm doing this too lazily, only doing it
> > when a netfs file gets evicted or when the cache gets withdrawn, but I
> > really need to add a queue of objects to be sealed as they're closed.  The
> > balance is working out how often to do the sealing as something like a
> > shell script can do a lot of consecutive open/write/close ops.
> 
> You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
> For example, xfs keeps the "LSN" on each inode, so once the transaction
> with some LSN has been committed, all the relevant inodes, if not dirty, can
> be declared as synced, without having to call fsync() on any file and without
> having to force transaction commit or any IO at all.
> 
> Since fscache takes care of submitting the IO, and it shouldn't care about any
> specific time that the data/metadata hits the disk(?), you can make use of the
> existing periodic writeback and rolling transaction commit and only ever need
> to wait for that to happen before marking cache files "closed".
> 
> There was a discussion about fsyncing a range of files on LSFMM [1].
> In the last comment on the article dchinner argues why we already have that
> API (and now also with io_uring()), but AFAIK, we do not have a useful
> wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
> 
> [1] https://lwn.net/Articles/789024/

This sounds like an interesting idea.  Actually, what I probably want is a
notification to say that a particular object has been completely sync'd to
disk, metadata and all.

I'm not sure that io_uring is particularly usable from within the kernel,
though.

> If I were you, I would try to avoid re-implementing a journaled filesystem or
> a database for fscache and try to make use of crash consistency guarantees
> that filesystems already provide.
> Namely, use the data dependency already provided by temp files.
> It doesn't need to be one temp file per cached file.
> 
> Always easier said than done ;-)

Yes.

There are a number of considerations I have to deal with, and they're somewhat
at odds with each other:

 (1) I need to record what data I have stored from a file.

 (2) I need to record where I stored the data.

 (3) I need to make sure that I don't see old data.

 (4) I need to make sure that I don't see data in the wrong file.

 (5) I need to make sure I lose as little as possible on a crash.

 (6) I want to be able to record what changes were made in the event we're
 disconnected from the server.

For my fscache-iter branch, (1) is done with a map in an xattr, but I only
cache up to 1G in a file at the moment; (2), (4) and, to some extent (5), are
handled by the backing fs; (3) is handled by tagging the file and storing
coherency data in an xattr (though tmpfiles are used on full invalidation).
(6) is not yet supported.

For upstream, (1), (2), (4) and to some extent (5) are handled through the
backing fs.  (3) is handled by storing coherency data in an xattr and
truncating the file on invalidation; (6) is not yet supported.

However, there are some performance problems arising in my fscache-iter
branch:

 (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
 setxattr).

 (2) It's retaining a lot of open file structs on cache files.  Cachefiles
 opens the file when it's first asked to access it and retains that till
 the cookie is relinquished or the cache withdrawn (the file* doesn't
 contribute to ENFILE/EMFILE but it still eats memory).

 I can mitigate this by closing much sooner, perhaps opening the file for
 each operation - but at the cost of having to spend time doing more opens
 and closes.  What's in upstream gets away without having to do open/close
 for reads because it calls readpage.

 Alternatively, I can have a background file closer - which requires an
 LRU queue.  This could be combined with a file "sealer".

 Deferred writeback on the netfs starting writes to the cache makes this
 more interesting as I have to retain the interest on the cache object
 beyond the netfs file being closed.

 (3) Trimming excess data from the end of the cache file.  The problem with
 using DIO to write to the cache is that the write has to be rounded up to
 a multiple of the backing fs DIO blocksize, but if the file is truncated
 larger, that excess data now becomes part of the file.

 Possibly it's sufficient to just clear the excess page space before
 writing, but that doesn't necessarily stop a writable mmap from
 scribbling on it.

 (4) Committing outstanding cache metadata at cache withdrawal or netfs
 unmount.  I've previously mentioned this: it ends up with a whole slew of
 synchronous metadata changes being committed to the cache in one go
 (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile) - 

Re: fscache: Redesigning the on-disk cache

2021-03-08 Thread Amir Goldstein
On Mon, Mar 8, 2021 at 11:14 AM David Howells  wrote:
>
> Amir Goldstein  wrote:
>
> > >  (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> > >   file for every whole operation (which may combine reads and writes).
> >
> > I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
> > to keep track of present data, it's hard to distinguish between an
> > invalid cached range and a valid "cached hole".
>
> I wasn't exactly intending to permit caching over NFS.  That leads to fun
> making sure that the superblock you're caching isn't the one that has the
> cache in it.
>
> However, we will need to handle hole-punching being done on a cached netfs,
> even if that's just to completely invalidate the cache for that file.
>
> > With ->fiemap() you can at least make the distinction between a non existing
> > and an UNWRITTEN extent.
>
> I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> assertion is that the cache can't rely on the backing filesystem's metadata
> because these can arbitrarily insert or remove blocks of zeros to bridge or
> split extents.
>
> > You didn't say much about crash consistency or durability requirements of
> > the cache. Since cachefiles only syncs the cache on shutdown, I guess you
> > rely on the hosting filesystem to provide the required ordering guarantees.
>
> There's an xattr on each file in the cache to record the state.  I use this to
> mark a cache file "open".  If, when I look up a file, the file is marked open,
> it is just discarded at the moment.
>
> Now, there are two types of data stored in the cache: data that has to be
> stored as a single complete blob and is replaced as such (e.g. symlinks and
> AFS dirs) and data that might be randomly modified (e.g. regular files).
>
> For the former, I have code, though in yet another branch, that writes this in
> a tmpfile, sets the xattrs and then uses vfs_link(LINK_REPLACE) to cut over.
>
> For the latter, that's harder to do as it would require copying the data to
> the tmpfile before we're allowed to modify it.  However, if it's possible to
> create a tmpfile that's a CoW version of a data file, I could go down that
> route.
>
> But after I've written and sync'd the data, I set the xattr to mark the file
> not open.  At the moment I'm doing this too lazily, only doing it when a netfs
> file gets evicted or when the cache gets withdrawn, but I really need to add a
> queue of objects to be sealed as they're closed.  The balance is working out
> how often to do the sealing as something like a shell script can do a lot of
> consecutive open/write/close ops.
>

You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
For example, xfs keeps the "LSN" on each inode, so once the transaction
with some LSN has been committed, all the relevant inodes, if not dirty, can
be declared as synced, without having to call fsync() on any file and without
having to force transaction commit or any IO at all.

Since fscache takes care of submitting the IO, and it shouldn't care about any
specific time that the data/metadata hits the disk(?), you can make use of the
existing periodic writeback and rolling transaction commit and only ever need
to wait for that to happen before marking cache files "closed".

There was a discussion about fsyncing a range of files on LSFMM [1].
In the last comment on the article dchinner argues why we already have that
API (and now also with io_uring()), but AFAIK, we do not have a useful
wait_for_sync() API. And it doesn't need to be exposed to userspace at all.

[1] https://lwn.net/Articles/789024/
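
As a rough illustration of the VFS-visible half of that idea, a passive "has
periodic writeback already caught up with this inode?" check might look like
the sketch below.  It is not the LSN-based XFS mechanism described above (a
clean i_state by itself says nothing about whether the journal has committed,
which is where the per-fs LSN tracking would come in), and the function name
is hypothetical:

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /* Sketch: report whether an inode currently has no dirty data, no dirty
     * metadata and no writeback in flight, without forcing any I/O.
     */
    static bool cache_inode_is_synced(struct inode *inode)
    {
            bool clean;

            spin_lock(&inode->i_lock);
            clean = !(inode->i_state & I_DIRTY_ALL);
            spin_unlock(&inode->i_lock);

            return clean &&
                   !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK);
    }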

> > Anyway, how are those ordering requirements going to be handled when entire
> > indexing is in a file? You'd practically need to re-implement a filesystem
>
> Yes, the thought has occurred to me too.  I would be implementing a "simple"
> filesystem - and we have lots of those:-/.  The most obvious solution is to
> use the backing filesystem's metadata - except that that's not possible.
>
> > journal or only write cache updates to a temp file that can be discarded at
> > any time?
>
> It might involve keeping a bitmap of "open" blocks.  Those blocks get
> invalidated when the cache restarts.  The simplest solution would be to wipe
> the entire cache in such a situation, but that goes against one of the
> important features I want out of it.
>
> Actually, a journal of open and closed blocks might be better, though all I
> really need to store for each block is a 32-bit number.
>
> It's a particular problem if I'm doing DIO to the data storage area but
> buffering the changes to the metadata.  Further, the metadata and data might
> be on different media, just to add to the complexity.
>
> Another possibility is only to cull blocks when the parent file is culled.
> That probably makes more sense: as long as the file is registered as culled on
> disk first and I don't reuse the file slot too quickly, I can write to the
> data 

Re: fscache: Redesigning the on-disk cache

2021-03-08 Thread David Howells
Amir Goldstein  wrote:

> >  (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> >   file for every whole operation (which may combine reads and writes).
> 
> I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
> to keep track of present data, it's hard to distinguish between an
> invalid cached range and a valid "cached hole".

I wasn't exactly intending to permit caching over NFS.  That leads to fun
making sure that the superblock you're caching isn't the one that has the
cache in it.

However, we will need to handle hole-punching being done on a cached netfs,
even if that's just to completely invalidate the cache for that file.

> With ->fiemap() you can at least make the distinction between a non existing
> and an UNWRITTEN extent.

I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
assertion is that the cache can't rely on the backing filesystem's metadata
because these can arbitrarily insert or remove blocks of zeros to bridge or
split extents.

> You didn't say much about crash consistency or durability requirements of the
> cache. Since cachefiles only syncs the cache on shutdown, I guess you
> rely on the hosting filesystem to provide the required ordering guarantees.

There's an xattr on each file in the cache to record the state.  I use this to
mark a cache file "open".  If, when I look up a file, the file is marked open,
it is just discarded at the moment.

Now, there are two types of data stored in the cache: data that has to be
stored as a single complete blob and is replaced as such (e.g. symlinks and
AFS dirs) and data that might be randomly modified (e.g. regular files).

For the former, I have code, though in yet another branch, that writes this in
a tmpfile, sets the xattrs and then uses vfs_link(LINK_REPLACE) to cut over.

For the latter, that's harder to do as it would require copying the data to
the tmpfile before we're allowed to modify it.  However, if it's possible to
create a tmpfile that's a CoW version of a data file, I could go down that
route.

But after I've written and sync'd the data, I set the xattr to mark the file
not open.  At the moment I'm doing this too lazily, only doing it when a netfs
file gets evicted or when the cache gets withdrawn, but I really need to add a
queue of objects to be sealed as they're closed.  The balance is working out
how often to do the sealing as something like a shell script can do a lot of
consecutive open/write/close ops.
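
To make that concrete, a sketch of the seal step might look like this.  The
xattr name and the blob layout are made up for illustration (not the real
cachefiles format), and the five-argument vfs_setxattr() shown is the
pre-5.12 form; later kernels add a mount-idmapping argument:

    #include <linux/types.h>
    #include <linux/xattr.h>

    /* Sketch: state blob written to an xattr on each cache file. */
    struct cache_state_xattr {
            __u8    open;                   /* 1 = open/dirty, 0 = sealed */
            __u8    coherency[16];          /* netfs coherency data, e.g. AFS data version */
            __le64  actual_size;            /* real file size, independent of DIO round-up */
    } __packed;

    static int cache_seal_file(struct dentry *dentry,
                               const struct cache_state_xattr *state)
    {
            /* "user.CacheState" is an illustrative name only */
            return vfs_setxattr(dentry, "user.CacheState",
                                state, sizeof(*state), 0);
    }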

> How does this work with write through network fs cache if the client system
> crashes but the write gets to the server?

The presumption is that the coherency info on the server will change, but
won't get updated in the cache.

> Client system gets restarted with older cached data because disk caches were
> not flushed before crash. Correct?  Is that case handled? Are the caches
> invalidated on unclean shutdown?

The netfs provides some coherency info for the cache to store.  For AFS, for
example, this is the data version number (though it should probably include
the volume creation time too).  This is stored with the state info in the same
xattr and is only updated when the "open" state is cleared.

When the cache file is reopened, if the coherency info doesn't match what
we're expecting (presumably we queried the server), the file is discarded.

(Note that the coherency info is netfs-specific)

> Anyway, how are those ordering requirements going to be handled when entire
> indexing is in a file? You'd practically need to re-implement a filesystem

Yes, the thought has occurred to me too.  I would be implementing a "simple"
filesystem - and we have lots of those:-/.  The most obvious solution is to
use the backing filesystem's metadata - except that that's not possible.

> journal or only write cache updates to a temp file that can be discarded at
> any time?

It might involve keeping a bitmap of "open" blocks.  Those blocks get
invalidated when the cache restarts.  The simplest solution would be to wipe
the entire cache in such a situation, but that goes against one of the
important features I want out of it.

Actually, a journal of open and closed blocks might be better, though all I
really need to store for each block is a 32-bit number.

It's a particular problem if I'm doing DIO to the data storage area but
buffering the changes to the metadata.  Further, the metadata and data might
be on different media, just to add to the complexity.

Another possibility is only to cull blocks when the parent file is culled.
That probably makes more sense: as long as the file is registered as culled on
disk first and I don't reuse the file slot too quickly, I can write to the
data store before updating the metadata.
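
A sketch of the open-block bitmap idea (names and sizes are illustrative;
persisting the bitmap in the right order relative to the data writes is the
part that actually matters and is not shown here):

    #include <linux/bitmap.h>

    #define CACHE_NR_BLOCKS (1 << 12)       /* illustrative: 4096 blocks */

    static DECLARE_BITMAP(cache_open_blocks, CACHE_NR_BLOCKS);

    /* Mark a block "open" before writing its data; persist the bitmap first. */
    static void cache_block_open(unsigned int block)
    {
            set_bit(block, cache_open_blocks);
    }

    /* Clear the block once its metadata has been committed; any block still
     * set after an unclean restart is treated as invalid.
     */
    static void cache_block_commit(unsigned int block)
    {
            clear_bit(block, cache_open_blocks);
    }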

> If you come up with a useful generic implementation of a "file data
> overlay", overlayfs could also use it for "partial copy up" as well as for
> implementation of address space operations, so please keep that in mind.

I'm trying to 

Re: fscache: Redesigning the on-disk cache

2021-03-05 Thread Amir Goldstein
On Thu, Mar 4, 2021 at 4:10 PM David Howells  wrote:
>
> I'm looking at redesigning the on-disk cache format used by fscache's
> cachefiles driver to try to cut down the number of synchronous metadata
> operations done by the driver, to improve culling performance and to reduce
> the number of opens and files held open.  I also need to stop relying on the
> backing filesystem to track where I have data stored.
>
> There are a number of options that I've considered:
>
>  (0) The current format lays out a directory tree, with directories for each
>  level of index (so in AFS's terms, you've got an overall "afs" dir
>  containing a dir for each cell.  In each cell dir, there's a dir for each
>  volume and within that there's a file for each afs vnode cached.)  Extra
>  levels of directory are also interposed to reduce the number of entries
>  in a directory.
>
>  - Pathwalk cost to open a cache file.
>  - Netfs coherency data is in xattrs.
>  - Invalidation done by truncate or unlink.
>  - Uses backing filesystem metadata to keep track of present data.
>- Determined by bmap() on the cache file.
>  - Culling performed by userspace daemon.
>  - Data file opened for every write.
>  - Read done by readpage without file.
>
>  (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
>   file for every whole operation (which may combine reads and writes).

I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
to keep track of present data, it's hard to distinguish between an invalid
cached range and a valid "cached hole".
With ->fiemap() you can at least make the distinction between a non existing
and an UNWRITTEN extent.
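
For illustration, an (0a)-style presence probe with SEEK_DATA might look like
the sketch below; as noted, it cannot tell a deliberately cached run of zeros
(a punched "cached hole") apart from a range that was simply never fetched:

    #include <linux/fs.h>

    /* Sketch: does the backing file contain any data in [start, end)? */
    static bool cache_range_has_data(struct file *backing, loff_t start, loff_t end)
    {
            loff_t pos = vfs_llseek(backing, start, SEEK_DATA);

            /* vfs_llseek() returns -ENXIO if there is no data at or after start */
            return pos >= 0 && pos < end;
    }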

>
>  (1) Structured the same as (0), but keeping an independent content map and
>  not relying on backing fs metadata.  Use a larger blocksize, say 256K, to
>  reduce the size of the content map.
>
>  - Netfs coherency data in xattrs.
>  - Invalidation done by tmpfile creation and link-replace.
>  - Content bitmap kept in xattr.
>- Limited capacity.  Could use multiple bitmaps.
>- Can skip the bitmap for a non-sparse file that we have all of.
>  - "Open" state kept in xattr.
>  - File is kept open.
>  - Culling performed by userspace daemon.
>  - Cache file open whilst netfs file is open.
>
>  (2) Structured the same as (1), but keeping an extent list instead of a
>  bitmap.
>
>  - Content extent map kept in xattr.
>- Limited capacity.
>- Highly scattered extents use a lot of map space.
>
>  (3) OpenAFS-style format.  One index file to look up {file_key,block#} and an
>  array of data files, each holding one block (e.g. a 256KiB-aligned chunk
>  of a file).  Each index entry has valid start/end offsets for easy
>  truncation.
>
>  The index has a hash to facilitate the lookup and an LRU that allows a
>  block to be recycled at any time.
>
>  - File keys are highly variable in length and can be rather long,
>particularly NFS FIDs.
>- Might want a separate file index that maps file keys to a slot ID
>  that can then be used in the block index.
>  - Netfs coherency data in vnode index entry.
>  - Invalidation done by clearing matching entries in the index.
>- Dead data files can be lazily unlinked or truncated or just
>  overwritten.
>  - Content mapping by lookup in block index hash table.
>- Fine if the hash table is large and scatter is good.
>  - Potential coherency problem between indices and data file.
>  - Culling performed by block index LRU.
>  - Really want to retain entire block index in RAM.
>  - Data files are opened for every read/write.
>
>  (4) Similar format to (3), but could put entirety of data in one file.
>
>  - Data file open entire time cache online.
>  - Unused block bitmap.
>  - Can use fallocate to punch out dead blocks.
>  - Could put data file on blockdev.
>
>  (5) Similar idea to (4), but just have a file index and use block pointers
>  and indirection blocks instead.  Use an LRU in the file index and cull
>  whole files only, not individual blocks.
>
>  - File keys are highly variable in length and can be rather long,
>particularly NFS FIDs.
>  - Netfs coherency data in vnode index entry.
>  - Unused data block bitmap.
>  - Invalidation done by clearing entries in the file index.
>- Data blocks must be recycled and returned to bitmap.
>- Dead data blocks can be lazily punched out with fallocate.
>  - Potential coherency problem between index, pointers/indirection and
>bitmap.
>  - Culling performed by file index LRU.
>  - Really want to retain entire file index and block bitmap in RAM.
>- May be less memory than block index.
>  - Data file open entire time cache online.
>  - Could put data file on blockdev.

fscache: Redesigning the on-disk cache - LRU handling

2021-03-04 Thread David Howells
David Howells  wrote:

> 
>  (3) OpenAFS-style format.  One index file to look up {file_key,block#} and an
>  array of data files, each holding one block (e.g. a 256KiB-aligned chunk
>  of a file).  Each index entry has valid start/end offsets for easy
>  truncation.
> 
>  The index has a hash to facilitate the lookup and an LRU that allows a
>  block to be recycled at any time.

The LRU would probably have to be a doubly-linked list so that entries can be
removed from it easily.  This means typically touching two other entries,
which might not be in the same page; further, if the entry is being freed,
we'd need to excise it from the hash chain also, necessitating touching maybe
two more entries - which might also be in different pages.

Maybe the LRU idea plus a free block bitmap could be combined, however.

 (1) Say that there's a bit-pair map, with one bit pair per block.  The pair
 is set to 0 when the block is free.  When the block is accessed, the pair
 is set to 3.

 (2) When we run out of free blocks (ie. pairs that are zero), we decrement
 all the pairs and then look again.

 (3) Excision from the old hash chain would need to be done at allocation,
 though it does give a block whose usage has been reduced to 0 the chance
 to be resurrected.

Possible variations on the theme could be:

 (*) Set the pair to 2, not 3 when accessed.  Set the block to 3 to pin it;
 the process of decrementing all the pairs would leave it at 3.

 (*) Rather than decrementing all pairs at once, have a rotating window that
 does a part of the map at once.

 (*) If a round of decrementing doesn't reduce any pairs to zero, reject a
 request for space.

This would also work for a file index.
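
For illustration, the bit-pair map could be packed four pairs to a byte along
the lines of this sketch of the base scheme (no locking shown; names are
illustrative; the pinning variation above would simply have the ageing sweep
skip pairs that are at 3):

    #include <linux/types.h>

    static u8 *pair_map;            /* 2 bits per block, 4 blocks per byte */
    static size_t nr_blocks;

    static unsigned int pair_get(size_t block)
    {
            return (pair_map[block / 4] >> ((block % 4) * 2)) & 3;
    }

    static void pair_set(size_t block, unsigned int val)
    {
            unsigned int shift = (block % 4) * 2;

            pair_map[block / 4] &= ~(3U << shift);
            pair_map[block / 4] |= (val & 3) << shift;
    }

    /* One ageing pass: decrement every non-zero pair.  Returns true if at
     * least one pair dropped to zero, i.e. a block became allocatable.
     */
    static bool pair_map_age(void)
    {
            bool freed = false;
            size_t i;

            for (i = 0; i < nr_blocks; i++) {
                    unsigned int v = pair_get(i);

                    if (v) {
                            pair_set(i, v - 1);
                            freed |= (v == 1);
                    }
            }
            return freed;
    }
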

David



fscache: Redesigning the on-disk cache

2021-03-03 Thread David Howells
I'm looking at redesigning the on-disk cache format used by fscache's
cachefiles driver to try to cut down the number of synchronous metadata
operations done by the driver, to improve culling performance and to reduce
the number of opens and files held open.  I also need to stop relying on the
backing filesystem to track where I have data stored.

There are a number of options that I've considered:

 (0) The current format lays out a directory tree, with directories for each
 level of index (so in AFS's terms, you've got an overall "afs" dir
 containing a dir for each cell.  In each cell dir, there's a dir for each
 volume and within that there's a file for each afs vnode cached.)  Extra
 levels of directory are also interposed to reduce the number of entries
 in a directory.

 - Pathwalk cost to open a cache file.
 - Netfs coherency data is in xattrs.
 - Invalidation done by truncate or unlink.
 - Uses backing filesystem metadata to keep track of present data.
   - Determined by bmap() on the cache file.
 - Culling performed by userspace daemon.
 - Data file opened for every write.
 - Read done by readpage without file.

 (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
  file for every whole operation (which may combine reads and writes).

 (1) Structured the same as (0), but keeping an independent content map and
 not relying on backing fs metadata.  Use a larger blocksize, say 256K, to
 reduce the size of the content map.

 - Netfs coherency data in xattrs.
 - Invalidation done by tmpfile creation and link-replace.
 - Content bitmap kept in xattr.
   - Limited capacity.  Could use multiple bitmaps.
   - Can skip the bitmap for a non-sparse file that we have all of.
 - "Open" state kept in xattr.
 - File is kept open.
 - Culling performed by userspace daemon.
 - Cache file open whilst netfs file is open.

 (2) Structured the same as (1), but keeping an extent list instead of a
 bitmap.

 - Content extent map kept in xattr.
   - Limited capacity.
   - Highly scattered extents use a lot of map space.

 (3) OpenAFS-style format.  One index file to look up {file_key,block#} and an
 array of data files, each holding one block (e.g. a 256KiB-aligned chunk
 of a file).  Each index entry has valid start/end offsets for easy
 truncation.

 The index has a hash to facilitate the lookup and an LRU that allows a
 block to be recycled at any time.

 - File keys are highly variable in length and can be rather long,
   particularly NFS FIDs.
   - Might want a separate file index that maps file keys to a slot ID
 that can then be used in the block index.
 - Netfs coherency data in vnode index entry.
 - Invalidation done by clearing matching entries in the index.
   - Dead data files can be lazily unlinked or truncated or just
 overwritten.
 - Content mapping by lookup in block index hash table.
   - Fine if the hash table is large and scatter is good.
 - Potential coherency problem between indices and data file.
 - Culling performed by block index LRU.
 - Really want to retain entire block index in RAM.
 - Data files are opened for every read/write.

 (4) Similar format to (3), but could put entirety of data in one file.

 - Data file open entire time cache online.
 - Unused block bitmap.
 - Can use fallocate to punch out dead blocks.
 - Could put data file on blockdev.

 (5) Similar idea to (4), but just have a file index and use block pointers
 and indirection blocks instead.  Use an LRU in the file index and cull
 whole files only, not individual blocks.

 - File keys are highly variable in length and can be rather long,
   particularly NFS FIDs.
 - Netfs coherency data in vnode index entry.
 - Unused data block bitmap.
 - Invalidation done by clearing entries in the file index.
   - Data blocks must be recycled and returned to bitmap.
   - Dead data blocks can be lazily punched out with fallocate.
 - Potential coherency problem between index, pointers/indirection and
   bitmap.
 - Culling performed by file index LRU.
 - Really want to retain entire file index and block bitmap in RAM.
   - May be less memory than block index.
 - Data file open entire time cache online.
 - Could put data file on blockdev.
 - If the block size is large, lots of dead space in indirection blocks.

 (6) Similar to (5), but use extent lists rather than indirection blocks.

 - Requires allocation of contiguous space to be worthwhile.
 - Buddy allocator approach?
   - Can always arbitrarily recycle buddies to make larger spaces - if we
 can find them...

 (7) Hybrid approach.  Stick the first block of every netfs file in one big
 cache file.  For a lot of cases, that would suffice for the entire file
 if the block