Re: Linux Buffer Cache Does Not Support Mirroring

1999-11-02 Thread Ingo Molnar


On Mon, 1 Nov 1999 [EMAIL PROTECTED] wrote:

 XFS on Irix caches file data in buffers, but not in the regular buffer cache,
 they are cached off the vnode and organized by logical file offset rather
 than by disk block number, the memory in these buffers comes from the page
 subsystem, the page tag being the vnode and file offset. These buffers do
 not have to have a physical disk block associated with them, XFS allows you
 to reserve blocks on the disk for a file without picking which blocks. At
 some point when the data needs to be written (memory pressure, or sync
 activity etc), the filesystem is asked to allocate physical blocks for the
 data, these are associated with the buffers and they get written out.
 Delaying the allocation allows us to collect together multiple small writes
 into one big allocation request. It also means that we can bypass allocation
 altogether if the file is truncated before it is flushed to disk.

the new 2.3 pagecache should enable this almost out of the box. Apart from
memory pressure issues, the missing bit is to split up fs->get_block()
into a 'soft' and a 'real' allocation branch. This means that whenever the
pagecache creates a new dirty page, it calls the 'soft' get_block()
variant, which is very fast and just bumps up some counters within XFS (so
we do not get asynchronous out-of-space conditions). Then whenever
ll_rw_block() (or bdflush) sees a !buffer_mapped() but buffer_allocated()
block, it will call the 'real' low-level handler to do the allocation for
real. 

i kept this in mind all along when doing the pagecache changes, and i
intend to do this for ext2fs. Splitting up get_block() is easy without
breaking filesystems: the last 'create' parameter can be made '2' to mean
'lazy create'. 

note that not all filesystems can know in advance how much space a new
inode block will take, but this is not a problem: the lazy allocator can
safely 'overestimate' space needs. 
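
to make the proposed split concrete, here is a minimal standalone sketch
of the idea (illustrative C only, not the actual 2.3 interface; all names
are invented): the 'soft' branch just takes a reservation, and the flush
path calls the 'real' allocator for any dirty block that is still
unmapped.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of one logical block in a file. */
struct blk {
    bool dirty;     /* the page cache wrote into it           */
    bool mapped;    /* a physical disk block has been chosen  */
    bool delayed;   /* space reserved, allocation postponed   */
    long phys;      /* physical block number once mapped      */
};

static long reserved_blocks;    /* fs-wide reservation counter */
static long next_free = 1000;   /* toy allocator cursor        */

/* 'soft' get_block (the create == 2 case): called when the page cache
 * dirties a new page.  Cheap: it only reserves space so we cannot hit an
 * asynchronous out-of-space condition later; it may safely overestimate. */
static int get_block_soft(struct blk *b)
{
    reserved_blocks++;
    b->delayed = true;
    return 0;
}

/* 'real' get_block: called from the flush path (bdflush/ll_rw_block in
 * the proposal) for blocks that are dirty but not yet mapped. */
static int get_block_real(struct blk *b)
{
    b->phys = next_free++;      /* pick an actual disk block now */
    b->mapped = true;
    b->delayed = false;
    reserved_blocks--;          /* the reservation is consumed   */
    return 0;
}

static void flush(struct blk *b)
{
    if (b->dirty && !b->mapped && b->delayed)
        get_block_real(b);
    printf("writing logical block to physical block %ld\n", b->phys);
    b->dirty = false;
}

int main(void)
{
    struct blk b = { 0 };

    b.dirty = true;             /* the page cache created a dirty page */
    get_block_soft(&b);         /* the 'lazy create' branch            */
    flush(&b);                  /* allocation only happens here        */
    return 0;
}

the point of the sketch is only that no physical block is chosen until
the flush; everything before that is bookkeeping.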

is this the kind of interface you need for XFS? i can make a prototype
patch for ext2fs (and the pagecache and bdflush), which should be easy to
adapt for XFS. 

-- mingo



Re: Linux Buffer Cache Does Not Support Mirroring

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Mon, 01 Nov 1999 15:53:29 -0500, Jeff Garzik
[EMAIL PROTECTED] said:

 XFS delays allocation of user data blocks when possible to
 make blocks more contiguous; holding them in the buffer cache.
 This allows XFS to make extents large without requiring the user
 to specify extent size, and without requiring a filesystem
 reorganizer to fix the extent sizes after the fact. This also
 reduces the number of writes to disk and extents used for a file. 

 Is this sort of manipulation possible with the existing buffer cache?

Absolutely not, but it is not hard with the page cache.  The main thing
missing is a VM callback to allow memory pressure to force unallocated,
pinned pages to disk.

--Stephen



Re: Linux Buffer Cache Does Not Support Mirroring

1999-11-01 Thread Jeff Garzik

(moved to linux-fsdevel)

SGI's XFS white paper[1] describes

 XFS delays allocation of user data blocks when possible to
 make blocks more contiguous; holding them in the buffer cache.
 This allows XFS to make extents large without requiring the user
 to specify extent size, and without requiring a filesystem
 reorganizer to fix the extent sizes after the fact. This also
 reduces the number of writes to disk and extents used for a file. 

Is this sort of manipulation possible with the existing buffer cache?

Regards,

Jeff



[1] http://www.sgi.com/Technology/xfs-whitepaper.html



Re: Linux Buffer Cache Does Not Support Mirroring

1999-11-01 Thread Alexander Viro



On Mon, 1 Nov 1999 [EMAIL PROTECTED] wrote:

 I agree with this, it feels closer to the linux page cache, the terminology in
 the XFS white paper is a little confusing here.
 
 XFS on Irix caches file data in buffers, but not in the regular buffer cache,
 they are cached off the vnode and organized by logical file offset rather
 than by disk block number, the memory in these buffers comes from the page
 subsystem, the page tag being the vnode and file offset. These buffers do

Then it _is_ page cache.

 not have to have a physical disk block associated with them, XFS allows you
 to reserve blocks on the disk for a file without picking which blocks. At

It can be done with the 2.3 pagecache. You will need a new function in
buffer.c, but that's about it. Just make sure that ->writepage() does
allocation for fragments that are not mapped to disk blocks, and pass the
new (lazy) variant of block_write_partial_page to generic_file_write().
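
A rough illustration of what such a ->writepage() would do (a toy
standalone sketch, not the 2.3 buffer.c API; the structures and helpers
are invented): before the page goes out, any fragment written through the
lazy path and still lacking a disk block gets mapped.

#include <stdbool.h>
#include <stdio.h>

#define BUFFERS_PER_PAGE 4              /* e.g. a 4K page with 1K blocks */

struct toy_buffer {
    bool dirty;
    bool mapped;                        /* already has a physical block */
    long phys;
};

struct toy_page {
    struct toy_buffer b[BUFFERS_PER_PAGE];
};

static long next_free = 5000;           /* toy block allocator cursor */

/* Allocate a real disk block for a fragment written via the lazy path. */
static void allocate_block(struct toy_buffer *tb)
{
    tb->phys = next_free++;
    tb->mapped = true;
}

/* Model of a writepage() that finishes delayed allocation at I/O time. */
static int toy_writepage(struct toy_page *page)
{
    int i;

    for (i = 0; i < BUFFERS_PER_PAGE; i++) {
        struct toy_buffer *tb = &page->b[i];

        if (!tb->dirty)
            continue;
        if (!tb->mapped)
            allocate_block(tb);         /* delayed allocation happens here */
        printf("write fragment %d -> physical block %ld\n", i, tb->phys);
        tb->dirty = false;
    }
    return 0;
}

int main(void)
{
    struct toy_page p = { 0 };

    p.b[0].dirty = true;                /* written via the lazy path */
    p.b[1].dirty = true;
    return toy_writepage(&p);
}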

 some point when the data needs to be written (memory pressure, or sync
 activity etc), the filesystem is asked to allocate physical blocks for the
 data, these are associated with the buffers and they get written out.

Yup.

 Delaying the allocation allows us to collect together multiple small writes
 into one big allocation request. It also means that we can bypass allocation
 altogether if the file is truncated before it is flushed to disk.

Hmm... It may make sense for other filesystems too - all allocators tend
to work better if they are asked to grab big lumps... Interesting.

But yes, the thing you are describing _is_ the page cache.



Re: Linux Buffer Cache Does Not Support Mirroring

1999-11-01 Thread lord

 
 
 On Mon, 1 Nov 1999, Jeff Garzik wrote:
 
  (moved to linux-fsdevel)
  
  SGI's XFS white paper[1] describes
  
   XFS delays allocation of user data blocks when possible to
   make blocks more contiguous; holding them in the buffer cache.
   This allows XFS to make extents large without requiring the user
   to specify extent size, and without requiring a filesystem
   reorganizer to fix the extent sizes after the fact. This also
   reduces the number of writes to disk and extents used for a file. 
  
  Is this sort of manipulation possible with the existing buffer cache?
 
 AFAICS it is possible with the 2.3 page cache.

I agree with this; it feels closer to the Linux page cache, and the
terminology in the XFS white paper is a little confusing here.

XFS on Irix caches file data in buffers, but not in the regular buffer
cache; they are cached off the vnode and organized by logical file offset
rather than by disk block number. The memory in these buffers comes from
the page subsystem, the page tag being the vnode and file offset. These
buffers do not have to have a physical disk block associated with them:
XFS allows you to reserve blocks on the disk for a file without picking
which blocks. At some point when the data needs to be written (memory
pressure, sync activity, etc.), the filesystem is asked to allocate
physical blocks for the data; these are associated with the buffers and
they get written out. Delaying the allocation allows us to collect
multiple small writes together into one big allocation request. It also
means that we can bypass allocation altogether if the file is truncated
before it is flushed to disk.
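
As a rough illustration of the accounting described above (a standalone
toy model, not XFS code): writes only take a reservation, a flush turns
the whole delayed range into a single allocation request, and a truncate
before the flush simply hands the reservation back without ever touching
the allocator.

#include <stdio.h>

static long free_blocks = 1000;     /* toy free-space counter          */
static long reserved;               /* blocks reserved but not placed  */

/* Reserve space at write time without choosing physical blocks. */
static int reserve(long nblocks)
{
    if (free_blocks - reserved < nblocks)
        return -1;                  /* would run out of space          */
    reserved += nblocks;
    return 0;
}

/* Flush: one allocation request for the whole delayed range. */
static void flush_range(long nblocks)
{
    printf("allocating %ld contiguous blocks in one request\n", nblocks);
    free_blocks -= nblocks;
    reserved -= nblocks;
}

/* Truncate before flush: no allocation ever happened, just undo. */
static void truncate_range(long nblocks)
{
    reserved -= nblocks;
}

int main(void)
{
    /* Three small writes become a single 12-block allocation. */
    reserve(4);
    reserve(4);
    reserve(4);
    flush_range(12);

    /* A file written and truncated before flush never hits the allocator. */
    reserve(8);
    truncate_range(8);

    printf("free=%ld reserved=%ld\n", free_blocks, reserved);
    return 0;
}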


Steve

--
Steve Lord  voice: +1-651-683-5291
Silicon Graphics Inc
655F Lone Oak Drive email: [EMAIL PROTECTED]
Eagan, MN, 55121, USA
--




Re: Linux Buffer Cache Does Not Support Mirroring

1999-11-01 Thread Jeff V. Merkey


Jeff,

It's worth a look.  I'll look at it and let you know what I think.

Jeff


Jeff Garzik wrote:
 
 (moved to linux-fsdevel)
 
 SGI's XFS white paper[1] describes
 
  XFS delays allocation of user data blocks when possible to
  make blocks more contiguous; holding them in the buffer cache.
  This allows XFS to make extents large without requiring the user
  to specify extent size, and without requiring a filesystem
  reorganizer to fix the extent sizes after the fact. This also
  reduces the number of writes to disk and extents used for a file.
 
 Is this sort of manipulation possible with the existing buffer cache?
 
 Regards,
 
 Jeff
 
 [1] http://www.sgi.com/Technology/xfs-whitepaper.html



Re: Linux Buffer Cache Does Not Support Mirroring

1999-11-01 Thread Stephen C. Tweedie

Hi,

On Mon, 01 Nov 1999 15:58:33 -0600, [EMAIL PROTECTED] said:

 I agree with this, it feels closer to the linux page cache, the
 terminology in the XFS white paper is a little confusing here.

 XFS on Irix caches file data in buffers, but not in the regular buffer
 cache, they are cached off the vnode and organized by logical file
 offset rather than by disk block number, 

This is describing the job done by the page cache, not the buffer cache,
in Linux.

 the memory in these buffers comes from the page subsystem, the page
 tag being the vnode and file offset. These buffers do not have to have
 a physical disk block associated with them, XFS allows you to reserve
 blocks on the disk for a file without picking which blocks. At some
 point when the data needs to be written (memory pressure, or sync
 activity etc)

The main thing we'd want to establish to support this fully in Linux is
exactly this --- what VM callbacks do you need here?  Memory pressure is
not currently something that gets fed back adequately to the
filesystems, and I'll be needing similar feedback for ext2 journaling
(we need to be told to do early commits if the memory used by a
transaction is required elsewhere).
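
A minimal toy model of the kind of feedback being asked for here
(illustrative only; no such interface existed, and the names are
invented): the VM calls a registered callback with how much memory it
wants back, and a journaling filesystem answers by committing its running
transaction early.

#include <stdio.h>

/* Toy journaling state: memory pinned by the running transaction. */
static long pinned_kb = 512;

/* Invented callback type: the VM asks for 'want_kb' of memory back. */
typedef long (*pressure_cb)(long want_kb);

/* A journaling filesystem commits early and releases pinned memory. */
static long journal_pressure(long want_kb)
{
    long freed = pinned_kb < want_kb ? pinned_kb : want_kb;

    printf("early commit: releasing %ld KB of transaction memory\n", freed);
    pinned_kb -= freed;
    return freed;
}

/* What the VM side might do when memory gets tight. */
static void vm_under_pressure(pressure_cb cb, long want_kb)
{
    long got = cb(want_kb);

    printf("VM reclaimed %ld KB from the filesystem\n", got);
}

int main(void)
{
    vm_under_pressure(journal_pressure, 256);
    return 0;
}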

The details of the IOs themselves should be capable of being handled
perfectly well within the existing device driver abstraction: you don't
need to use the buffer cache.

--Stephen