Re: Buffer and page cache

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Tue, 02 Nov 1999 08:15:36 -0700, [EMAIL PROTECTED] said:

 I'd like these pages to age a little before handing them over to the
 "inode disk", because the "write_one_page" function called by
 generic_file_write would incur significant latency if the inode disk is
 "real", ie. not simulated in the same system.

The write-page method is only required to queue the data for writing to
the media.  It is not required to complete the physical IO, so the
filesystem can use any mechanism it likes to keep those pages queued for
eventual physical IO (just as 2.3 uses the buffer lists to queue that
data for eventual writeback via bdflush).

 So we have a page cache for the inodes in the file system where the
 pages become dirty - but no buffers are attached.  It is reminiscent
 of a shared mapping, but there is no vma for the pages.

Fine.

 What appears to be needed is the following - probably it's mostly a gap
 in my understanding, but I'd appreciate advice on how to attack the
 following points:

 - a bit to keep shrink_mmap away from the page.  

Yes, bumping the page count is the perfect way to do this.

 - a bit for a struct page that indicates the page needs to be written.
 Looking at block_write_full_page, one might think that the PageUptoDate
 bit is the one to use.  But does that really indicate that this page is
 "dirty", the way it is done for buffers?

PageUpToDate can't be used: it is needed to flag whether the contents of
the page are valid for a read.  A written page must always be uptodate:
!uptodate implies that we have created the page but are still reading it
in from disk (or that the read-in failed for some reason).

 - some indication of aging: we would like a pgflush daemon to walk the
 dirty pages of the file system and write them back _after_ a little
 while.

The fs should be able to manage that on its own.  If you queue all of
the pages which have been sent to the writepage() method, then you can
flush to the physical disk whenever you want.  A trivial bdflush
lookalike in the fs itself can deal with that.

You might well want a filesystem-private pointer in the page struct off
which to hook any fs-specific data (such as your dirty page linked list
pointers and the dirty flag).  You will also need a way for the VM to
exert memory pressure on those pages if it needs to reclaim memory.

These are both things which ext3 will want anyway, so we should make
sure that any infrastructure that gets put in place for this gets
reviewed by all the different fs groups first.

--Stephen



Buffer and page cache

1999-11-02 Thread braam

Hi,

I'm working on a file system which talks to an "inode disk"; the storage
industry calls these object-based disks.  A simulated object-based disk
can be constructed from the lower half of ext2 (or any other file system,
for that matter).

The file system has no knowledge of disk blocks, and solely uses the
page cache.  

I'd like these pages to age a little before handing them over to the
"inode disk", because the "write_one_page" function called by
generic_file_write would incur significant latency if the inode disk is
"real", ie. not simulated in the same system.

So we have a page cache for the inodes in the file system where the
pages become dirty - but no buffers are attached.  It is reminiscent of
a shared mapping, but there is no vma for the pages.

What appears to be needed is the following - probably it's mostly a gap
in my understanding, but I'd appreciate advice on how to attack the
following points:

- a bit to keep shrink_mmap away from the page.  When the file system
writes into this page, we need to change its state so that it doesn't
get thrown out afterwards.  We could "get" the page for this purpose.
Locking is not good, since we may need to write to the page again.

- a bit for a struct page that indicates the page needs to be written.
Looking at block_write_full_page, one might think that the PageUptoDate
bit is the one to use.  But does that really indicate that this page is
"dirty", the way it is done for buffers?

- some indication of aging: we would like a pgflush daemon to walk the
dirty pages of the file system and write them back _after_ a little
while.

The construction should hopefully be capable of supporting Stephen's
journaling extensions too, but I can't survey everything in one go
(he probably can).

Any advice would be appreciated!

Now, why are we doing this?

Effectively we have split ext2 into an upper half (the file system) and
a lower half (the object-based device driver).

For cluster file systems it does seem an attractive division of labor to
let the drive do the allocation and have the clustered file system only
share inode metadata and data blocks.  So the block and inode allocation
metadata is not spread around the cluster.  This saves locks and traffic
and, perhaps most importantly, complexity.

You can find some preliminary code at:
ftp://carissimi.coda.cs.cmu.edu/pub/obd, but currently it writes through
to the disk and doesn't cluster yet.  Hence this message. 

- Peter -