Re: Buffer and page cache
Hi,

On Tue, 02 Nov 1999 08:15:36 -0700, [EMAIL PROTECTED] said:

> I'd like these pages to age a little before handing them over to the
> "inode disk", because the "write_one_page" function called by
> generic_file_write would incur significant latency if the inode disk
> is "real", i.e. not simulated on the same system.

The writepage method is only required to queue the data for writing to the media. It is not required to complete the physical IO, so the filesystem can use any mechanism it likes to keep those pages queued for eventual physical IO (just as 2.3 uses the buffer lists to queue that data for eventual writeback via bdflush).

> So we have a page cache for the inodes in the file system where the
> pages become dirty - but no buffers are attached. It is reminiscent
> of a shared mapping, but there is no vma for the pages.

Fine.

> What appears to be needed is the following - probably it's mostly
> lacking in my understanding, but I'd appreciate being advised how to
> attack these points:
>
> - a bit to keep shrink_mmap away from the page.

Yes, bumping the page count is the perfect way to do this.

> - a bit in struct page that indicates the page needs to be written.
>   From block_write_full_page one could think that the PageUptodate
>   bit is maybe the one to use. But does that really describe that
>   this page is "dirty", as it does for buffers?

PageUptodate can't be used: it is needed to flag whether the contents of the page are valid for a read. A written page must always be uptodate: !uptodate implies that we have created the page but are still reading it in from disk (or that the read-in failed for some reason).

> - some indication of aging: we would like a pgflush daemon to walk
>   the dirty pages of the file system and write them back _after_ a
>   little while.

The fs should be able to manage that on its own. If you queue all of the pages which have been sent to the writepage() method, then you can flush to the physical disk whenever you want. A trivial bdflush lookalike in the fs itself can deal with that (the sketch below shows the queueing side).

You might well want a filesystem-private pointer in the page struct off which to hook any fs-specific data (such as your dirty-page linked list pointers and the dirty flag). You will also need a way for the VM to exert memory pressure on those pages if it needs to reclaim memory. These are both things which ext3 will want anyway, so we should make sure that any infrastructure that gets put in place for this gets reviewed by all the different fs groups first.

--Stephen
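To make that concrete, here is a minimal sketch of the queue-only writepage scheme described above. All of the obd_* names and the per-page record are hypothetical, and the exact writepage() prototype shifted during the 2.3 series, so treat this as an illustration of the mechanism rather than working code:

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/list.h>

/* Per-page bookkeeping hung off a filesystem-private pointer: the
 * dirty-list linkage and the time the page was queued, for aging. */
struct obd_page {                       /* hypothetical */
        struct list_head list;          /* links the page onto the fs dirty list */
        struct page *page;
        unsigned long dirtied;          /* jiffies when queued */
};

static LIST_HEAD(obd_dirty_pages);
static spinlock_t obd_dirty_lock = SPIN_LOCK_UNLOCKED;

/* writepage only queues the page: no physical IO happens here, so
 * generic_file_write never waits on the "real" inode disk.  A full
 * version would first check whether the page is already queued. */
static int obd_writepage(struct file *file, struct page *page)
{
        struct obd_page *op = kmalloc(sizeof(*op), GFP_KERNEL);

        if (!op)
                return -ENOMEM;
        atomic_inc(&page->count);       /* pin it: keeps shrink_mmap away */
        op->page = page;
        op->dirtied = jiffies;
        spin_lock(&obd_dirty_lock);
        list_add(&op->list, &obd_dirty_pages);
        spin_unlock(&obd_dirty_lock);
        return 0;
}

The atomic_inc on page->count is Stephen's "bump the page count"; nothing else stops shrink_mmap from stealing the page while it sits on the queue.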
Buffer and page cache
Hi,

I'm working on a file system which talks to an "inode disk"; the storage industry calls these object based disks. A simulated object based disk can be constructed from the lower half of ext2 (or any other file system, for that matter). The file system has no knowledge of disk blocks and uses the page cache exclusively.

I'd like these pages to age a little before handing them over to the "inode disk", because the "write_one_page" function called by generic_file_write would incur significant latency if the inode disk is "real", i.e. not simulated on the same system.

So we have a page cache for the inodes in the file system where the pages become dirty - but no buffers are attached. It is reminiscent of a shared mapping, but there is no vma for the pages.

What appears to be needed is the following - probably it's mostly lacking in my understanding, but I'd appreciate being advised how to attack these points:

- a bit to keep shrink_mmap away from the page. When the file system writes in this page, we need to change its state so that it doesn't get thrown out afterwards. We could "get" the page for this purpose. Locking is no good, since we may need to write to the page again.

- a bit in struct page that indicates the page needs to be written. From block_write_full_page one could think that the PageUptodate bit is maybe the one to use. But does that really describe that this page is "dirty", as it does for buffers?

- some indication of aging: we would like a pgflush daemon to walk the dirty pages of the file system and write them back _after_ a little while (a sketch of such a daemon follows at the end of this message).

The construction should hopefully be capable of supporting Stephen's journaling extensions too, but I can't oversee everything in one blow (he probably can). Any advice would be appreciated!

Now, why are we doing this? Effectively we have split ext2 into an upper half (the file system) and a lower half (the object based device driver). For cluster file systems it does seem an attractive division of labor to let the drive do the allocation and have the clustered file system share only inode metadata and data blocks. That way the block and inode allocation metadata is not spread around the cluster, which saves locks and traffic and, perhaps most importantly, complexity.

You can find some preliminary code at ftp://carissimi.coda.cs.cmu.edu/pub/obd, but currently it writes through to the disk and doesn't cluster yet. Hence this message.

- Peter -
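Building directly on the declarations in the sketch after Stephen's reply, the pgflush daemon asked for above could be the trivial bdflush lookalike Stephen suggests: a kernel thread, started at mount time with kernel_thread(obd_pgflushd, NULL, 0), that periodically moves sufficiently aged entries off the dirty list and only then performs the slow IO. Again, obd_pgflushd, OBD_AGE and obd_write_one_page are made-up names, not existing kernel interfaces:

/* Flush queued pages back to the inode disk once they have aged. */
#define OBD_AGE (5*HZ)                  /* hypothetical 5-second aging interval */

static int obd_pgflushd(void *unused)
{
        for (;;) {
                LIST_HEAD(old);
                struct list_head *entry, *next;

                /* Move everything old enough onto a private list... */
                spin_lock(&obd_dirty_lock);
                for (entry = obd_dirty_pages.next;
                     entry != &obd_dirty_pages; entry = next) {
                        struct obd_page *op =
                                list_entry(entry, struct obd_page, list);
                        next = entry->next;
                        if (jiffies - op->dirtied >= OBD_AGE) {
                                list_del(&op->list);
                                list_add(&op->list, &old);
                        }
                }
                spin_unlock(&obd_dirty_lock);

                /* ...then do the slow physical IO with the lock dropped. */
                while (!list_empty(&old)) {
                        struct obd_page *op =
                                list_entry(old.next, struct obd_page, list);
                        list_del(&op->list);
                        obd_write_one_page(op->page);   /* IO to the inode disk */
                        atomic_dec(&op->page->count);   /* unpin for shrink_mmap */
                        kfree(op);
                }

                current->state = TASK_INTERRUPTIBLE;
                schedule_timeout(OBD_AGE);
        }
        return 0;       /* never reached */
}

Stephen's memory-pressure point is the missing piece here: the VM would also need some hook to make this thread flush early when it wants those pinned pages back.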