Re: [RFC] fsblock
On Sun, 24 Jun 2007, Nick Piggin wrote:

Firstly, what is the buffer layer? The buffer layer isn't really a buffer layer as in the buffer cache of unix: the block device cache is unified with the pagecache (in terms of the pagecache, a blkdev file is just like any other, but with a 1:1 mapping between offset and block).

I thought that the buffer layer is essentially a method to index sub-sections of a page?

Why rewrite the buffer layer? Lots of people have had a desire to completely rip out the buffer layer, but we can't do that[*] because it does actually serve a useful purpose. Why the bad rap? Because the code is old and crufty, and buffer_head is an awful name. It must be among the oldest code in the core fs/vm, and the main reason is because of the inertia of so many and such complex filesystems.

Hmmm, I did not notice that yet, but then I have not done much work there.

- data structure size. struct fsblock is 20 bytes on 32-bit, and 40 on 64-bit (could easily be 32 if we can have int bitops). Compare this to around 50 and 100ish for struct buffer_head. With a 4K page and 1K blocks, IO requires 10% RAM overhead in buffer heads alone. With fsblocks you're down to around 3%.

I thought we were going to simply use the page struct instead of having buffer heads? Would that not reduce the overhead to zero?

- Structure packing. A page gets a number of buffer heads that are allocated in a linked list. fsblocks are allocated contiguously, so cacheline footprint is smaller in the above situation.

Good idea.

- A real nobh mode. nobh was created I think mainly to avoid problems with buffer_head memory consumption, especially on lowmem machines. It is basically a hack (sorry), which requires special code in filesystems, and duplication of quite a bit of tricky buffer layer code (and bugs). It also doesn't work so well for buffers with non-trivial private data (like most journalling ones). fsblock implements this with basically a few lines of code, and it should work in situations like ext3.

Hmmm. That means simply using the page struct is not working...

- Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We can go up to about 32MB blocks now, and gigabyte+ blocks would only require one more bit in the fsblock flags. fsblock_superpage blocks are > PAGE_CACHE_SIZE, midpage ==, and subpage <. Core pagecache code is pretty creaky with respect to this. I think it is mostly race free, but it requires stupid unlocking and relocking hacks because the vm usually passes single locked pages to the fs layers, and we need to lock all pages of a block in offset ascending order. This could be avoided by doing locking on only the first page of a block for locking in the fsblock layer, but that's a bit scary too. Probably better would be to move towards offset,length rather than page based fs APIs where everything can be batched up nicely and this sort of non-trivial locking can be more optimal. Large blocks also have a performance black spot where an 8K sized and aligned write(2) would require an RMW in the filesystem. Again because of the page based nature of the fs API, and this too would be fixed if the APIs were better.

The simple solution would be to use a compound page and make the head page represent the status of all the pages in the vm. Logic for that is already in place.

Large block memory access via filesystem uses vmap, but it will go back to kmap if the access doesn't cross a page.
Filesystems really should do this because vmap is slow as anything. I've implemented a vmap cache which basically wouldn't work on 32-bit systems (because of limited vmap space) for performance testing (and yes, it sometimes tries to unmap in interrupt context, I know, I'm using loop). We could possibly do a self limiting cache, but I'd rather build some helpers to hide the raw multi page access for things like bitmap scanning and bit setting etc. and avoid too many vmaps.

Argh. No. Too much overhead.

So. Comments? Is this something we want? If yes, then how would we transition from buffer.c to fsblock.c?

I think many of the ideas are great, but the handling of large pages is rather strange. I would suggest using compound pages to represent larger pages and relying on Mel Gorman's antifrag/compaction work to get you the contiguous memory locations instead of using vmap. This may significantly simplify your patchset and avoid changes to the filesystem API. It's still pretty invasive though, and I am not sure that there is enough benefit from this one.
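To make the memory overhead figures quoted above concrete, here is the arithmetic implied by the structure sizes given in the RFC (~100 bytes per buffer_head and 40 per fsblock on 64-bit, 20 on 32-bit):

    4K page, 1K blocks  =>  4 per-block structures per page
    buffer_head (64-bit):  4 x ~100 bytes = ~400 bytes / 4096  =>  ~10%
    fsblock     (64-bit):  4 x   40 bytes =  160 bytes / 4096  =>   ~4%
    fsblock     (32-bit):  4 x   20 bytes =   80 bytes / 4096  =>   ~2%

which brackets the "around 3%" claim between the 32-bit and 64-bit cases.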
Re: [RFC] fsblock
On Mon, Jul 09, 2007 at 10:14:06AM -0700, Christoph Lameter wrote:

On Sun, 24 Jun 2007, Nick Piggin wrote: Firstly, what is the buffer layer? The buffer layer isn't really a buffer layer as in the buffer cache of unix: the block device cache is unified with the pagecache (in terms of the pagecache, a blkdev file is just like any other, but with a 1:1 mapping between offset and block).

I thought that the buffer layer is essentially a method to index sub-sections of a page?

It converts pagecache addresses to block addresses, I guess. The current implementation cannot handle blocks larger than pages, but not because use of larger pages for pagecache was anticipated (likely because it is more work, and the APIs aren't really set up for it).

Why rewrite the buffer layer? Lots of people have had a desire to completely rip out the buffer layer, but we can't do that[*] because it does actually serve a useful purpose. Why the bad rap? Because the code is old and crufty, and buffer_head is an awful name. It must be among the oldest code in the core fs/vm, and the main reason is because of the inertia of so many and such complex filesystems.

Hmmm, I did not notice that yet, but then I have not done much work there.

Notice what?

- data structure size. struct fsblock is 20 bytes on 32-bit, and 40 on 64-bit (could easily be 32 if we can have int bitops). Compare this to around 50 and 100ish for struct buffer_head. With a 4K page and 1K blocks, IO requires 10% RAM overhead in buffer heads alone. With fsblocks you're down to around 3%.

I thought we were going to simply use the page struct instead of having buffer heads? Would that not reduce the overhead to zero?

What do you mean by that? As I said, you couldn't use just the page struct for anything except page sized blocks, and even then it would require more fields or at least more flags in the page struct. nobh mode actually tries to do something similar, however it requires multiple calls into the filesystem to first allocate the block, and then find its sector. It is also buggy and can't handle errors properly (although I'm trying to fix that).

- A real nobh mode. nobh was created I think mainly to avoid problems with buffer_head memory consumption, especially on lowmem machines. It is basically a hack (sorry), which requires special code in filesystems, and duplication of quite a bit of tricky buffer layer code (and bugs). It also doesn't work so well for buffers with non-trivial private data (like most journalling ones). fsblock implements this with basically a few lines of code, and it should work in situations like ext3.

Hmmm. That means simply using the page struct is not working...

I don't understand you. jbd needs to attach private data to each bh, and that can stay around for longer than the life of the page in the pagecache.

- Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We can go up to about 32MB blocks now, and gigabyte+ blocks would only require one more bit in the fsblock flags. fsblock_superpage blocks are > PAGE_CACHE_SIZE, midpage ==, and subpage <. Core pagecache code is pretty creaky with respect to this. I think it is mostly race free, but it requires stupid unlocking and relocking hacks because the vm usually passes single locked pages to the fs layers, and we need to lock all pages of a block in offset ascending order. This could be avoided by doing locking on only the first page of a block for locking in the fsblock layer, but that's a bit scary too.
Probably better would be to move towards offset,length rather than page based fs APIs where everything can be batched up nicely and this sort of non-trivial locking can be more optimal. Large blocks also have a performance black spot where an 8K sized and aligned write(2) would require an RMW in the filesystem. Again because of the page based nature of the fs API, and this too would be fixed if the APIs were better.

The simple solution would be to use a compound page and make the head page represent the status of all the pages in the vm. Logic for that is already in place.

I do not consider that a solution because I explicitly want to allow order-0 pages here. I know about your higher order pagecache, the anti-frag and defrag work, I know about compound pages. I'm not just ignoring them because of NIH or something silly. Anyway, I have thought about just using the first page in the block for the locking, and that might be a reasonable optimisation. However for now I'm keeping it simple.

Large block memory access via filesystem uses vmap, but it will go back to kmap if the access doesn't cross a page. Filesystems really should do this because vmap is slow as anything. I've implemented a vmap cache which
Re: [RFC] fsblock
On Tue, 10 Jul 2007, Nick Piggin wrote:

Hmmm, I did not notice that yet, but then I have not done much work there.

Notice what?

The bad code for the buffer heads.

- A real nobh mode. nobh was created I think mainly to avoid problems with buffer_head memory consumption, especially on lowmem machines. It is basically a hack (sorry), which requires special code in filesystems, and duplication of quite a bit of tricky buffer layer code (and bugs). It also doesn't work so well for buffers with non-trivial private data (like most journalling ones). fsblock implements this with basically a few lines of code, and it should work in situations like ext3.

Hmmm. That means simply using the page struct is not working...

I don't understand you. jbd needs to attach private data to each bh, and that can stay around for longer than the life of the page in the pagecache.

Right. So just using the page struct alone won't work for the filesystems.

There are no changes to the filesystem API for large pages (although I am adding a couple of helpers to do page based bitmap ops). And I don't want to rely on contiguous memory. Why do you think handling of large pages (presumably you mean larger than page sized blocks) is strange?

We already have a way to handle large pages: compound pages.

Conglomerating the constituent pages via the pagecache radix-tree seems logical to me.

Meaning overhead to handle each page still exists? This scheme cannot handle large contiguous blocks as a single entity?
Re: [RFC] fsblock
On Mon, Jul 09, 2007 at 05:59:47PM -0700, Christoph Lameter wrote:

On Tue, 10 Jul 2007, Nick Piggin wrote: Hmmm, I did not notice that yet, but then I have not done much work there. Notice what?

The bad code for the buffer heads.

Oh. Well, my first mail in this thread listed some of the problems with them.

- A real nobh mode. nobh was created I think mainly to avoid problems with buffer_head memory consumption, especially on lowmem machines. It is basically a hack (sorry), which requires special code in filesystems, and duplication of quite a bit of tricky buffer layer code (and bugs). It also doesn't work so well for buffers with non-trivial private data (like most journalling ones). fsblock implements this with basically a few lines of code, and it should work in situations like ext3.

Hmmm. That means simply using the page struct is not working...

I don't understand you. jbd needs to attach private data to each bh, and that can stay around for longer than the life of the page in the pagecache.

Right. So just using the page struct alone won't work for the filesystems.

There are no changes to the filesystem API for large pages (although I am adding a couple of helpers to do page based bitmap ops). And I don't want to rely on contiguous memory. Why do you think handling of large pages (presumably you mean larger than page sized blocks) is strange?

We already have a way to handle large pages: compound pages.

Yes, but I don't want to use large pages and I am not going to use them (at least, they won't be mandatory).

Conglomerating the constituent pages via the pagecache radix-tree seems logical to me.

Meaning overhead to handle each page still exists? This scheme cannot handle large contiguous blocks as a single entity?

Of course some things have to be done per-page if the pages are not contiguous. I actually haven't seen that to be a problem, or have much reason to think it will suddenly become a problem (although I do like Andrea's config page sizes approach for really big systems that cannot change their HW page size).
Re: [RFC] fsblock
On Monday 09 July 2007, Christoph Lameter wrote:

On Tue, 10 Jul 2007, Nick Piggin wrote: There are no changes to the filesystem API for large pages (although I am adding a couple of helpers to do page based bitmap ops). And I don't want to rely on contiguous memory. Why do you think handling of large pages (presumably you mean larger than page sized blocks) is strange?

We already have a way to handle large pages: compound pages.

Um, no, we don't, assuming by compound pages you mean order >0 pages. None of the stack of changes necessary to make these pages viable has yet been accepted, ie antifrag, defrag, and variable page cache. While these changes may yet all go in and work wonderfully, I applaud Nick's alternative solution that does not include a dependency on them.

Dave McCracken
Re: [RFC] fsblock
On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote:

- In line with the above item, filesystem block allocation is performed before a page is dirtied. In the buffer layer, mmap writes can dirty a page with no backing blocks which is a problem if the filesystem is ENOSPC (patches exist for buffer.c for this).

This raises an eyebrow... The handling of ENOSPC prior to mmap write is more an ABI behavior, so I don't see how this can be fixed with internal changes, yet without changing behavior currently exported to userland (and thus affecting code based on such assumptions).

Not really, the current behaviour is a bug. And it's not actually buffer layer specific - XFS now has a fix for that bug and it's generic enough that everyone could use it.
Re: [RFC] fsblock
On Mon, Jun 25, 2007 at 08:25:21AM -0400, Chris Mason wrote:

write_begin/write_end is a step in that direction (and it helps OCFS and GFS quite a bit). I think there is also not much reason for writepage call sites to lock the page and clear the dirty bit themselves (which has always seemed ugly to me).

If we keep the page mapping information with the page all the time (ie writepage doesn't have to call get_block ever), it may be possible to avoid sending down a locked page. But, I don't know the delayed allocation internals well enough to say for sure if that is true.

The point of delayed allocations is that the mapping information doesn't even exist until writepage for new allocations :)
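To illustrate the delayed allocation point: at write time the filesystem only reserves space and marks the buffer with BH_Delay; no block number exists until writeback asks for one. A minimal sketch, assuming a hypothetical fs_reserve_blocks() helper (BH_Delay and the get_block calling convention are real buffer-layer ones; exactly which other flags a filesystem sets here varies):

    #include <linux/fs.h>
    #include <linux/buffer_head.h>

    /* hypothetical space accounting helper */
    static int fs_reserve_blocks(struct inode *inode, unsigned int nr);

    static int example_get_block(struct inode *inode, sector_t iblock,
                                 struct buffer_head *bh, int create)
    {
        if (!create)
            return 0;       /* reading a hole: leave bh unmapped */

        /*
         * Delayed allocation: reserve space so ENOSPC surfaces here,
         * but do not pick a disk block yet.  Writepage later sees
         * buffer_delay() and must perform the real allocation --
         * which is why the mapping "doesn't exist until writepage".
         */
        if (fs_reserve_blocks(inode, 1))
            return -ENOSPC;

        set_buffer_new(bh);
        set_buffer_delay(bh);   /* BH_Delay: reserved, not yet placed */
        return 0;
    }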
Re: [RFC] fsblock
Warning ahead: I've only briefly skimmed over the patches, so the comments in this mail are very high-level.

On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:

fsblock is a rewrite of the buffer layer (ding dong the witch is dead), which I have been working on, on and off, and is now at the stage where some of the basics are working-ish. This email is going to be long...

Firstly, what is the buffer layer? The buffer layer isn't really a buffer layer as in the buffer cache of unix: the block device cache is unified with the pagecache (in terms of the pagecache, a blkdev file is just like any other, but with a 1:1 mapping between offset and block). There are filesystem APIs to access the block device, but these go through the block device pagecache as well. These don't exactly define the buffer layer either. The buffer layer is a layer between the pagecache and the block device for block based filesystems. It keeps a translation between logical offset and physical block number, as well as meta information such as locks, dirtyness, and IO status of each block. This information is tracked via the buffer_head structure.

The traditional unix buffer cache is always physical block indexed and used for all data/metadata/blockdevice node access. There have been a lot of variants of schemes where data or some data is in a separate inode,logical block indexed scheme. Most modern OSes including Linux now always do the inode,logical block index with some noop substitute for the metadata and block device node variants of operation. Now what you replace is a really crappy hybrid of a traditional unix buffercache implemented on top of the pagecache for the block device node (for metadata), and a lot of abuse of the same data structure as used in the buffercache for keeping metainformation about the actual data mapping.

Why rewrite the buffer layer? Lots of people have had a desire to completely rip out the buffer layer, but we can't do that[*] because it does actually serve a useful purpose. Why the bad rap? Because the code is old and crufty, and buffer_head is an awful name. It must be among the oldest code in the core fs/vm, and the main reason is because of the inertia of so many and such complex filesystems.

Actually most of the code is no older than 10 years. Just compare fs/buffer.c in 2.2 and 2.6. buffer_head is a perfectly fine name for one of its uses in the traditional buffercache. I also think there is little to no reason to get rid of that use: this buffercache is what most linux block-based filesystems (except xfs and jfs most notably) are written to, and it fits them very nicely. What I'd really like to see is to get rid of the abuse of struct buffer_head in the data path, and the sometimes too intimate coupling of the buffer cache with page cache internals.

- Data / metadata separation. I have a struct fsblock and a struct fsblock_meta, so we could put more stuff into the usually less used fsblock_meta without bloating it up too much. After a few tricks, these are no longer any different in my code, and dirty up the typing quite a lot (and I'm aware it still has some warnings, thanks). So if not useful this could be taken out.

That's what I mean. And from a quick glimpse at your code they're still far too deeply coupled in fsblock. Really, we don't want to share anything between the buffer cache and data mapping operations - they are so deeply different that this sharing is what creates the enormous complexity we have to deal with.

- No deadlocks (hopefully).
The buffer layer is technically deadlocky by design, because it can require memory allocations at page writeout-time. It also has one path that cannot tolerate memory allocation failures. No such problems for fsblock, which keeps fsblock metadata around for as long as a page is dirty (this still has problems vs get_user_pages, but that's going to require an audit of all get_user_pages sites. Phew).

The whole concept of delayed allocation requires page allocations at writeout time, as do various network protocols or even storage drivers.

- In line with the above item, filesystem block allocation is performed before a page is dirtied. In the buffer layer, mmap writes can dirty a page with no backing blocks which is a problem if the filesystem is ENOSPC (patches exist for buffer.c for this).

Not really something that is the block layer's fault, but rather the laziness of the filesystem maintainers.

- Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We can go up to about 32MB blocks now, and gigabyte+ blocks would only require one more bit in the fsblock flags. fsblock_superpage blocks are > PAGE_CACHE_SIZE, midpage ==, and subpage <. Core pagecache code is pretty creaky with respect to this. I think it is mostly race free, but it requires stupid
Re: [RFC] fsblock
Christoph Hellwig wrote:

On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote: - In line with the above item, filesystem block allocation is performed before a page is dirtied. In the buffer layer, mmap writes can dirty a page with no backing blocks which is a problem if the filesystem is ENOSPC (patches exist for buffer.c for this). This raises an eyebrow... The handling of ENOSPC prior to mmap write is more an ABI behavior, so I don't see how this can be fixed with internal changes, yet without changing behavior currently exported to userland (and thus affecting code based on such assumptions).

Not really, the current behaviour is a bug. And it's not actually buffer layer specific - XFS now has a fix for that bug and it's generic enough that everyone could use it.

I'm not sure I follow. If you require block allocation at mmap(2) time, rather than when a page is actually dirtied, you are denying userspace the ability to do sparse files with mmap. A quick Google readily turns up people who have built upon the mmap-sparse-file assumption, and I don't think we want to break those assumptions as a bug fix. Where is the bug?

Jeff
Re: [RFC] fsblock
On Sat, Jun 30, 2007 at 07:10:27AM -0400, Jeff Garzik wrote:

Not really, the current behaviour is a bug. And it's not actually buffer layer specific - XFS now has a fix for that bug and it's generic enough that everyone could use it.

I'm not sure I follow. If you require block allocation at mmap(2) time, rather than when a page is actually dirtied, you are denying userspace the ability to do sparse files with mmap. A quick Google readily turns up people who have built upon the mmap-sparse-file assumption, and I don't think we want to break those assumptions as a bug fix. Where is the bug?

It's not mmap time but page dirtying time. Currently the default behaviour is not to allocate at page dirtying time but rather at writeout time in some scenarios. (and s/allocation/reservation/ applies for delalloc of course)
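For reference, the hook that makes "allocate/reserve at page dirtying time" possible for mmap is ->page_mkwrite, which the VM calls before letting a write fault dirty a shared mapped page. A sketch of how a filesystem might use it (fs_reserve_blocks_for_page() is hypothetical, and the exact locking and signature details of the early page_mkwrite versions may differ from this):

    #include <linux/mm.h>
    #include <linux/fs.h>

    /* hypothetical reservation helper */
    static int fs_reserve_blocks_for_page(struct inode *inode,
                                          struct page *page);

    static int example_page_mkwrite(struct vm_area_struct *vma,
                                    struct page *page)
    {
        struct inode *inode = vma->vm_file->f_mapping->host;
        int ret;

        lock_page(page);
        /*
         * Reserve (or allocate) the blocks backing this page now, so
         * a full filesystem fails the fault instead of silently
         * dropping the data at writeout time.
         */
        ret = fs_reserve_blocks_for_page(inode, page);
        unlock_page(page);
        return ret;     /* non-zero fails the write fault */
    }

    static struct vm_operations_struct example_vm_ops = {
        .page_mkwrite = example_page_mkwrite,
    };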
Re: [RFC] fsblock
On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote:

On Thu, Jun 28, 2007 at 08:35:48AM +1000, David Chinner wrote: On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote:

Let's look at a typical example of how IO actually gets done today, starting with sys_write():

sys_write(file, buffer, 1MB)
  for each page:
    prepare_write()
      allocate contiguous chunks of disk
      attach buffers
    copy_from_user()
    commit_write()
      dirty buffers

pdflush:
  writepages()
    find pages with contiguous chunks of disk
    build and submit large bios

So, we replace prepare_write and commit_write with an extent based api, but we keep the dirty each buffer part. writepages has to turn that back into extents (bio sized), and the result is completely full of dark dark corner cases.

That's true, but I don't think an extent data structure means we can become too far divorced from the pagecache or the native block size -- what will end up happening is that often we'll need stuff to map between all those as well, even if it is only at IO-time.

I think the fundamental difference is that fsblock still does: mapping_info = page->something, where something is attached on a per page basis. What we really want is mapping_info = lookup_mapping(page), where that function goes and finds something stored on a per extent basis, with extra bits for tracking dirty and locked state. Ideally, in at least some of the cases the dirty and locked state could be at an extent granularity (streaming IO) instead of the block granularity (random IO). In my little brain, even block based filesystems should be able to take advantage of this... but such things are always easier to believe in before the coding starts. (A sketch of this lookup_mapping idea follows at the end of this message.)

But the point is taken, and I do believe that at least for APIs, extent based seems like the best way to go. And that should allow fsblock to be replaced or augmented in future without _too_ much pain.

Yup - I've been on the painful end of those dark corner cases several times in the last few months. It's also worth pointing out that mpage_readpages() already works on an extent basis - it overloads bufferheads to provide a map_bh that can point to a range of blocks in the same state. The code then iterates the map_bh range a page at a time building bios (i.e. not even using buffer heads) from that map.

One issue I have with the current nobh and mpage stuff is that it requires multiple calls into get_block (first to prepare write, then to writepage), it doesn't allow filesystems to attach resources required for writeout at prepare_write time, and it doesn't play nicely with buffers in general. (not to mention that nobh error handling is buggy). I haven't done any mpage-like code for fsblocks yet, but I think they wouldn't be too much trouble, and wouldn't have any of the above problems...

Could be, but the fundamental issue of sometimes pages have mappings attached and sometimes they don't is still there. The window is smaller, but non-zero.

-chris
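A sketch of the lookup_mapping() shape proposed above (every name here is hypothetical -- this is the direction being argued for, not code from any posted patch): per-range state lives in a side tree keyed by file offset, and pages consult it instead of carrying per-block structures in page->private:

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* hypothetical per-extent record replacing per-page/per-block state */
    struct extent_info {
        u64 start, end;         /* byte range within the file */
        u64 disk_start;         /* on-disk location, or DELALLOC/UNWRITTEN */
        unsigned long state;    /* EXTENT_DIRTY, EXTENT_LOCKED, ... */
    };

    /* hypothetical range-tree search; the tree type (btree vs radix)
     * is exactly the open question in this thread */
    struct extent_info *extent_tree_search(struct address_space *mapping,
                                           u64 offset);

    static struct extent_info *lookup_mapping(struct page *page)
    {
        u64 offset = (u64)page->index << PAGE_CACHE_SHIFT;

        /* one record can answer for many pages of a streaming write */
        return extent_tree_search(page->mapping, offset);
    }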
Re: [RFC] fsblock
On Thu, Jun 28, 2007 at 08:20:31AM -0400, Chris Mason wrote:

On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote: That's true, but I don't think an extent data structure means we can become too far divorced from the pagecache or the native block size -- what will end up happening is that often we'll need stuff to map between all those as well, even if it is only at IO-time.

I think the fundamental difference is that fsblock still does: mapping_info = page->something, where something is attached on a per page basis. What we really want is mapping_info = lookup_mapping(page), where that function goes and finds something stored on a per extent basis, with extra bits for tracking dirty and locked state. Ideally, in at least some of the cases the dirty and locked state could be at an extent granularity (streaming IO) instead of the block granularity (random IO). In my little brain, even block based filesystems should be able to take advantage of this... but such things are always easier to believe in before the coding starts.

Now I wouldn't for a minute deny that at least some of the block information would be well to store in extent/tree format (if XFS does it, it must be good!). And yes, I'm sure filesystems with even basic block based allocation could get a reasonable ratio of blocks to extents. However I think it is fundamentally another layer, or at least more complexity... fsblocks uses the existing pagecache mapping as (much of) the data structure and uses the existing pagecache locking for the locking. And it fundamentally just provides a block access and IO layer into the pagecache for the filesystem, which I think will often be needed anyway.

But that said, I would like to see a generic extent mapping layer sitting between fsblock and the filesystem (I might even have a crack at it myself)... and I could be proven completely wrong, and it may be that fsblock isn't required at all after such a layer goes in. So I will try to keep all the APIs extent based. The first thing I actually looked at for get_blocks was for the filesystem to build up a tree of mappings itself, completely unconnected from the pagecache. It just ended up being a little more work and locking, but the idea isn't insane :)

One issue I have with the current nobh and mpage stuff is that it requires multiple calls into get_block (first to prepare write, then to writepage), it doesn't allow filesystems to attach resources required for writeout at prepare_write time, and it doesn't play nicely with buffers in general. (not to mention that nobh error handling is buggy). I haven't done any mpage-like code for fsblocks yet, but I think they wouldn't be too much trouble, and wouldn't have any of the above problems...

Could be, but the fundamental issue of sometimes pages have mappings attached and sometimes they don't is still there. The window is smaller, but non-zero.

The aim for fsblocks is that any page under IO will always have fsblocks, which I hope is going to make this easy. In the fsblocks patch I sent out there is a window (with mmapped pages), however that's a bug which can be fixed rather than a fundamental problem. So writepages will be less of a problem. Readpages may indeed be more efficient at block mapping with extents than with individual fsblocks (or it could be, if it were an extent based API itself).

Well, I don't know. Extents are always going to have benefits, but I don't know if it means the fsblock part could go away completely. I'll keep it in mind though.
Re: [RFC] fsblock
On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:

I think using fsblock to drive the IO and keep the pagecache flags uptodate and using a btree in the filesystem to manage extents of block allocations wouldn't be a bad idea though. Do any filesystems actually do this?

Yes. XFS. But we still need to hold state in buffer heads (BH_delay, BH_unwritten) that is needed to determine what type of allocation/extent conversion is necessary during writeback, i.e. what we originally mapped the page as during the ->prepare_write call.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
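Sketched out, the per-block decision XFS makes during writeback based on that buffer-head state looks roughly like this (loosely modelled on the xfs_aops.c of this era; the function and enum names here are made up, but BH_Delay/BH_Unwritten and the three cases are real):

    #include <linux/buffer_head.h>

    enum io_type { IO_DELAY, IO_UNWRITTEN, IO_OVERWRITE };

    /* decide what writeback must do for this block, based on the flags
     * that were set when the page was originally mapped */
    static enum io_type classify_buffer(struct buffer_head *bh)
    {
        if (buffer_unwritten(bh))
            return IO_UNWRITTEN;    /* extent allocated: convert after IO */
        if (buffer_delay(bh))
            return IO_DELAY;        /* only reserved: allocate blocks now */
        return IO_OVERWRITE;        /* already allocated: just write */
    }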
Re: [RFC] fsblock
On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:

On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote: On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote: On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:

[ ... fsblocks vs extent range mapping ]

iomaps can double as range locks simply because iomaps are expressions of ranges within the file. Seeing as you can only access a given range exclusively to modify it, inserting an empty mapping into the tree as a range lock gives an effective method of allowing safe parallel reads, writes and allocation into the file. The fsblocks and the vm page cache interface cannot be used to facilitate this because a radix tree is the wrong type of tree to store this information in. A sparse, range based tree (e.g. btree) is the right way to do this and it matches very well with a range based API.

I'm really not against the extent based page cache idea, but I kind of assumed it would be too big a change for this kind of generic setup. At any rate, if we'd like to do it, it may be best to ditch the idea of "attach mapping information to a page", and switch to "lookup mapping information and range locking for a page".

Well, the get_block equivalent API is an extent based one now, and I'll look at what is required in making map_fsblock a more generic call that could be used for an extent-based scheme. An extent based thing IMO really isn't appropriate as the main generic layer here though. If it is really useful and popular, then it could be turned into generic code and sit alongside fsblock or underneath fsblock...

Let's look at a typical example of how IO actually gets done today, starting with sys_write():

sys_write(file, buffer, 1MB)
  for each page:
    prepare_write()
      allocate contiguous chunks of disk
      attach buffers
    copy_from_user()
    commit_write()
      dirty buffers

pdflush:
  writepages()
    find pages with contiguous chunks of disk
    build and submit large bios

So, we replace prepare_write and commit_write with an extent based api, but we keep the dirty each buffer part. writepages has to turn that back into extents (bio sized), and the result is completely full of dark dark corner cases.

I do think fsblocks is a nice cleanup on its own, but Dave has a good point that it makes sense to look for ways to generalize things even more.

-chris
Re: [RFC] fsblock
On Jun 26, 2007, at 07:14:14, Nick Piggin wrote:

On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote: Can we call it a block mapping layer or something like that? e.g. struct blkmap?

I'm not fixed on fsblock, but blkmap doesn't grab me either. It is a map from the pagecache to the block layer, but blkmap sounds like it is a map from the block to somewhere. fsblkmap ;)

vmblock? pgblock?

Cheers, Kyle Moffett
Re: [RFC] fsblock
On 27 Jun 2007, at 12:50, Chris Mason wrote:

On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote: On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote: On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote: On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:

[ ... fsblocks vs extent range mapping ]

iomaps can double as range locks simply because iomaps are expressions of ranges within the file. Seeing as you can only access a given range exclusively to modify it, inserting an empty mapping into the tree as a range lock gives an effective method of allowing safe parallel reads, writes and allocation into the file. The fsblocks and the vm page cache interface cannot be used to facilitate this because a radix tree is the wrong type of tree to store this information in. A sparse, range based tree (e.g. btree) is the right way to do this and it matches very well with a range based API.

I'm really not against the extent based page cache idea, but I kind of assumed it would be too big a change for this kind of generic setup. At any rate, if we'd like to do it, it may be best to ditch the idea of "attach mapping information to a page", and switch to "lookup mapping information and range locking for a page".

Well, the get_block equivalent API is an extent based one now, and I'll look at what is required in making map_fsblock a more generic call that could be used for an extent-based scheme. An extent based thing IMO really isn't appropriate as the main generic layer here though. If it is really useful and popular, then it could be turned into generic code and sit alongside fsblock or underneath fsblock...

Let's look at a typical example of how IO actually gets done today, starting with sys_write():

Yes, this is very inefficient, which is one of the reasons I don't use the generic file write helpers in NTFS. The other reasons are that supporting larger logical block sizes than PAGE_CACHE_SIZE becomes a pain if it is not done this way when the write targets a hole, as that requires all pages in the hole to be locked simultaneously, which would mean dropping the page lock to acquire the others that are of lower page index and to then re-take the page lock, which is horrible - much better to lock all at once from the outset. And the other reason is that in NTFS there is such a thing as the initialized size of an attribute, which basically states "anything past this byte offset must be returned as 0 on read", i.e. it does not have to be read from disk at all. On a write beyond the initialized_size you have to zero on disk everything between the old initialized size and the start of the write before you begin writing, and certainly before you update the initialized_size, otherwise a concurrent read would see random old data from the disk.
For NTFS this effectively becomes:

sys_write(file, buffer, 1MB)
  allocate space for the entire 1MB write
  if write offset past the initialized_size:
    zero out on disk starting at initialized_size up to the start
      offset of the write and update the initialized_size to be
      equal to the start offset of the write
  do {
    if (current position is in a hole and the NTFS logical block
        size is > PAGE_CACHE_SIZE) {
      /* work on (NTFS logical block size / PAGE_CACHE_SIZE)
         pages in one go */
      do_pages = vol->cluster_size / PAGE_CACHE_SIZE;
    } else {
      /* work on only one page */
      do_pages = 1;
    }
    fault in for read (do_pages * PAGE_CACHE_SIZE) bytes worth
      of source pages
    grab do_pages worth of pages
    prepare_write - attach buffers to grabbed pages
    copy data from source to grabbed/prepared pages
    commit_write the copied pages by dirtying their buffers
  } while (data left to write);

The allocation in advance is a huge win, both in terms of avoiding fragmentation (NTFS still uses a very simple/stupid allocator, so you get a lot of fragmentation if two processes write to different files simultaneously and do so in small chunks) and in terms of performance.

I have wondered whether I should perhaps turn on the multi page stuff for all writes rather than just for ones that go into a hole where the logical block size is greater than PAGE_CACHE_SIZE, as that might improve performance even further, but I haven't had the time/inclination to experiment...

And I have also wondered whether to go direct to bio/whole pages at once instead of bothering with dirtying each buffer, but the buffers (which are always 512 bytes on NTFS) allow me to easily support dirtying smaller parts of the page, which is desired at least on volumes with a logical block size < PAGE_CACHE_SIZE, as different bits of the page could then reside in completely different locations on disk, so writing out unneeded bits of the page could result in a lot of wasted disk head seek times.

Best regards, Anton
Re: [RFC] fsblock
On Thu, Jun 28, 2007 at 08:35:48AM +1000, David Chinner wrote:

On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote:

Let's look at a typical example of how IO actually gets done today, starting with sys_write():

sys_write(file, buffer, 1MB)
  for each page:
    prepare_write()
      allocate contiguous chunks of disk
      attach buffers
    copy_from_user()
    commit_write()
      dirty buffers

pdflush:
  writepages()
    find pages with contiguous chunks of disk
    build and submit large bios

So, we replace prepare_write and commit_write with an extent based api, but we keep the dirty each buffer part. writepages has to turn that back into extents (bio sized), and the result is completely full of dark dark corner cases.

That's true, but I don't think an extent data structure means we can become too far divorced from the pagecache or the native block size -- what will end up happening is that often we'll need stuff to map between all those as well, even if it is only at IO-time. But the point is taken, and I do believe that at least for APIs, extent based seems like the best way to go. And that should allow fsblock to be replaced or augmented in future without _too_ much pain.

Yup - I've been on the painful end of those dark corner cases several times in the last few months. It's also worth pointing out that mpage_readpages() already works on an extent basis - it overloads bufferheads to provide a map_bh that can point to a range of blocks in the same state. The code then iterates the map_bh range a page at a time building bios (i.e. not even using buffer heads) from that map. (A condensed sketch of this map_bh trick follows below.)

One issue I have with the current nobh and mpage stuff is that it requires multiple calls into get_block (first to prepare write, then to writepage), it doesn't allow filesystems to attach resources required for writeout at prepare_write time, and it doesn't play nicely with buffers in general. (not to mention that nobh error handling is buggy). I haven't done any mpage-like code for fsblocks yet, but I think they wouldn't be too much trouble, and wouldn't have any of the above problems...
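A condensed sketch of the mpage_readpages() trick described above (loosely after fs/mpage.c; error handling and the per-page iteration are omitted): one dummy buffer_head doubles as an extent descriptor by asking get_block for a multi-block mapping through b_size:

    #include <linux/fs.h>
    #include <linux/buffer_head.h>

    /* returns 1 if a contiguous disk range starts at 'block', 0 for a hole */
    static int map_extent(struct inode *inode, sector_t block,
                          struct buffer_head *map_bh, get_block_t *get_block)
    {
        map_bh->b_state = 0;
        map_bh->b_size = 1 << 20;   /* request up to 1MB in one mapping */

        if (get_block(inode, block, map_bh, 0))
            return -EIO;
        if (!buffer_mapped(map_bh))
            return 0;               /* hole: caller zero-fills the page */

        /*
         * get_block trims b_size to what is actually contiguous: blocks
         * [block, block + (b_size >> inode->i_blkbits)) start at disk
         * block map_bh->b_blocknr.  One call maps many fs blocks, and
         * the caller builds bios straight from this range.
         */
        return 1;
    }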
Re: [RFC] fsblock
On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:

David Chinner wrote: On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote: I'm announcing fsblock now because it is quite intrusive and so I'd like to get some thoughts about significantly changing this core part of the kernel.

Can you rename it to something other than shorthand for "filesystem block"? e.g. when you say "- In line with the above item, filesystem block allocation is performed", what are we actually talking about here? Filesystem block allocation is something a filesystem does to allocate blocks on disk, not allocate a mapping structure in memory. Realistically, this is not about filesystem blocks, this is about file offset to disk blocks. i.e. it's a mapping.

Yeah, fsblock ~= the layer between the fs and the block layers.

Sure, but it's not a filesystem block, which is what you are calling it. IMO, it's overloading a well known term with something different, and that's just confusing. Can we call it a block mapping layer or something like that? e.g. struct blkmap?

Probably better would be to move towards offset,length rather than page based fs APIs where everything can be batched up nicely and this sort of non-trivial locking can be more optimal.

If we are going to turn over the API completely like this, can we seriously look at moving to this sort of interface at the same time?

Yeah, we can move to anything. But note that fsblock is perfectly happy with <= PAGE_CACHE_SIZE blocks today, and isn't _terrible_ at >.

Extent based block mapping is entirely independent of block size. Please don't confuse the two. With an offset/len interface, we can start to track contiguous ranges of blocks rather than persisting with a structure per filesystem block. If you want to save memory, that's where we need to go. XFS uses iomaps for this purpose - it's basically:

- start offset into file
- start block on disk
- length of mapping
- state

With special disk blocks for indicating delayed allocation blocks (-1) and unwritten extents (-2). Worst case, we end up with an iomap per filesystem block.

I was thinking about doing an extent based scheme, but it has some issues as well. Block based is light weight and simple, it aligns nicely with the pagecache structures.

Yes. Block based is simple, but has flexibility and scalability problems. e.g. the number of fsblocks that are required to map large files. It's not uncommon for us to have millions of bufferheads lying around after writing a single large file that only has a handful of extents. That's 5-6 orders of magnitude difference there in memory usage, and as memory and disk sizes get larger, this will become more of a problem. If we allow iomaps to be split and combined along with range locking, we can parallelise read and write access to each file on an iomap basis, etc. There's plenty of goodness that comes from indexing by range.

Some operations AFAIKS will always need to be per-page (eg. in the core VM it wants to lock a single page to fault it in, or wait for a single page to writeout etc). So I didn't see a huge gain in a one-lock-per-extent type arrangement.

For VM operations, no, but they would continue to be locked on a per-page basis. However, we can do filesystem block operations without needing to hold page locks. e.g. space reservation and allocation.

If you're worried about parallelisability, then I don't see what iomaps give you that buffer heads or fsblocks do not? In fact they would be worse because there are fewer of them? :)

No, that's wrong.
I'm not talking about VM parallelisation, I want to be able to support multiple writers to a single file, i.e. removing the i_mutex restriction on writes. To do that you've got to have a range locking scheme integrated into the block map for the file so that concurrent lookups and allocations don't trip over each other. iomaps can double as range locks simply because iomaps are expressions of ranges within the file. Seeing as you can only access a given range exclusively to modify it, inserting an empty mapping into the tree as a range lock gives an effective method of allowing safe parallel reads, writes and allocation into the file.

The fsblocks and the vm page cache interface cannot be used to facilitate this because a radix tree is the wrong type of tree to store this information in. A sparse, range based tree (e.g. btree) is the right way to do this and it matches very well with a range based API. None of what I'm talking about requires any changes to the existing page cache or VM address space. I'm proposing that we should treat the block mapping as an address space in its own right, i.e. perhaps the struct page should not have block mapping objects attached to it at all. By separating out the block mapping from the page cache, we make the page cache completely independent of filesystem block size,
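A sketch of the iomap being described, paraphrased from the description earlier in this message rather than copied from the XFS source (field and constant names are illustrative):

    #include <linux/types.h>

    #define IOMAP_DELALLOC  ((sector_t)-1)  /* space reserved, not yet placed */
    #define IOMAP_UNWRITTEN ((sector_t)-2)  /* allocated, unwritten extent */

    /* one mapping describes an arbitrary-length range of the file, so a
     * million-block file with a handful of extents needs only a handful
     * of these, and an empty one can stand in as a range lock */
    struct iomap {
        loff_t        offset;   /* start offset into file */
        sector_t      blkno;    /* start block on disk, or special value */
        loff_t        length;   /* length of mapping */
        unsigned int  state;    /* lock/dirty state for the whole range */
    };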
Re: [RFC] fsblock
On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:

On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote: Realistically, this is not about filesystem blocks, this is about file offset to disk blocks. i.e. it's a mapping. Yeah, fsblock ~= the layer between the fs and the block layers.

Sure, but it's not a filesystem block, which is what you are calling it. IMO, it's overloading a well known term with something different, and that's just confusing.

Well, it is the metadata used to manage the filesystem block for the given bit of pagecache (even if the block is not actually allocated or even a hole, it is deemed to be so by the filesystem).

Can we call it a block mapping layer or something like that? e.g. struct blkmap?

I'm not fixed on fsblock, but blkmap doesn't grab me either. It is a map from the pagecache to the block layer, but blkmap sounds like it is a map from the block to somewhere. fsblkmap ;)

Probably better would be to move towards offset,length rather than page based fs APIs where everything can be batched up nicely and this sort of non-trivial locking can be more optimal. If we are going to turn over the API completely like this, can we seriously look at moving to this sort of interface at the same time? Yeah, we can move to anything. But note that fsblock is perfectly happy with <= PAGE_CACHE_SIZE blocks today, and isn't _terrible_ at >.

Extent based block mapping is entirely independent of block size. Please don't confuse the two.

I'm not, but it seemed like you were confused that fsblock is tied to changing the aops APIs. It is not, but they can be changed to give improvements in a good number of areas (*including* better large block support).

With special disk blocks for indicating delayed allocation blocks (-1) and unwritten extents (-2). Worst case, we end up with an iomap per filesystem block.

I was thinking about doing an extent based scheme, but it has some issues as well. Block based is light weight and simple, it aligns nicely with the pagecache structures.

Yes. Block based is simple, but has flexibility and scalability problems. e.g. the number of fsblocks that are required to map large files. It's not uncommon for us to have millions of bufferheads lying around after writing a single large file that only has a handful of extents. That's 5-6 orders of magnitude difference there in memory usage, and as memory and disk sizes get larger, this will become more of a problem.

I guess fsblock being 3 times smaller, and you probably having 16 times fewer of them for such a filesystem (given a 4K page size), still leaves a few orders of magnitude ;) However, fsblock has this nice feature where it can drop the blocks when the last reference goes away, so you really only have fsblocks around for dirty or currently-being-read blocks...

But you give me a good idea: I'll gear the filesystem-side APIs to be more extent based as well (eg. fsblock's get_block equivalent). That way it should be much easier to change over to such extents in future, or even have an extent based representation sitting in front of the fsblock one and acting as a high density cache in your above situation.

If we allow iomaps to be split and combined along with range locking, we can parallelise read and write access to each file on an iomap basis, etc. There's plenty of goodness that comes from indexing by range.

Some operations AFAIKS will always need to be per-page (eg. in the core VM it wants to lock a single page to fault it in, or wait for a single page to writeout etc).
So I didn't see a huge gain in a one-lock-per-extent type arrangement.

For VM operations, no, but they would continue to be locked on a per-page basis. However, we can do filesystem block operations without needing to hold page locks. e.g. space reservation and allocation.

You could do that without holding the page locks as well AFAIKS. Actually, again it might be a bit troublesome with the current aops APIs, but I don't think fsblock stands in your way there either.

If you're worried about parallelisability, then I don't see what iomaps give you that buffer heads or fsblocks do not? In fact they would be worse because there are fewer of them? :)

No, that's wrong. I'm not talking about VM parallelisation, I want to be able to support multiple writers to a single file, i.e. removing the i_mutex restriction on writes. To do that you've got to have a range locking scheme integrated into the block map for the file so that concurrent lookups and allocations don't trip over each other. iomaps can double as range locks simply because iomaps are expressions of ranges within the file. Seeing as you can only access a given range exclusively to modify it, inserting an empty mapping into the tree as a range lock gives an effective method of allowing safe parallel reads, writes and allocation
Re: [RFC] fsblock
On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:

On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:

[ ... fsblocks vs extent range mapping ]

iomaps can double as range locks simply because iomaps are expressions of ranges within the file. Seeing as you can only access a given range exclusively to modify it, inserting an empty mapping into the tree as a range lock gives an effective method of allowing safe parallel reads, writes and allocation into the file. The fsblocks and the vm page cache interface cannot be used to facilitate this because a radix tree is the wrong type of tree to store this information in. A sparse, range based tree (e.g. btree) is the right way to do this and it matches very well with a range based API.

I'm really not against the extent based page cache idea, but I kind of assumed it would be too big a change for this kind of generic setup. At any rate, if we'd like to do it, it may be best to ditch the idea of "attach mapping information to a page", and switch to "lookup mapping information and range locking for a page".

A btree could be used to hold the range mapping and locking, but it could just as easily be a radix tree where you do a gang lookup for the end of the range (the same way my placeholder patch did). It'll still find intersecting range locks, but is much faster for random insertion/deletion than the btrees. (A sketch of the gang-lookup approach follows below.)

-chris
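A sketch of the gang-lookup idea (the data structures are hypothetical, in the spirit of the placeholder patch mentioned above): if each held lock is stored in a radix tree indexed by its *last* page, a one-item gang lookup starting at our first page returns the only lock that could intersect us, assuming held locks are disjoint:

    #include <linux/radix-tree.h>
    #include <linux/pagemap.h>

    struct range_lock {
        pgoff_t start, end;     /* inclusive page range */
    };

    /* the tree is indexed by rl->end, so the gang lookup from 'start'
     * returns the lock with the smallest end >= start; with disjoint
     * held locks, any other lock begins beyond that one and so cannot
     * overlap [start, end] either */
    static struct range_lock *find_conflicting(struct radix_tree_root *tree,
                                               pgoff_t start, pgoff_t end)
    {
        struct range_lock *rl;

        if (radix_tree_gang_lookup(tree, (void **)&rl, start, 1) == 1 &&
            rl->start <= end)
            return rl;  /* [start,end] intersects [rl->start,rl->end] */
        return NULL;
    }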
Re: [RFC] fsblock
On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:

On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote: On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:

[ ... fsblocks vs extent range mapping ]

iomaps can double as range locks simply because iomaps are expressions of ranges within the file. Seeing as you can only access a given range exclusively to modify it, inserting an empty mapping into the tree as a range lock gives an effective method of allowing safe parallel reads, writes and allocation into the file. The fsblocks and the vm page cache interface cannot be used to facilitate this because a radix tree is the wrong type of tree to store this information in. A sparse, range based tree (e.g. btree) is the right way to do this and it matches very well with a range based API.

I'm really not against the extent based page cache idea, but I kind of assumed it would be too big a change for this kind of generic setup. At any rate, if we'd like to do it, it may be best to ditch the idea of "attach mapping information to a page", and switch to "lookup mapping information and range locking for a page".

Well, the get_block equivalent API is an extent based one now, and I'll look at what is required in making map_fsblock a more generic call that could be used for an extent-based scheme. An extent based thing IMO really isn't appropriate as the main generic layer here though. If it is really useful and popular, then it could be turned into generic code and sit alongside fsblock or underneath fsblock...

It definitely isn't trivial to drive the IO directly from something like that which doesn't correspond to filesystem block size: splitting parts of your extent tree when things go dirty or uptodate or partially under IO, etc., joining things back up again when they are mergeable. Not that it would be impossible, but it would be a lot more heavyweight than fsblock.

I think using fsblock to drive the IO and keep the pagecache flags uptodate and using a btree in the filesystem to manage extents of block allocations wouldn't be a bad idea though. Do any filesystems actually do this?
Re: [RFC] fsblock
Andi Kleen wrote:

Nick Piggin [EMAIL PROTECTED] writes: - Structure packing. A page gets a number of buffer heads that are allocated in a linked list. fsblocks are allocated contiguously, so cacheline footprint is smaller in the above situation.

It would be interesting to test if that makes a difference for database benchmarks running over file systems. Databases eat a lot of cache, so in theory any cache improvements in the kernel (which often runs cache cold then) should be beneficial. But I guess it would need at least ext2 to test; Minix is probably not good enough.

Yeah, you are right. ext2 would be cool to port, as it would be a reasonable platform for basic performance testing and comparisons.

In general, have you benchmarked the CPU overhead of old vs new code? e.g. when we went to BIO, scalability went up, but CPU costs of a single request also went up. It would be nice to not continue, or better, reverse that trend.

At the moment there are still a few silly things in the code, such as always calling the insert_mapping indirect function (which is the get_block equivalent). And it does a bit more RMWing than it should still. Also, it always goes to the pagecache radix-tree to find fsblocks, whereas the buffer layer has a per-CPU cache front-end... so in that regard, fsblock is really designed with lockless pagecache in mind, where find_get_page is much faster even in the serial case (though fsblock shouldn't exactly be slow with the current pagecache). However, I don't think there are any fundamental performance problems with fsblock. It even uses one less layer of locking to do regular IO compared with buffer.c, so in theory it might even have some advantage. Single threaded performance of request submission is something I will definitely try to keep optimal.

- Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We can go up to about 32MB blocks now, and gigabyte+ blocks would only require one more bit in the fsblock flags. fsblock_superpage blocks are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

Can it be cleanly ifdefed or optimized away?

Yeah, it pretty well stays out of the way when using <= PAGE_CACHE_SIZE size blocks, generally just a single test and branch of an already-used cacheline. It can be optimised away completely by commenting out #define BLOCK_SUPERPAGE_SUPPORT from fsblock.h.

Unless the fragmentation problem is solved, it would seem rather pointless to me. Also I personally still think the right way to approach this is a larger softpage size.

It does not suffer from a fragmentation problem. It will do scatter/gather IO if the pagecache of that block is not contiguous. My naming may be a little confusing: fsblock_superpage (which is a function that returns true if the given fsblock is larger than PAGE_CACHE_SIZE) is just named as to whether the fsblock is larger than a page, rather than having a connection to VM superpages.

Don't get me wrong, I think soft page size is a good idea for other reasons as well (less page metadata and page operations), and that 8 or 16K would probably be a good sweet spot for today's x86 systems.

--
SUSE Labs, Novell Inc.
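To illustrate the packing difference: buffer heads are separately allocated objects chained through b_this_page, so walking a page's blocks hops between slab objects, while fsblocks for a page sit in one contiguous array. A sketch (struct fsblock's layout and the page_blocks() accessor are stand-ins, not the patch's actual definitions):

    #include <linux/mm.h>
    #include <linux/buffer_head.h>

    struct fsblock { unsigned long flags; };        /* stand-in definition */

    static struct fsblock *page_blocks(struct page *page)  /* hypothetical */
    {
        return (struct fsblock *)page->private;
    }

    /* buffer heads: one allocation per block, chained in a ring */
    static unsigned int count_dirty_bh(struct page *page)
    {
        struct buffer_head *head = page_buffers(page), *bh = head;
        unsigned int n = 0;

        do {
            if (buffer_dirty(bh))   /* each hop may miss the cache */
                n++;
            bh = bh->b_this_page;
        } while (bh != head);
        return n;
    }

    /* fsblock: contiguous array, so the walk is sequential in memory */
    static unsigned int count_dirty_fsb(struct page *page, unsigned int nr)
    {
        struct fsblock *block = page_blocks(page);
        unsigned int n = 0, i;

        for (i = 0; i < nr; i++)
            if (block[i].flags & 1) /* pretend bit 0 means dirty */
                n++;
        return n;
    }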
Re: [RFC] fsblock
Chris Mason wrote: On Sun, Jun 24, 2007 at 05:47:55AM +0200, Nick Piggin wrote: My gut feeling is that there are several problem areas you haven't hit yet, with the new code. I would agree with your gut :)

Without having read the code yet (light reading for monday morning ;), ext3 and reiserfs use buffer heads for data=ordered to help them do deadlock free writeback. Basically they need to be able to write out the pending data=ordered pages, potentially with the transaction lock held (or if not held, while blocking new transactions from starting). But, writepage, prepare_write and commit_write all need to start a transaction with the page lock already held. So, if the page lock were used for data=ordered writeback, there would be a lock inversion between the transaction lock and the page lock.

Ah, thanks for that information.

Using buffer heads instead allows the FS to send file data down inside the transaction code, without taking the page lock. So, locking wrt data=ordered is definitely going to be tricky. The best long term option may be making the locking order transaction -> page lock, and change writepage to punt to some other queue when it needs to start a transaction.

Yeah, that's what I would like, and I think it would come naturally if we move away from these pass down a single, locked page APIs in the VM, and let the filesystem do the locking and potentially batching of larger ranges. write_begin/write_end is a step in that direction (and it helps OCFS and GFS quite a bit). I think there is also not much reason for writepage sites to require the caller to lock the page and clear the dirty bit themselves (which has always seemed ugly to me). So yes, I definitely want to move the aops API along with fsblock. That I have tried to keep it within the existing API for the moment is just because that makes things a bit easier... -- SUSE Labs, Novell Inc.
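For concreteness, here is one rough shape the "punt to some other queue" idea could take: defer any page whose writeback needs a transaction, so a worker can take the transaction lock before the page lock. The myfs_* helpers are invented for this sketch; only redirty_page_for_writepage() and unlock_page() are real kernel calls:

	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/writeback.h>

	/* invented helpers, for illustration only */
	extern int myfs_needs_transaction(struct page *page);
	extern void myfs_queue_deferred_write(struct inode *inode, struct page *page);
	extern int myfs_do_writepage(struct page *page, struct writeback_control *wbc);

	static int myfs_writepage(struct page *page, struct writeback_control *wbc)
	{
		struct inode *inode = page->mapping->host;

		if (myfs_needs_transaction(page)) {
			/*
			 * Can't start a transaction with the page lock held:
			 * keep the page dirty, hand it to a worker that takes
			 * the transaction lock first, and return to the VM.
			 */
			redirty_page_for_writepage(wbc, page);
			myfs_queue_deferred_write(inode, page);
			unlock_page(page);
			return 0;
		}
		return myfs_do_writepage(page, wbc);	/* no transaction needed */
	}

This keeps the transaction -> page lock ordering consistent: only the deferred worker ever takes both, and always in that order.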
Re: [RFC] fsblock
On Mon, Jun 25, 2007 at 04:58:48PM +1000, Nick Piggin wrote: Using buffer heads instead allows the FS to send file data down inside the transaction code, without taking the page lock. So, locking wrt data=ordered is definitely going to be tricky. The best long term option may be making the locking order transaction -> page lock, and change writepage to punt to some other queue when it needs to start a transaction.

Yeah, that's what I would like, and I think it would come naturally if we move away from these pass down a single, locked page APIs in the VM, and let the filesystem do the locking and potentially batching of larger ranges.

Definitely.

write_begin/write_end is a step in that direction (and it helps OCFS and GFS quite a bit). I think there is also not much reason for writepage sites to require the caller to lock the page and clear the dirty bit themselves (which has always seemed ugly to me).

If we keep the page mapping information with the page all the time (ie writepage doesn't have to call get_block ever), it may be possible to avoid sending down a locked page. But, I don't know the delayed allocation internals well enough to say for sure if that is true. Either way, writepage is the easiest of the bunch because it can be deferred. -chris
Re: [RFC] fsblock
David Chinner wrote: On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote: I'm announcing fsblock now because it is quite intrusive and so I'd like to get some thoughts about significantly changing this core part of the kernel.

Can you rename it to something other than shorthand for filesystem block? e.g. When you say "- In line with the above item, filesystem block allocation is performed", what are we actually talking about here? filesystem block allocation is something a filesystem does to allocate blocks on disk, not allocate a mapping structure in memory. Realistically, this is not about filesystem blocks, this is about file offset to disk blocks. i.e. it's a mapping.

Yeah, fsblock ~= the layer between the fs and the block layers. But don't take the name too literally, like a struct page isn't actually a page of memory ;)

Probably better would be to move towards offset,length rather than page based fs APIs where everything can be batched up nicely and this sort of non-trivial locking can be more optimal.

If we are going to turn over the API completely like this, can we seriously look at moving to this sort of interface at the same time?

Yeah we can move to anything. But note that fsblock is perfectly happy with <= PAGE_CACHE_SIZE blocks today, and isn't _terrible_ at >.

With an offset/len interface, we can start to track contiguous ranges of blocks rather than persisting with a structure per filesystem block. If you want to save memory, that's where we need to go. XFS uses iomaps for this purpose - it's basically:
- start offset into file
- start block on disk
- length of mapping
- state
with special disk blocks for indicating delayed allocation blocks (-1) and unwritten extents (-2). Worst case we end up with is an iomap per filesystem block.

I was thinking about doing an extent based scheme, but it has some issues as well. Block based is light weight and simple, it aligns nicely with the pagecache structures.

If we allow iomaps to be split and combined along with range locking, we can parallelise read and write access to each file on an iomap basis, etc. There's plenty of goodness that comes from indexing by range...

Some operations AFAIKS will always need to be per-page (eg. in the core VM it wants to lock a single page to fault it in, or wait for a single page to writeout etc). So I didn't see a huge gain in a one-lock-per-extent type arrangement. If you're worried about parallelisability, then I don't see what iomaps give you that buffer heads or fsblocks do not? In fact they would be worse because there are fewer of them? :) But remember that once the filesystems have accessor APIs and can handle multiple pages per fsblock, that would already be most of the work done for the fs and the mm to go to an extent based representation.

FWIW, I really see little point in making all the filesystems work with fsblocks if the plan is to change the API again in a major way a year down the track. Let's get all the changes we think are necessary in one basket first, and then work out a coherent plan to implement them ;)

The aops API changes and the fsblock layer are kind of two separate things. I'm slowly implementing things as I go (eg. see perform_write aop, which is exactly the offset,length based API that I'm talking about). fsblocks can be implemented on the old or the new APIs. New APIs won't invalidate work to convert a filesystem to fsblocks.

- Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We can go up to about 32MB blocks now, and gigabyte+ blocks would only require one more bit in the fsblock flags. fsblock_superpage blocks are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

My 2c worth - this is a damn complex way of introducing large block size support. It has all the problems I pointed out that it would have (locking issues, vmap overhead, every filesystem needs major changes and it's not very efficient) and it's going to take quite some time to stabilise.

What locking issues? It locks pages in pagecache offset ascending order, which already has precedent and is really the only sane way to do it, so it's not like it precludes other possible sane lock orderings. vmap overhead is an issue, however I did it mainly for ease of conversion. I guess things like superblocks and such would make use of it happily. Most other things should be able to be implemented with page based helpers (just a couple of bitops helpers would pretty much cover minix). If it is still a problem, then I can implement a proper vmap cache. But the major changes in the filesystem are not for vmaps, but for page accessors. As I said, this allows blkdev to move out of lowmem and also closes CPU cache coherency problems (as well as not having to carry around a vmem pointer of course). If this is the only real feature that fsblocks are going to
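To illustrate the page-based accessor idea discussed above — fall back to a cheap per-page mapping whenever an access does not cross a page boundary, reserving vmap for straddling accesses — here is a hedged sketch. The struct fsblock layout, fsb_page_for_offset() and fsblock_vmap() are invented for the example; kmap_atomic() with the 2.6-era KM_USER0 slot is the real interface:

	#include <linux/highmem.h>
	#include <linux/pagemap.h>

	struct fsblock;				/* opaque here */

	/* invented helpers, for illustration only */
	extern struct page *fsb_page_for_offset(struct fsblock *fsb, loff_t off);
	extern void *fsblock_vmap(struct fsblock *fsb);

	/*
	 * Map @len bytes at @off within the block. The caller must use the
	 * matching unmap (kunmap_atomic or vunmap) -- omitted to keep the
	 * sketch short.
	 */
	static void *fsblock_map(struct fsblock *fsb, loff_t off, size_t len)
	{
		if ((off & ~PAGE_CACHE_MASK) + len <= PAGE_CACHE_SIZE) {
			/* cheap path: the access is contained in one page */
			struct page *page = fsb_page_for_offset(fsb, off);

			return (char *)kmap_atomic(page, KM_USER0)
					+ (off & ~PAGE_CACHE_MASK);
		}
		/* slow path: the access crosses a page, map the whole block */
		return (char *)fsblock_vmap(fsb) + off;
	}

A bitmap-scanning or bit-setting helper built on top of this would give a filesystem like minix multi-page block access without ever paying for vmap in the common case.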
Re: [RFC] fsblock
Nick Piggin [EMAIL PROTECTED] writes: - Structure packing. A page gets a number of buffer heads that are allocated in a linked list. fsblocks are allocated contiguously, so cacheline footprint is smaller in the above situation.

It would be interesting to test if that makes a difference for database benchmarks running over file systems. Databases eat a lot of cache so in theory any cache improvements in the kernel which often runs cache cold then should be beneficial. But I guess it would need at least ext2 to test; Minix is probably not good enough. In general have you benchmarked the CPU overhead of old vs new code? e.g. when we went to BIO scalability went up, but CPU costs of a single request also went up. It would be nice to not continue or better reverse that trend.

- Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We can go up to about 32MB blocks now, and gigabyte+ blocks would only require one more bit in the fsblock flags. fsblock_superpage blocks are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

Can it be cleanly ifdefed or optimized away? Unless the fragmentation problem is not solved it would seem rather pointless to me. Also I personally still think the right way to approach this is larger softpage size. -Andi
[RFC] fsblock
I'm announcing fsblock now because it is quite intrusive and so I'd like to get some thoughts about significantly changing this core part of the kernel. fsblock is a rewrite of the buffer layer (ding dong the witch is dead), which I have been working on, on and off, and is now at the stage where some of the basics are working-ish. This email is going to be long...

Firstly, what is the buffer layer? The buffer layer isn't really a buffer layer as in the buffer cache of unix: the block device cache is unified with the pagecache (in terms of the pagecache, a blkdev file is just like any other, but with a 1:1 mapping between offset and block). There are filesystem APIs to access the block device, but these go through the block device pagecache as well. These don't exactly define the buffer layer either. The buffer layer is a layer between the pagecache and the block device for block based filesystems. It keeps a translation between logical offset and physical block number, as well as meta information such as locks, dirtiness, and IO status of each block. This information is tracked via the buffer_head structure.

Why rewrite the buffer layer? Lots of people have had a desire to completely rip out the buffer layer, but we can't do that[*] because it does actually serve a useful purpose. Why the bad rap? Because the code is old and crufty, and buffer_head is an awful name. It must be among the oldest code in the core fs/vm, and the main reason is because of the inertia of so many and such complex filesystems.

[*] About the furthest we could go is use the struct page for the information otherwise stored in the buffer_head, but this would be tricky and suboptimal for filesystems with non page sized blocks and would probably bloat the struct page as well.

So why rewrite rather than incremental improvements? Incremental improvements are logically the correct way to do this, and we probably could go from buffer.c to fsblock.c in steps. But I didn't do this because: a) the blinding pace at which things move in this area would make me an old man before it would be complete; b) I didn't actually know exactly what it was going to look like before starting on it; c) I wanted stable root filesystems and such when testing it; and d) I found it reasonably easy to have both layers coexist (it uses an extra page flag, but even that wouldn't be needed if the old buffer layer was better decoupled from the page cache).

I started this as an exercise to see how the buffer layer could be improved, and I think it is working out OK so far. The name is fsblock because it basically ties the fs layer to the block layer. I think Andrew has wanted to rename buffer_head to block before, but block is too clashy, and it isn't a great deal more descriptive than buffer_head. I believe fsblock is.

I'll go through a list of things where I have hopefully improved on the buffer layer, off the top of my head. The big caveat here is that minix is the only real filesystem I have converted so far, and complex journalled filesystems might pose some problems that water down its goodness (I don't know).

- data structure size. struct fsblock is 20 bytes on 32-bit, and 40 on 64-bit (could easily be 32 if we can have int bitops). Compare this to around 50 and 100ish for struct buffer_head. With a 4K page and 1K blocks, IO requires 10% RAM overhead in buffer heads alone. With fsblocks you're down to around 3%.

- Structure packing. A page gets a number of buffer heads that are allocated in a linked list.
fsblocks are allocated contiguously, so cacheline footprint is smaller in the above situation.

- Data / metadata separation. I have a struct fsblock and a struct fsblock_meta, so we could put more stuff into the usually less used fsblock_meta without bloating it up too much. After a few tricks, these are no longer any different in my code, and dirty up the typing quite a lot (and I'm aware it still has some warnings, thanks). So if not useful this could be taken out.

- Locking. fsblocks completely use the pagecache for locking and lookups. The page lock is used, but there is no extra per-inode lock that buffer has. Would go very nicely with lockless pagecache. RCU is used for one non-blocking fsblock lookup (find_get_block), but I'd really rather hope filesystems can tolerate that blocking, and get rid of RCU completely. (Actually this is not quite true because mapping->private_lock is still used for the mark_buffer_dirty_inode equivalent, but that's a relatively rare operation.)

- Coupling with pagecache metadata. Pagecache pages contain some metadata that is logically redundant because it is tracked in buffers as well (eg. a page is dirty if one or more buffers are dirty, or uptodate if all buffers are uptodate). This is great because it means we can avoid that layer in some situations, but they can get out of sync. eg. if a filesystem writes a buffer out by hand, its pagecache page will stay dirty,
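The posting doesn't show struct fsblock's fields, but as a back-of-the-envelope illustration, a layout like the following would be roughly consistent with the quoted sizes (20 bytes on 32-bit with a 4-byte sector_t, 40 on 64-bit). Every field here is a guess, not the actual patch:

	#include <linux/types.h>

	struct fsblock {			/* hypothetical layout */
		unsigned long	flags;		/* lock, dirty, uptodate, ... (long bitops) */
		unsigned int	count;		/* usage count */
		sector_t	block_nr;	/* logical -> physical translation */
		struct page	*page;		/* backing pagecache page */
		void		*private;	/* fs private data (journalling etc.) */
	};

	/*
	 * Overhead per 4K page of 1K blocks, per the numbers quoted above:
	 *   buffer_head: 4 x ~100 bytes ~= 400/4096, about 10% of RAM under IO
	 *   fsblock:     4 x   40 bytes  = 160/4096, about 4% (nearer 3% at 32 bytes)
	 */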
Re: [RFC] fsblock
Just to clarify a few things. Don't you hate rereading a long work you wrote? (oh, you're supposed to do that *before* you press send?).

On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote: I'm announcing fsblock now because it is quite intrusive and so I'd like to get some thoughts about significantly changing this core part of the kernel. fsblock is a rewrite of the buffer layer (ding dong the witch is dead), which I have been working on, on and off and is now at the stage where some of the basics are working-ish. This email is going to be long... Firstly, what is the buffer layer? The buffer layer isn't really a buffer layer as in the buffer cache of unix: the block device cache is unified with the pagecache (in terms of the pagecache, a blkdev file is just like any other, but with a 1:1 mapping between offset and block).

I mean, in Linux, the block device cache is unified. UNIX I believe did all its caching in a buffer cache, below the filesystem.

- Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We

Oh, and I don't have a Linux mkfs that makes minixv3 filesystems. I had an image kindly made for me because I don't use minix. If you want to test large block support, I won't email it to you though: you can just convert ext2 or ext3 to fsblock ;)
Re: [RFC] fsblock
Nick Piggin wrote: - No deadlocks (hopefully). The buffer layer is technically deadlocky by design, because it can require memory allocations at page writeout-time. It also has one path that cannot tolerate memory allocation failures. No such problems for fsblock, which keeps fsblock metadata around for as long as a page is dirty (this still has problems vs get_user_pages, but that's going to require an audit of all get_user_pages sites. Phew).

- In line with the above item, filesystem block allocation is performed before a page is dirtied. In the buffer layer, mmap writes can dirty a page with no backing blocks which is a problem if the filesystem is ENOSPC (patches exist for buffer.c for this).

This raises an eyebrow... The handling of ENOSPC prior to mmap write is more an ABI behavior, so I don't see how this can be fixed with internal changes, yet without changing behavior currently exported to userland (and thus affecting code based on such assumptions).

- An inode's metadata must be tracked per-inode in order for fsync to work correctly. buffer contains helpers to do this for basic filesystems, but any block can be only the metadata for a single inode. This is not really correct for things like inode descriptor blocks. fsblock can track multiple inodes per block. (This is non trivial, and it may be overkill so it could be reverted to a simpler scheme like buffer).

hrm; no specific comment but this seems like an idea/area that needs to be fleshed out more, by converting some of the more advanced filesystems.

- Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We can go up to about 32MB blocks now, and gigabyte+ blocks would only require one more bit in the fsblock flags. fsblock_superpage blocks are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

definitely useful, especially if I rewrite my ibu filesystem for 2.6.x, like I've been planning.

So. Comments? Is this something we want? If yes, then how would we transition from buffer.c to fsblock.c?

Your work is definitely interesting, but I think it will be even more interesting once ext2 (w/ dir in pagecache) and ext3 (journalling) are converted. My gut feeling is that there are several problem areas you haven't hit yet, with the new code. Also, once things are converted, the question of transitioning from buffer.c will undoubtedly answer itself. That's the way several of us handle transitions: finish all the work, then look with fresh eyes and conceive a path from the current code to your enhanced code. Jeff
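For reference, the mechanism Nick points to in his reply below is the ->page_mkwrite hook (present since around 2.6.18), which lets a filesystem allocate or reserve blocks when an mmap'd page first goes writable, turning ENOSPC into SIGBUS before the page is ever dirtied. A minimal sketch, with the myfs_* helper invented for illustration:

	#include <linux/fs.h>
	#include <linux/mm.h>

	/* invented helper: allocate/reserve backing blocks for @page */
	extern int myfs_reserve_blocks_for_page(struct inode *inode, struct page *page);

	static int myfs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
	{
		struct inode *inode = vma->vm_file->f_mapping->host;
		int err;

		/* Allocate now, while the fault can still fail cleanly. */
		err = myfs_reserve_blocks_for_page(inode, page);

		/* A non-zero return (e.g. -ENOSPC) becomes SIGBUS for the writer. */
		return err;
	}

	static struct vm_operations_struct myfs_vm_ops = {
		.nopage		= filemap_nopage,	/* pre-2.6.23 fault path */
		.page_mkwrite	= myfs_page_mkwrite,
	};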
Re: [RFC] fsblock
On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote: Nick Piggin wrote: - No deadlocks (hopefully). The buffer layer is technically deadlocky by design, because it can require memory allocations at page writeout-time. It also has one path that cannot tolerate memory allocation failures. No such problems for fsblock, which keeps fsblock metadata around for as long as a page is dirty (this still has problems vs get_user_pages, but that's going to require an audit of all get_user_pages sites. Phew).

- In line with the above item, filesystem block allocation is performed before a page is dirtied. In the buffer layer, mmap writes can dirty a page with no backing blocks which is a problem if the filesystem is ENOSPC (patches exist for buffer.c for this).

This raises an eyebrow... The handling of ENOSPC prior to mmap write is more an ABI behavior, so I don't see how this can be fixed with internal changes, yet without changing behavior currently exported to userland (and thus affecting code based on such assumptions).

I believe people are happy to have it SIGBUS (which is how the VM is already set up with page_mkwrite, and what fsblock does).

- An inode's metadata must be tracked per-inode in order for fsync to work correctly. buffer contains helpers to do this for basic filesystems, but any block can be only the metadata for a single inode. This is not really correct for things like inode descriptor blocks. fsblock can track multiple inodes per block. (This is non trivial, and it may be overkill so it could be reverted to a simpler scheme like buffer).

hrm; no specific comment but this seems like an idea/area that needs to be fleshed out more, by converting some of the more advanced filesystems.

Yep. It's conceptually fairly simple though, and it might be easier than having filesystems implement their own complex syncing that finds and syncs everything themselves.

- Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We can go up to about 32MB blocks now, and gigabyte+ blocks would only require one more bit in the fsblock flags. fsblock_superpage blocks are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

definitely useful, especially if I rewrite my ibu filesystem for 2.6.x, like I've been planning.

Yeah, it wasn't the primary motivation for the rewrite, but it would be negligent to not even consider large blocks in such a rewrite, I think.

So. Comments? Is this something we want? If yes, then how would we transition from buffer.c to fsblock.c?

Your work is definitely interesting, but I think it will be even more interesting once ext2 (w/ dir in pagecache) and ext3 (journalling) are converted.

Well minix has dir in pagecache ;) But you're completely right: ext2 will be the next step and then ext3 and things like XFS and NTFS will be the real test. I think I could eventually get ext2 done (one of the biggest headaches is simply just converting ->b_data accesses), however unlikely a journalling one.

My gut feeling is that there are several problem areas you haven't hit yet, with the new code.

I would agree with your gut :)

Also, once things are converted, the question of transitioning from buffer.c will undoubtedly answer itself. That's the way several of us handle transitions: finish all the work, then look with fresh eyes and conceive a path from the current code to your enhanced code.

Yeah that would be nice. It's very difficult because of so much filesystem code. I'd say it would be feasible to step buffer.c into fsblock.c, however if we were to track all (or even the common) filesystems along with that it would introduce a huge number of kind-of-redundant changes that I don't think all fs maintainers would have time to write (and as I said, I can't do it myself). Anyway, let's cross that bridge if and when we come to it. For now, the big thing that needs to be done is convert a big fs and see if the results tell us that it's workable. Thanks for the comments Jeff.
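The "multiple inodes per block" tracking discussed above isn't shown anywhere in this thread, so the following is only one conceivable shape for it: a small link object joining a metadata block to each owning inode, so fsync can walk just that inode's dirty metadata. All names here are invented:

	#include <linux/fs.h>
	#include <linux/kernel.h>
	#include <linux/list.h>

	struct fsblock;				/* opaque here */

	struct fsblock_inode_link {
		struct list_head block_list;	/* on the block's list of owners */
		struct list_head inode_list;	/* on the inode's dirty-meta list */
		struct fsblock	 *block;
	};

	struct myfs_inode_info {
		struct list_head dirty_meta;	/* of fsblock_inode_link.inode_list */
		struct inode	 vfs_inode;
	};

	static inline struct myfs_inode_info *MYFS_I(struct inode *inode)
	{
		return container_of(inode, struct myfs_inode_info, vfs_inode);
	}

	/* invented helper: write one metadata block and wait on it */
	extern int myfs_sync_block(struct fsblock *block);

	/* fsync path: sync only the metadata blocks this inode dirtied */
	static int myfs_sync_inode_meta(struct inode *inode)
	{
		struct fsblock_inode_link *link;
		int err = 0;

		list_for_each_entry(link, &MYFS_I(inode)->dirty_meta, inode_list)
			err = err ? err : myfs_sync_block(link->block);
		return err;
	}

A block shared by several inodes (an inode descriptor block, say) would simply carry one link per owner, which is what would make this heavier than buffer.c's one-inode-per-block helpers.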
Re: [RFC] fsblock
On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote: fsblock is a rewrite of the buffer layer (ding dong the witch is dead), which I have been working on, on and off and is now at the stage where some of the basics are working-ish. This email is going to be long...

Long overdue. Thank you. -- wli