Re: [PATCH 1/5] fallocate() implementation in i386, x86_64 and powerpc
On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> I have the updated patches ready which take care of Andrew's comments.
> Will run some tests and post them soon.
>
> But, before submitting these patches, I think it will be better to finalize
> on certain things which might be worth some discussion here:
>
> 1) Should the file size change when preallocation is done beyond EOF ?
>    - Andreas and Chris Wedgwood are in favor of not changing the
>      file size in this case. I also tend to agree with them. Does anyone
>      have an argument in favor of changing the filesize ?
>      If not, I will remove the code which changes the filesize, before I
>      resubmit the concerned ext4 patch.

I think there needs to be both. If we don't have a mechanism to atomically
change the file size with the preallocation, then applications that use
stat() to work out if they need to preallocate more space will end up
racing.

> 2) For FA_UNALLOCATE mode, should the file system allow unallocation
>    of normal (non-preallocated) blocks (blocks allocated via
>    regular write/truncate operations) also (i.e. work as punch()) ?

Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
what I did for FA_UNALLOCATE as well.

>    - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
>      we need to finalize on the convention here as a general guideline
>      to all the filesystems that implement fallocate.
>
> 3) If above is true, the file size will need to be changed
>    for "unallocation" when the block holding the EOF gets unallocated.

No - we punch a hole. If you want the file size to change, then you use
ftruncate() to remove the blocks at EOF and change the file size
atomically.

> 4) Should we update mtime & ctime on a successful allocation/
>    unallocation ?
>    - David Chinner raised this question in the following post:
>      http://lkml.org/lkml/2007/4/29/407
>      I think it makes sense to update the [mc]time for a successful
>      preallocation/unallocation. Does anyone feel otherwise ?
>
> It will be interesting to know how XFS behaves currently. Does XFS
> update [mc]time for preallocation ?

No, XFS does *not* update a/m/ctime on prealloc/punch unless the file
size changes. If the file size changes, it behaves exactly the same way
that ftruncate() behaves.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
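The distinction Dave draws here - file size versus allocated blocks - is visible from userspace. A small illustrative sketch (Python; not part of the thread, and the exact st_blocks value depends on the filesystem):

```python
import os
import tempfile

# Create a sparse file: extend the size without allocating data blocks.
fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, 1 << 20)       # st_size becomes 1 MiB
    st = os.stat(path)
    print(st.st_size)               # 1048576
    print(st.st_blocks * 512)       # typically 0 on a sparse-capable fs
finally:
    os.close(fd)
    os.remove(path)
```

Because the two quantities are independent, punching a hole need not touch i_size, and ftruncate() remains the tool for when the size should change.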
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Wed, 9 May 2007 14:51:41 -0500, Matt Mackall wrote:
> On Wed, May 09, 2007 at 11:59:23AM -0700, Valerie Henson wrote:
> >
> > Hrm. Can you help me understand how you would check i_size then?
>
> That's pretty straightforward, I think. When we check an inode, we
> have to check whether it has a block that corresponds with i_size, and
> none beyond that.

i_size is indeed simple, but that got me thinking. i_blocks is a much
harder problem. Luckily it is almost the same problem as the free/used
block count for the filesystem. And again the solution would be to have
a tree structure and have a sub-total for each node in the tree.

Now, inodes already have a tree structure, the indirect blocks. So
indirect blocks would need to get an extra field somewhere to store how
many used blocks are below them. Only problem is: indirect blocks have a
nice power-of-two size and no spare space around.

> That begs the question of when we check various pieces of data. It
> seems the best time to check the various elements of an inode is when
> we're checking the tile it lives on. This is when we'd check i_size,
> that link counts made sense and that the ring of hardlinks was
> correct, etc.

Yup. Checking i_size costs O(log(n)), i_count with the above method is
O(log(n)) as well. The hardlink ring is O(number of links). For most
people that don't have a forest of hard-linked kernel trees around,
that should be fairly small as well.

I believe for large files it is important not to check the complete
file. We can divide&conquer the physical device, so we can do the same
with files. Although I wonder if that would require a dirty bit for
inodes as well.

> We will, unfortunately, need to be able to check an entire directory
> at once. There's no other efficient way to assure that there are no
> duplicate names in a directory, for instance.

There is. As long as directories are in htree or similar format, that
is. Problem is the same as fast lookup.
Jörn

--
tglx1 thinks that joern should get a (TM) for "Thinking Is Hard"
-- Thomas Gleixner
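Jörn's per-node sub-total idea can be modelled in a few lines (a Python sketch, not ext2/3 code; all names are made up). Each interior node caches the used-block count of its subtree, so verifying any one node against its children is O(fan-out) instead of a full walk:

```python
class Node:
    """Models an indirect block carrying Joern's extra sub-total field."""
    def __init__(self, children=None, used=0):
        self.children = children or []     # interior node if non-empty
        self.leaf_used = used              # data blocks owned directly
        self.subtotal = self._recount()    # the cached on-disk field

    def _recount(self):
        if self.children:
            return sum(c.subtotal for c in self.children)
        return self.leaf_used

    def check(self):
        # Local check: does the cached field match the children's caches?
        return self.subtotal == self._recount()

leaves = [Node(used=4), Node(used=3)]
root = Node(children=[Node(children=leaves), Node(used=5)])
assert root.subtotal == 12 and root.check()

leaves[0].subtotal = 99                    # simulate corruption
assert not root.children[0].check()        # caught one level up, locally
```

Note that root.check() still passes after the corruption - each node only vouches for the consistency of its own cached field against its children, which is exactly what makes the scheme incremental.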
Re: [PATCH 3/3] AFS: Implement basic file write support
David Howells wrote:

+/*
+ * prepare a page for being written to
+ */
+static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
+			    struct key *key, unsigned offset, unsigned to)
+{
+	unsigned eof, tail, start, stop, len;
+	loff_t i_size, pos;
+	void *p;
+	int ret;
+
+	_enter("");
+
+	if (offset == 0 && to == PAGE_SIZE)
+		return 0;
+
+	p = kmap(page);
+
+	i_size = i_size_read(&vnode->vfs_inode);
+	pos = (loff_t) page->index << PAGE_SHIFT;
+	if (pos >= i_size) {
+		/* partial write, page beyond EOF */
+		_debug("beyond");
+		if (offset > 0)
+			memset(p, 0, offset);
+		if (to < PAGE_SIZE)
+			memset(p + to, 0, PAGE_SIZE - to);
+		kunmap(page);
+		return 0;
+	}
+
+	if (i_size - pos >= PAGE_SIZE) {
+		/* partial write, page entirely before EOF */
+		_debug("before");
+		tail = eof = PAGE_SIZE;
+	} else {
+		/* partial write, page overlaps EOF */
+		eof = i_size - pos;
+		_debug("overlap %u", eof);
+		tail = max(eof, to);
+		if (tail < PAGE_SIZE)
+			memset(p + tail, 0, PAGE_SIZE - tail);
+		if (offset > eof)
+			memset(p + eof, 0, PAGE_SIZE - eof);
+	}
+
+	kunmap(p);
+
+	ret = 0;
+	if (offset > 0 || eof > to) {
+		/* need to fill one or two bits that aren't going to be written
+		 * (cover both fillers in one read if there are two) */
+		start = (offset > 0) ? 0 : to;
+		stop = (eof > to) ? eof : offset;
+		len = stop - start;
+		_debug("wr=%u-%u av=0-%u [EMAIL PROTECTED]",
+		       offset, to, eof, start, len);
+		ret = afs_fill_page(vnode, key, start, len, page);
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * prepare to perform part of a write to a page
+ * - the caller holds the page locked, preventing it from being written out or
+ *   modified by anyone else
+ */
+int afs_prepare_write(struct file *file, struct page *page,
+		      unsigned offset, unsigned to)
+{
+	struct afs_writeback *candidate, *wb;
+	struct afs_vnode *vnode = AFS_FS_I(file->f_dentry->d_inode);
+	struct key *key = file->private_data;
+	pgoff_t index;
+	int ret;
+
+	_enter("{%x:%u},{%lx},%u,%u",
+	       vnode->fid.vid, vnode->fid.vnode, page->index, offset, to);
+
+	candidate = kzalloc(sizeof(*candidate), GFP_KERNEL);
+	if (!candidate)
+		return -ENOMEM;
+	candidate->vnode = vnode;
+	candidate->first = candidate->last = page->index;
+	candidate->offset_first = offset;
+	candidate->to_last = to;
+	candidate->usage = 1;
+	candidate->state = AFS_WBACK_PENDING;
+	init_waitqueue_head(&candidate->waitq);
+
+	if (!PageUptodate(page)) {
+		_debug("not up to date");
+		ret = afs_prepare_page(vnode, page, key, offset, to);
+		if (ret < 0) {
+			kfree(candidate);
+			_leave(" = %d [prep]", ret);
+			return ret;
+		}
+		SetPageUptodate(page);
+	}

Why do you call SetPageUptodate when the page is not up to date? That
leaks uninitialised data, AFAIKS.

--
SUSE Labs, Novell Inc.
Re: [PATCH] AF_RXRPC: Reduce debugging noise.
From: David Howells <[EMAIL PROTECTED]>
Date: Wed, 09 May 2007 14:51:47 +0100

> Reduce debugging noise generated by AF_RXRPC.
>
> Signed-off-by: David Howells <[EMAIL PROTECTED]>

Applied, thanks David.
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Wed, May 09, 2007 at 12:01:13PM -0700, Valerie Henson wrote:
> On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> > On Sun, Apr 29, 2007 at 07:23:49PM -0400, Theodore Tso wrote:
> > > There are a number of filesystem corruptions this algorithm won't
> > > catch. The most obvious is one where the directory tree isn't really
> > > a tree, but a cyclic graph. What if you have something like this:
> > >
> > >     A <--+
> > >    / \   |
> > >   B   C  ^
> > >  /       |
> > > D -------+
> > >
> > > That is, what if D's parent is B, and B's parent is A, and A's parent
> > > is... D? Assume for the sake of argument that each inode, A, B, C, D,
> > > are in separate tiles.
> >
> > From the original message:
> >
> >   Inodes have a backpointer to a directory that links them. Hardlinked
> >   files have two extra inode pointers in the directory structure, to
> >   the previous and next directories containing the link. Hardlinked
> >   inodes have a checksum of the members of that list.
> >
> > When we check directory D, D's inode has a backpointer (which had
> > better match ".." in the directory itself if we keep that
> > redundancy!). If we can follow this back to root (using a standard
> > two-pointer cycle detection algorithm), we have no cycle. As we also
> > check that every inode pointed to by directory D also points back to
> > D, any deviation from a valid tree can be detected.
> >
> > And again, a small cache of inodes known to be properly rooted will
> > save a lot of checks.
>
> I really, really like this idea. I wonder how hard it would be to
> prototype on something like ext3. Any bored grad students listening?

It should be pretty straightforward as mods to filesystems go. Most of
the work is in shimming the tile read/write layer under the rest of the
FS with helper functions and teaching the file I/O code to pass down the
reverse pointers to the tile layer.

Truncate/rm will also have to null out reverse references on tiles, but
that can be done at the same time bits are fixed up in the block
bitmaps. Fixing up directories to record the hardlink ring should be
easier.

--
Mathematics is the supreme nostalgia of our time.
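For the bored grad students: the backpointer walk with two-pointer (Floyd) cycle detection quoted above fits in a dozen lines. A sketch (Python, not filesystem code; 'parent' stands in for the on-disk backpointer, with the root mapping to None):

```python
def properly_rooted(parent, d):
    """Return True if following backpointers from d reaches the root.

    parent: dict mapping each directory to its parent (root maps to None).
    Floyd's algorithm: advance 'fast' two steps per 'slow' step; if the
    two ever meet, the backpointers form a cycle instead of a path to root.
    """
    slow = fast = d
    while fast is not None and parent[fast] is not None:
        slow = parent[slow]
        fast = parent[parent[fast]]
        if slow == fast:
            return False               # cycle: Ted's A->B->D->A case
    return True

tree = {'A': None, 'B': 'A', 'C': 'A', 'D': 'B'}
assert properly_rooted(tree, 'D')

cycle = {'A': 'D', 'B': 'A', 'C': 'A', 'D': 'B'}   # A's parent is... D
assert not properly_rooted(cycle, 'D')
```

The walk uses O(1) memory regardless of tree depth, which is what makes it usable during an incremental check; the cache of known-rooted inodes mentioned above would short-circuit the loop early in the common case.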
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Wed, May 09, 2007 at 11:59:23AM -0700, Valerie Henson wrote:
> On Wed, May 09, 2007 at 12:06:52PM -0500, Matt Mackall wrote:
> > On Wed, May 09, 2007 at 12:56:39AM -0700, Valerie Henson wrote:
> > > On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> > > >
> > > > This does mean that our time to make progress on a check is bounded
> > > > at the top by the size of our largest file. If we have a degenerate
> > > > filesystem filled with a single file, this will in fact take as long
> > > > as a conventional fsck. If your filesystem has, say, 100 roughly
> > > > equally-sized files, you're back in Chunkfs territory.
> > >
> > > Hm, I'm not sure that everyone understands a particular subtlety of
> > > how the fsck algorithm works in chunkfs. A lot of people seem to
> > > think that you need to check *all* cross-chunk links, every time an
> > > individual chunk is checked. That's not the case; you only need to
> > > check the links that go into and out of the dirty chunk. You also
> > > don't need to check the other parts of the file outside the chunk,
> > > except for perhaps reading the byte range info for each continuation
> > > node and making sure no two continuation inodes think they both have
> > > the same range, but you don't check the indirect blocks, block
> > > bitmaps, etc.
> >
> > My reference to chunkfs here is simply that the worst case is checking
> > ~1 chunk, which is about 1/100th of a volume.
>
> I understand that being the case if each file is only in one tile.
> Does the fpos make this irrelevant as well?

Fpos does make it irrelevant.

> > > > So we should have no trouble checking an exabyte-sized filesystem
> > > > on a 4MB box. Even if it has one exabyte-sized file! We check the
> > > > first tile, see that it points to our file, then iterate through
> > > > that file, checking that the forward and reverse pointers for each
> > > > block match and all CRCs match, etc. We cache the file's inode as
> > > > clean, finish checking anything else in the first tile, then mark
> > > > it clean. When we get to the next tile (and the next billion after
> > > > that!), we notice that each block points back to our cached inode
> > > > and skip rechecking it.
> > >
> > > If I understand correctly then, if you do have a one exabyte sized
> > > file, and any part of it is in a dirty tile, you will need to check
> > > the whole file? Or will Joern's fpos proposal fix this?
> >
> > Yes, the original idea is you have to check every file that "covers" a
> > tile in its entirety. With Joern's fpos piece, I think we can restrict
> > our checks to just the section of the file that covers the tile.
>
> Hrm. Can you help me understand how you would check i_size then?

That's pretty straightforward, I think. When we check an inode, we have
to check whether it has a block that corresponds with i_size, and none
beyond that.

That raises the question of when we check various pieces of data. It
seems the best time to check the various elements of an inode is when
we're checking the tile it lives on. This is when we'd check i_size,
that link counts made sense and that the ring of hardlinks was correct,
etc. We would also check that direct and indirect pointers were sensible
(ie pointing to data blocks on the disk). If so, we know we'll
eventually verify those pointers when we check the corresponding back
pointers from those blocks.

Directory checks are a bit more problematic. But I think we can trigger
a directory check each time we hit a tile data block that's part of a
directory. Keeping a small cache of checked directories will keep this
from being expensive.

We will, unfortunately, need to be able to check an entire directory at
once. There's no other efficient way to assure that there are no
duplicate names in a directory, for instance.

In summary, checking a tile requires trivial checks on all the inodes
and directories that point into a tile. Inodes, directories, and data
that are inside a tile get checked more thoroughly but still don't need
to do much pointer chasing.

--
Mathematics is the supreme nostalgia of our time.
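The i_size check Matt describes - a block at i_size and none beyond it - is local to the inode. A sketch of just that predicate (Python; 4 KiB blocks assumed, names invented):

```python
BLOCK = 4096

def i_size_ok(i_size, mapped):
    """mapped: the set of block indices the inode actually maps.

    Checks Matt's two conditions: the block containing the last byte
    exists (ignoring the case of a file that ends in a hole), and no
    block is mapped beyond it.
    """
    if i_size == 0:
        return not mapped
    last = (i_size - 1) // BLOCK
    return last in mapped and all(b <= last for b in mapped)

assert i_size_ok(0, set())
assert i_size_ok(8192, {0, 1})          # exactly two blocks
assert not i_size_ok(8192, {0, 1, 5})   # block past i_size: corrupt
assert not i_size_ok(12288, {0, 1})     # size claims a third block
```

Nothing here walks the whole file; with the fpos idea, 'mapped' would be just the slice of the mapping visible from the tile being checked.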
Re: [RFC] TileFS - a proposal for scalable integrity checking
Valerie Henson writes:

[...]

> You're right about needing to read the equivalent data structure - for
> other reasons, each continuation inode will need an easily accessible
> list of byte ranges covered by that inode. (Sounds like, hey,
> extents!) The important part is that you don't have to go walk all the

I see. I was under the impression that the idea was to use indirect
blocks themselves as that data structure, e.g., block number 0 to mark
holes, block number 1 to mark "block not in this continuation", and all
other block numbers for real blocks.

> indirect blocks or check your bitmap.
>
> -VAL

Nikita.
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> On Sun, Apr 29, 2007 at 07:23:49PM -0400, Theodore Tso wrote:
> > There are a number of filesystem corruptions this algorithm won't
> > catch. The most obvious is one where the directory tree isn't really
> > a tree, but a cyclic graph. What if you have something like this:
> >
> >     A <--+
> >    / \   |
> >   B   C  ^
> >  /       |
> > D -------+
> >
> > That is, what if D's parent is B, and B's parent is A, and A's parent
> > is... D? Assume for the sake of argument that each inode, A, B, C, D,
> > are in separate tiles.
>
> From the original message:
>
>   Inodes have a backpointer to a directory that links them. Hardlinked
>   files have two extra inode pointers in the directory structure, to
>   the previous and next directories containing the link. Hardlinked
>   inodes have a checksum of the members of that list.
>
> When we check directory D, D's inode has a backpointer (which had
> better match ".." in the directory itself if we keep that
> redundancy!). If we can follow this back to root (using a standard
> two-pointer cycle detection algorithm), we have no cycle. As we also
> check that every inode pointed to by directory D also points back to
> D, any deviation from a valid tree can be detected.
>
> And again, a small cache of inodes known to be properly rooted will
> save a lot of checks.

I really, really like this idea. I wonder how hard it would be to
prototype on something like ext3. Any bored grad students listening?

-VAL
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Wed, May 09, 2007 at 12:06:52PM -0500, Matt Mackall wrote:
> On Wed, May 09, 2007 at 12:56:39AM -0700, Valerie Henson wrote:
> > On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> > >
> > > This does mean that our time to make progress on a check is bounded
> > > at the top by the size of our largest file. If we have a degenerate
> > > filesystem filled with a single file, this will in fact take as long
> > > as a conventional fsck. If your filesystem has, say, 100 roughly
> > > equally-sized files, you're back in Chunkfs territory.
> >
> > Hm, I'm not sure that everyone understands a particular subtlety of
> > how the fsck algorithm works in chunkfs. A lot of people seem to
> > think that you need to check *all* cross-chunk links, every time an
> > individual chunk is checked. That's not the case; you only need to
> > check the links that go into and out of the dirty chunk. You also
> > don't need to check the other parts of the file outside the chunk,
> > except for perhaps reading the byte range info for each continuation
> > node and making sure no two continuation inodes think they both have
> > the same range, but you don't check the indirect blocks, block
> > bitmaps, etc.
>
> My reference to chunkfs here is simply that the worst case is checking
> ~1 chunk, which is about 1/100th of a volume.

I understand that being the case if each file is only in one tile.
Does the fpos make this irrelevant as well?

> > > So we should have no trouble checking an exabyte-sized filesystem
> > > on a 4MB box. Even if it has one exabyte-sized file! We check the
> > > first tile, see that it points to our file, then iterate through
> > > that file, checking that the forward and reverse pointers for each
> > > block match and all CRCs match, etc. We cache the file's inode as
> > > clean, finish checking anything else in the first tile, then mark
> > > it clean. When we get to the next tile (and the next billion after
> > > that!), we notice that each block points back to our cached inode
> > > and skip rechecking it.
> >
> > If I understand correctly then, if you do have a one exabyte sized
> > file, and any part of it is in a dirty tile, you will need to check
> > the whole file? Or will Joern's fpos proposal fix this?
>
> Yes, the original idea is you have to check every file that "covers" a
> tile in its entirety. With Joern's fpos piece, I think we can restrict
> our checks to just the section of the file that covers the tile.

Hrm. Can you help me understand how you would check i_size then?

-VAL
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Wed, May 09, 2007 at 03:16:41PM +0400, Nikita Danilov wrote:
>
> I guess I miss something. If chunkfs maintains the "at most one
> continuation per chunk" invariant, then a continuation inode might end
> up with multiple byte ranges, and to check that they do not overlap one
> has to read indirect blocks (or some equivalent data structure).

You're right about needing to read the equivalent data structure - for
other reasons, each continuation inode will need an easily accessible
list of byte ranges covered by that inode. (Sounds like, hey, extents!)
The important part is that you don't have to go walk all the indirect
blocks or check your bitmap.

-VAL
Re: [PATCH 1/5] fallocate() implementation in i386, x86_64 and powerpc
On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote:
> I have the updated patches ready which take care of Andrew's comments.
> Will run some tests and post them soon.
>
> But, before submitting these patches, I think it will be better to finalize
> on certain things which might be worth some discussion here:
>
> 1) Should the file size change when preallocation is done beyond EOF ?
>    - Andreas and Chris Wedgwood are in favor of not changing the
>      file size in this case. I also tend to agree with them. Does anyone
>      have an argument in favor of changing the filesize ?
>      If not, I will remove the code which changes the filesize, before I
>      resubmit the concerned ext4 patch.

If we choose not to update the file size beyond EOF, then for
filesystems without fallocate() support (ext2/3 currently),
posix_fallocate() will follow the hard way (zero-out) to do
preallocation, and we will get different behavior on filesystems with
and without fallocate() support. It makes sense to be consistent, IMO.

From my point of view, preallocation is just an efficient way of
allocating blocks for files without zeroing them out; other than this,
the new behavior should be consistent with the old way: file size
update, mtime/ctime, ENOSPC, etc.

Mingming
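Mingming's consistency argument is observable through the glibc wrapper today: posix_fallocate() leaves st_size covering offset+len whether the filesystem implements preallocation natively or glibc falls back to writing zeros. A quick userspace check (Python, not part of the thread):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    # Preallocate 64 KiB from offset 0; on success the file size must
    # cover the whole range, however the allocation was implemented.
    os.posix_fallocate(fd, 0, 64 * 1024)
    assert os.fstat(fd).st_size == 64 * 1024
finally:
    os.close(fd)
    os.remove(path)
```

The keep-size behavior being debated in this thread later became an explicit opt-in flag rather than the default, which preserved exactly this consistency for the plain-allocation case.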
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Wed, May 09, 2007 at 12:56:39AM -0700, Valerie Henson wrote:
> On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
> >
> > This does mean that our time to make progress on a check is bounded at
> > the top by the size of our largest file. If we have a degenerate
> > filesystem filled with a single file, this will in fact take as long
> > as a conventional fsck. If your filesystem has, say, 100 roughly
> > equally-sized files, you're back in Chunkfs territory.
>
> Hm, I'm not sure that everyone understands a particular subtlety of
> how the fsck algorithm works in chunkfs. A lot of people seem to
> think that you need to check *all* cross-chunk links, every time an
> individual chunk is checked. That's not the case; you only need to
> check the links that go into and out of the dirty chunk. You also
> don't need to check the other parts of the file outside the chunk,
> except for perhaps reading the byte range info for each continuation
> node and making sure no two continuation inodes think they both have
> the same range, but you don't check the indirect blocks, block
> bitmaps, etc.

My reference to chunkfs here is simply that the worst case is checking
~1 chunk, which is about 1/100th of a volume.

> > So we should have no trouble checking an exabyte-sized filesystem on a
> > 4MB box. Even if it has one exabyte-sized file! We check the first
> > tile, see that it points to our file, then iterate through that file,
> > checking that the forward and reverse pointers for each block match
> > and all CRCs match, etc. We cache the file's inode as clean, finish
> > checking anything else in the first tile, then mark it clean. When we
> > get to the next tile (and the next billion after that!), we notice
> > that each block points back to our cached inode and skip rechecking
> > it.
>
> If I understand correctly then, if you do have a one exabyte sized
> file, and any part of it is in a dirty tile, you will need to check
> the whole file? Or will Joern's fpos proposal fix this?

Yes, the original idea is you have to check every file that "covers" a
tile in its entirety. With Joern's fpos piece, I think we can restrict
our checks to just the section of the file that covers the tile.

--
Mathematics is the supreme nostalgia of our time.
Re: [PATCH 1/5] fallocate() implementation in i386, x86_64 and powerpc
On May 09, 2007  21:31 +0530, Amit K. Arora wrote:
> 2) For FA_UNALLOCATE mode, should the file system allow unallocation
>    of normal (non-preallocated) blocks (blocks allocated via
>    regular write/truncate operations) also (i.e. work as punch()) ?
>    - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
>      we need to finalize on the convention here as a general guideline
>      to all the filesystems that implement fallocate.

I would only allow this on FA_ALLOCATE extents. That means it won't be
possible to do this for filesystems that don't understand unwritten
extents unless there are blocks allocated beyond EOF.

> 3) If above is true, the file size will need to be changed
>    for "unallocation" when the block holding the EOF gets unallocated.
>    - If we do not "unallocate" normal (non-preallocated) blocks and we
>      do not change the file size on preallocation, then this is a
>      non-issue.

Not necessarily. That will just make the file sparse. If FA_ALLOCATE
does not change the file size, why should FA_UNALLOCATE?

> 4) Should we update mtime & ctime on a successful allocation/
>    unallocation ?

I would say yes. If glibc does the fallback fallocate via write() the
mtime/ctime will be updated, so it makes sense to be consistent for
both methods. Also, it just makes sense from the "this file was
modified" point of view.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [PATCH 3/3] AFS: Implement basic file write support
On Wed, 09 May 2007 12:07:39 +0100 David Howells <[EMAIL PROTECTED]> wrote:

> Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> > set_page_dirty() will set I_DIRTY_PAGES only. ie: the inode has dirty
> > pagecache data.
> >
> > To tell the VFS that the inode itself is dirty one needs to run
> > mark_inode_dirty().
>
> But what's the difference in this case? I don't need to write the inode
> back per se, and the inode attributes can be updated by the mechanism of
> data storage.

Ah. Well if you don't need to write the inode back then sure, there
shouldn't be a need to mark it dirty. That's what I was asking ;)
Re: [PATCH 1/5] fallocate() implementation in i386, x86_64 and powerpc
I have the updated patches ready which take care of Andrew's comments.
Will run some tests and post them soon.

But, before submitting these patches, I think it will be better to
finalize on certain things which might be worth some discussion here:

1) Should the file size change when preallocation is done beyond EOF ?
   - Andreas and Chris Wedgwood are in favor of not changing the
     file size in this case. I also tend to agree with them. Does anyone
     have an argument in favor of changing the filesize ?
     If not, I will remove the code which changes the filesize, before I
     resubmit the concerned ext4 patch.

2) For FA_UNALLOCATE mode, should the file system allow unallocation
   of normal (non-preallocated) blocks (blocks allocated via
   regular write/truncate operations) also (i.e. work as punch()) ?
   - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
     we need to finalize on the convention here as a general guideline
     to all the filesystems that implement fallocate.

3) If above is true, the file size will need to be changed
   for "unallocation" when the block holding the EOF gets unallocated.
   - If we do not "unallocate" normal (non-preallocated) blocks and we
     do not change the file size on preallocation, then this is a
     non-issue.

4) Should we update mtime & ctime on a successful allocation/
   unallocation ?
   - David Chinner raised this question in the following post:
     http://lkml.org/lkml/2007/4/29/407
     I think it makes sense to update the [mc]time for a successful
     preallocation/unallocation. Does anyone feel otherwise ?

It will be interesting to know how XFS behaves currently. Does XFS
update [mc]time for preallocation ?

--
Regards,
Amit Arora
[PATCH] AF_RXRPC: Reduce debugging noise.
Reduce debugging noise generated by AF_RXRPC.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 net/rxrpc/ar-peer.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rxrpc/ar-peer.c b/net/rxrpc/ar-peer.c
index ce08b78..90fa107 100644
--- a/net/rxrpc/ar-peer.c
+++ b/net/rxrpc/ar-peer.c
@@ -59,14 +59,14 @@ static void rxrpc_assess_MTU_size(struct rxrpc_peer *peer)

 	ret = ip_route_output_key(&rt, &fl);
 	if (ret < 0) {
-		kleave(" [route err %d]", ret);
+		_leave(" [route err %d]", ret);
 		return;
 	}

 	peer->if_mtu = dst_mtu(&rt->u.dst);
 	dst_release(&rt->u.dst);

-	kleave(" [if_mtu %u]", peer->if_mtu);
+	_leave(" [if_mtu %u]", peer->if_mtu);
 }

 /*
[PATCH] AFS: Further write support fixes
Further fixes for AFS write support:

 (1) The afs_send_pages() outer loop must do an extra iteration if it ends
     with 'first == last' because 'last' is inclusive in the page set,
     otherwise it fails to send the last page and complete the RxRPC op
     under some circumstances.

 (2) Similarly, the outer loop in afs_pages_written_back() must also do an
     extra iteration if it ends with 'first == last', otherwise it fails to
     clear PG_writeback on the last page under some circumstances.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/rxrpc.c |    2 +-
 fs/afs/write.c |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index 04189c4..1b36f45 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -294,7 +294,7 @@ int afs_send_pages(struct afs_call *call, struct msghdr *msg, struct kvec *iov)
 			put_page(pages[loop]);
 		if (ret < 0)
 			break;
-	} while (first < last);
+	} while (first <= last);

 	_leave(" = %d", ret);
 	return ret;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index aa03d43..67ae4db 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -669,7 +669,7 @@ void afs_pages_written_back(struct afs_vnode *vnode, struct afs_call *call)
 	pagevec_init(&pv, 0);

 	do {
-		_debug("attach %lx-%lx", first, last);
+		_debug("done %lx-%lx", first, last);

 		count = last - first + 1;
 		if (count > PAGEVEC_SIZE)
@@ -701,7 +701,7 @@ void afs_pages_written_back(struct afs_vnode *vnode, struct afs_call *call)
 		}
 		__pagevec_release(&pv);
-	} while (first < last);
+	} while (first <= last);

 	_leave("");
 }
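The off-by-one being fixed is easy to reproduce outside the kernel: 'first' and 'last' name an inclusive page range, and each pass consumes a batch then advances 'first', so the continuation test must be 'first <= last'. A toy model (Python, invented names, not the AFS code):

```python
def send_pages(first, last, batch, fixed=True):
    """Process the inclusive page range [first, last] in batches,
    modelling the do/while loop in afs_send_pages()."""
    sent = []
    while True:                        # do { ... }
        count = min(last - first + 1, batch)
        sent.extend(range(first, first + count))
        first += count
        cond = first <= last if fixed else first < last
        if not cond:                   # } while (cond);
            break
    return sent

# With 'first < last', the loop exits while page 'last' is still pending
# whenever a batch boundary leaves first == last:
assert send_pages(0, 8, 4, fixed=False) == list(range(8))   # page 8 lost
assert send_pages(0, 8, 4, fixed=True) == list(range(9))
```

Inclusive upper bounds trade one class of off-by-one (the empty-range case) for this one; the `<=` test is the price of keeping 'last' inclusive.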
Re: [PATCH 1/2] LogFS proper
On May 8 2007 20:17, Evgeniy Polyakov wrote:

>> > >> +static int __logfs_readdir(struct file *file, void *buf, filldir_t filldir)
>> > >> +{
>> > >> +		err = read_dir(dir, &dd, pos);
>> > >> +		if (err == -EOF)
>> > >> +			break;
>> > >
>> > > -EOF results in a return code 0 ?
>> >
>> > Results in a return code -256.
>>
>> Really ? It breaks out of the loop and returns 0 !

See, it's so confusing!

Jan
--
[PATCH] AFS: Write support fixes
AFS write support fixes:

 (1) Support large files using the 64-bit file access operations if
     available on the server.

 (2) Use kmap_atomic() rather than kmap() in afs_prepare_page().

 (3) Don't do stuff in afs_writepage() that's done by the caller.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/afs_fs.h   |    2 
 fs/afs/fsclient.c |  217 -
 fs/afs/write.c    |   14 +--
 3 files changed, 216 insertions(+), 17 deletions(-)

diff --git a/fs/afs/afs_fs.h b/fs/afs/afs_fs.h
index 2198006..d963ef4 100644
--- a/fs/afs/afs_fs.h
+++ b/fs/afs/afs_fs.h
@@ -31,6 +31,8 @@ enum AFS_FS_Operations {
 	FSGETVOLUMEINFO		= 148,	/* AFS Get root volume information */
 	FSGETROOTVOLUME		= 151,	/* AFS Get root volume name */
 	FSLOOKUP		= 161,	/* AFS lookup file in directory */
+	FSFETCHDATA64		= 65537, /* AFS Fetch file data */
+	FSSTOREDATA64		= 65538, /* AFS Store file data */
 };
 
 enum AFS_FS_Errors {
diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index a552699..8817076 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -293,9 +293,33 @@ static int afs_deliver_fs_fetch_data(struct afs_call *call,
 	case 0:
 		call->offset = 0;
 		call->unmarshall++;
+		if (call->operation_ID != FSFETCHDATA64) {
+			call->unmarshall++;
+			goto no_msw;
+		}
 
-		/* extract the returned data length */
+		/* extract the upper part of the returned data length of an
+		 * FSFETCHDATA64 op (which should always be 0 using this
+		 * client) */
 	case 1:
+		_debug("extract data length (MSW)");
+		ret = afs_extract_data(call, skb, last, &call->tmp, 4);
+		switch (ret) {
+		case 0:		break;
+		case -EAGAIN:	return 0;
+		default:	return ret;
+		}
+
+		call->count = ntohl(call->tmp);
+		_debug("DATA length MSW: %u", call->count);
+		if (call->count > 0)
+			return -EBADMSG;
+		call->offset = 0;
+		call->unmarshall++;
+
+	no_msw:
+		/* extract the returned data length */
+	case 2:
 		_debug("extract data length");
 		ret = afs_extract_data(call, skb, last, &call->tmp, 4);
 		switch (ret) {
@@ -312,7 +336,7 @@ static int afs_deliver_fs_fetch_data(struct afs_call *call,
 		call->unmarshall++;
 
 		/* extract the returned data */
-	case 2:
+	case 3:
 		_debug("extract data");
 		if (call->count > 0) {
 			page = call->reply3;
@@ -331,7 +355,7 @@ static int afs_deliver_fs_fetch_data(struct afs_call *call,
 		call->unmarshall++;
 
 		/* extract the metadata */
-	case 3:
+	case 4:
 		ret = afs_extract_data(call, skb, last, call->buffer,
 				       (21 + 3 + 6) * 4);
 		switch (ret) {
@@ -349,7 +373,7 @@ static int afs_deliver_fs_fetch_data(struct afs_call *call,
 		call->offset = 0;
 		call->unmarshall++;
 
-	case 4:
+	case 5:
 		_debug("trailer");
 		if (skb->len != 0)
 			return -EBADMSG;
@@ -381,6 +405,56 @@ static const struct afs_call_type afs_RXFSFetchData = {
 	.destructor	= afs_flat_call_destructor,
 };
 
+static const struct afs_call_type afs_RXFSFetchData64 = {
+	.name		= "FS.FetchData64",
+	.deliver	= afs_deliver_fs_fetch_data,
+	.abort_to_error	= afs_abort_to_error,
+	.destructor	= afs_flat_call_destructor,
+};
+
+/*
+ * fetch data from a very large file
+ */
+static int afs_fs_fetch_data64(struct afs_server *server,
+			       struct key *key,
+			       struct afs_vnode *vnode,
+			       off_t offset, size_t length,
+			       struct page *buffer,
+			       const struct afs_wait_mode *wait_mode)
+{
+	struct afs_call *call;
+	__be32 *bp;
+
+	_enter("");
+
+	ASSERTCMP(length, <, ULONG_MAX);
+
+	call = afs_alloc_flat_call(&afs_RXFSFetchData64, 32, (21 + 3 + 6) * 4);
+	if (!call)
+		return -ENOMEM;
+
+	call->key = key;
+	call->reply = vnode;
+	call->reply2 = NULL; /* volsync */
+	call->reply3 = buffer;
+	call->service_id = FS_SERVICE;
+	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = FSFETCHDATA64;
+
+	/* marshall the parameters */
+	bp = call->request;
+	bp[0] = htonl(FSFETCHDATA64);
+	bp[1] = htonl(vnode->fid.vid);
+	bp[2] = htonl(vnode->fid.vnod
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 09:37:22PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
> > > Of course the interface used by an application program would have the
> > > fd first.  Glibc can do the translation.
> >
> > I think that was understood.
>
> OK, then what does it matter what the glibc/kernel interface is, as
> long as it works?
>
> It's only a minor point; the order of arguments can vary between
> architectures if necessary, but it's nicer if they don't have to.
> 32-bit powerpc will need to have the two int arguments adjacent in
> order to avoid using more than 6 argument registers at the user/kernel
> boundary, and s390 will need to avoid having a 64-bit argument last
> (if I understand it correctly).

You are right. But it may not be _that_ minor a point, especially for the
architecture that is affected. It has other implications, like the ones
Heiko noticed in his post below:
http://lkml.org/lkml/2007/4/27/377
- implications like modifying glibc and *trace utilities for a particular
arch.

--
Regards,
Amit Arora
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On 5/9/07, Paul Mackerras <[EMAIL PROTECTED]> wrote:
> Suparna Bhattacharya writes:
> > > Of course the interface used by an application program would have the
> > > fd first.  Glibc can do the translation.
> >
> > I think that was understood.
>
> OK, then what does it matter what the glibc/kernel interface is, as
> long as it works?
>
> It's only a minor point; the order of arguments can vary between
> architectures if necessary, but it's nicer if they don't have to.
> 32-bit powerpc will need to have the two int arguments adjacent in
> order to avoid using more than 6 argument registers at the user/kernel
> boundary, and s390 will need to avoid having a 64-bit argument last
> (if I understand it correctly).

Ah, almost but not quite the point. But I admit it is hard to understand.

The trouble started with the futex call, which was the first system call
with 6 arguments. s390 supported only 5 arguments up to that point
(%r2 - %r6). For futex we added a wrapper to glibc that loaded the 6th
argument into %r7. In entry.S we set up things so that %r7 gets stored to
the kernel stack where normal C code expects the first overflow argument.
This enabled us to use the standard futex system call with 6 arguments.

fallocate now has an additional problem: the last argument is a 64-bit
integer AND registers %r2-%r5 are already used. In this case the 64-bit
number would have to be split into the high part in %r6 and the low part
on the stack, so that the glibc wrapper can load the low part into %r7.
But the C compiler will skip %r6 and store the 64-bit number on the stack.
If the order of the arguments is modified so that %r6 is assigned to a
32-bit argument, then the entry.S magic with %r7 works.

--
blue skies,
Martin
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Suparna Bhattacharya writes:
> > Of course the interface used by an application program would have the
> > fd first.  Glibc can do the translation.
>
> I think that was understood.

OK, then what does it matter what the glibc/kernel interface is, as
long as it works?

It's only a minor point; the order of arguments can vary between
architectures if necessary, but it's nicer if they don't have to.
32-bit powerpc will need to have the two int arguments adjacent in
order to avoid using more than 6 argument registers at the user/kernel
boundary, and s390 will need to avoid having a 64-bit argument last
(if I understand it correctly).

Paul.
Re: [RFC] TileFS - a proposal for scalable integrity checking
Valerie Henson writes:

[...]

> Hm, I'm not sure that everyone understands a particular subtlety of
> how the fsck algorithm works in chunkfs.  A lot of people seem to
> think that you need to check *all* cross-chunk links, every time an
> individual chunk is checked.  That's not the case; you only need to
> check the links that go into and out of the dirty chunk.  You also
> don't need to check the other parts of the file outside the chunk,
> except for perhaps reading the byte range info for each continuation
> node and making sure no two continuation inodes think they both have
> the same range, but you don't check the indirect blocks, block
> bitmaps, etc.

I guess I am missing something. If chunkfs maintains the "at most one
continuation per chunk" invariant, then a continuation inode might end up
with multiple byte ranges, and to check that they do not overlap one has
to read indirect blocks (or some equivalent data structure).

Nikita.
Re: [PATCH] Implement renaming for debugfs
On Mon 07-05-07 09:28:30, Greg KH wrote:
> On Fri, May 04, 2007 at 04:14:28PM +0200, Jan Kara wrote:
> > On Thu 03-05-07 17:16:02, Greg KH wrote:
> > > On Thu, May 03, 2007 at 11:54:52AM +0200, Jan Kara wrote:
> > > > On Tue 01-05-07 20:26:27, Greg KH wrote:
> > > > > On Mon, Apr 30, 2007 at 07:55:36PM +0200, Jan Kara wrote:
> > > > > >   Hello,
> > > > > >
> > > > > >   attached patch implements renaming for debugfs. I was asked for this
> > > > > > feature by WLAN guys and I guess it makes sense (they have some debug info
> > > > > > in the directory identified by interface name and that can change...).
> > > > > > Could someone have a look at what I wrote whether it looks reasonable?
> > > > > > Thanks.
> > > > > >
> > > > > > 						Honza
> > > > > >
> > > > > > --
> > > > > > Jan Kara <[EMAIL PROTECTED]>
> > > > > > SuSE CR Labs
> > > > > >
> > > > > > Implement debugfs_rename() to allow renaming files/directories in debugfs.
> > > > >
> > > > > I think you are going to need more infrastructure here, the caller
> > > > > doesn't want to have to allocate a new dentry themselves, they just want
> > > > > to pass in the new filename :)
> > > >   Actually, I wanted the call to be in the spirit of other debugfs calls.
> > > > So we have for example:
> > > > void debugfs_remove(struct dentry *dentry)
> > >
> > > That is because 'debugfs_create' returns a dentry.
> > >
> > > > struct dentry *debugfs_create_dir(const char *name, struct dentry *parent)
> > > > etc.
> > >
> > > Same here, you already have a dentry to place this directory into, _and_
> > > all the user needs to provide is a name for the new directory.  They
> > > don't ever create a dentry themselves, which is what your function
> > > required them to do.
> > >
> > > Try using your function and you'll see what I mean :)
> >   I've tried it when testing the function :). The code looked like:
> > 	dir1 = debugfs_create_dir("dir1", NULL);
> > 	dir2 = debugfs_create_dir("dir2", NULL);
> > 	file1 = debugfs_create_file("file1", 0644, dir1, NULL, NULL);
> > 	file2 = debugfs_rename(dir1, file1, dir2, "new_name");
> >   No new dentries needed to be created...
>
> Ah, ok, sorry, that makes more sense, I missed that the dentry's passed
> in was the new directory location.  This will still work if you use the
> same directory like:
> 	debugfs_rename(dir1, file1, dir1, "new_name");
>
> right?
  Yes, or even:
	debugfs_rename(file1->d_parent, file1, file1->d_parent, "new_name");
(given you have no hardlinks to the file...). So renaming should be really
simple. Actually, I like the original interface slightly more because no new
dentry has to be created (and then dput()) in case you already have the
dentry of the file to rename (which usually seems to be the case).
  Attached is the patch using the original interface - I've fixed some bugs
in it since the first version I've posted...

								Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs

Implement debugfs_rename() to allow renaming files/directories in debugfs.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/debugfs/inode.c linux-2.6.21-rc6-1-debugfs_rename/fs/debugfs/inode.c
--- linux-2.6.21-rc6/fs/debugfs/inode.c	2007-04-10 17:09:55.0 +0200
+++ linux-2.6.21-rc6-1-debugfs_rename/fs/debugfs/inode.c	2007-05-09 12:57:44.0 +0200
@@ -368,6 +368,69 @@ void debugfs_remove(struct dentry *dentr
 }
 EXPORT_SYMBOL_GPL(debugfs_remove);
 
+/**
+ * debugfs_rename - rename a file/directory in the debugfs filesystem
+ * @old_dir: a pointer to the parent dentry for the renamed object. This
+ *	should be a directory dentry.
+ * @old_dentry: dentry of an object to be renamed.
+ * @new_dir: a pointer to the parent dentry where the object should be
+ *	moved. This should be a directory dentry.
+ * @new_name: a pointer to a string containing the target name.
+ *
+ * This function renames a file/directory in debugfs. The target must not
+ * exist for rename to succeed.
+ *
+ * This function will return a pointer to old_dentry (which is updated to
+ * reflect renaming) if it succeeds. If an error occurs, %NULL will be
+ * returned.
+ *
+ * If debugfs is not enabled in the kernel, the value -%ENODEV will be
+ * returned.
+ */
+struct dentry *debugfs_rename(struct dentry *old_dir, struct dentry *old_dentry,
+		struct dentry *new_dir, const char *new_name)
+{
+	int error;
+	struct dentry *dentry = NULL, *trap;
+	const char *old_name;
+
+	trap = lock_rename(new_dir, old_dir);
+	/* Source or destination directories don't exist? */
+	if (!old_dir->d_inode || !new_dir->d_inode)
+		goto exit;
+	/* Source does not exist, cyclic rename, or mountpoint? */
+	if (!old_dentry->d_inode || old_dentry == trap ||
+	    d_mountpoint
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
> > > This looks like it will have the same problem on s390 as
> > > sys_sync_file_range. Maybe the prototype should be:
> > >
> > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
> >
> > Yes, but the trouble is that there was a contrary viewpoint preferring that fd
> > first be maintained as a convention like other syscalls (see the following
> > posts)
>
> Of course the interface used by an application program would have the
> fd first.  Glibc can do the translation.

I think that was understood.

Regards
Suparna

>
> Paul.

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [PATCH 3/3] AFS: Implement basic file write support
Andrew Morton <[EMAIL PROTECTED]> wrote:

> set_page_dirty() will set I_DIRTY_PAGES only.  ie: the inode has dirty
> pagecache data.
>
> To tell the VFS that the inode itself is dirty one needs to run
> mark_inode_dirty().

But what's the difference in this case?  I don't need to write the inode
back per se, and the inode attributes can be updated by the mechanism of
data storage.

David
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
Suparna Bhattacharya writes:
> > This looks like it will have the same problem on s390 as
> > sys_sync_file_range. Maybe the prototype should be:
> >
> > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
>
> Yes, but the trouble is that there was a contrary viewpoint preferring that fd
> first be maintained as a convention like other syscalls (see the following
> posts)

Of course the interface used by an application program would have the
fd first.  Glibc can do the translation.

Paul.
Re: [PATCH 3/3] AFS: Implement basic file write support
On Wed, 09 May 2007 11:25:47 +0100 David Howells <[EMAIL PROTECTED]> wrote:

> > > +	set_page_dirty(page);
> > > +
> > > +	if (PageDirty(page))
> > > +		_debug("dirtied");
> > > +
> > > +	return 0;
> > > +}
> >
> > One would normally run mark_inode_dirty() after any i_size_write()?
>
> Not in this case, I assume, because set_page_dirty() ultimately calls
> __mark_inode_dirty(), but I could be wrong.

set_page_dirty() will set I_DIRTY_PAGES only.  ie: the inode has dirty
pagecache data.

To tell the VFS that the inode itself is dirty one needs to run
mark_inode_dirty().  Or maybe mark_inode_dirty_sync() but I can never for
the life of me remember what that thing does.
Re: [PATCH 1/2] LogFS proper
On Tue, 8 May 2007 17:01:01 -0700, Greg KH wrote:
> On Wed, May 09, 2007 at 01:10:09AM +0200, Jörn Engel wrote:
> >
> > The remaining question is how to deal with kernel-only code that uses
> > be64.  Convert that to __be64 as well?  Or introduce be64 in
> > include/linux/types.h instead?
>
> I say leave it alone for now, it's not that common :)

Using a fairly lame grep, there are 10k instances versus 60k for u64 and
friends.  Subtract about 2.5k used in include/ and possibly part of
userspace interfaces, and that leaves about 7.5k.

[EMAIL PROTECTED]:/usr/src/kernel/logfs$ sgrep '\' .|wc
  60306  313780 3960665
[EMAIL PROTECTED]:/usr/src/kernel/logfs$ sgrep '\<__[lb]e[136][246]\>' .|wc
  10013   52235  635047
[EMAIL PROTECTED]:/usr/src/kernel/logfs$ sgrep '\<__[lb]e[136][246]\>' include|wc
   2624   15100  173176

Actually going through them all, the overwhelming majority is used for
structures.  I seem to be quite the oddball indeed.  Will change.

Jörn

--
The grand essentials of happiness are: something to do, something to
love, and something to hope for.
-- Allan K. Chalmers
Re: [PATCH 3/3] AFS: Implement basic file write support
Andrew Morton <[EMAIL PROTECTED]> wrote:

> > +	BUG_ON(i_size > 0x);	// TODO: use 64-bit store
>
> You're sure this isn't user-triggerable?

Hmm...  I'm not.  I'll whip up a patch for this.

> kmap_atomic() could be used here and is better.

Yeah.  It used to have something that slept in the middle of it, but that's
no longer there.  I'll add to the patch.

> We have this zero_user_page() thing heading in which could perhaps be used
> here also.

Okay.  I'll have a look at it once it's there.

> > +	ASSERTRANGE(wb->first, <=, index, <=, wb->last);
>
> wow.

:-)  The assertions I've put in have been very useful.

> > +	set_page_dirty(page);
> > +
> > +	if (PageDirty(page))
> > +		_debug("dirtied");
> > +
> > +	return 0;
> > +}
>
> One would normally run mark_inode_dirty() after any i_size_write()?

Not in this case, I assume, because set_page_dirty() ultimately calls
__mark_inode_dirty(), but I could be wrong.

> We can invalidate pages and we can truncate them and we can clean them.
> But here we have a new operation, "killing".  I wonder what that is.

I can call it invalidation if you like, though that name is already
reserved as it were :-/  I suppose it might actually make sense for me to
call invalidatepage() myself.

> > +	if (wbc->sync_mode != WB_SYNC_NONE)
> > +		wait_on_page_writeback(page);
>
> Didn't the VFS already do that?

I'm not entirely sure.  Looking at generic_writepages(), I guess so.  I'll
patch it out.

> > +	if (PageWriteback(page) || !PageDirty(page)) {
> > +		unlock_page(page);
> > +		return 0;
> > +	}
>
> And some of that?

Yeah.  Seems so.  I'll patch that out too.

What I'd like to do is ditch writepage() entirely - I'm not sure it's
entirely necessary with the availability of writepages() - but I'll look at
that another time.

> I have this vague prehistoric memory that something can go wrong at the VFS
> level if the address_space writes back more pages than it was asked to.
> But I forget what the issue was and it would be silly to have an issue
> with that anyway.  Something to keep an eye out for.

Okay.

Thanks for the 'cherry-pick'.  I'll hopefully have a revision patch for you
soon.

David
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Tue, 8 May 2007 22:56:09 -0700, Valerie Henson wrote:
>
> I like it too, especially the rmap stuff, but I don't think it solves
> some of the problems chunkfs solves.  The really nice thing about
> chunkfs is that it tries hard to isolate each chunk from all the other
> chunks.  You can think of regular file systems as an OS with one big
> shared address space - any process can potentially modify any other
> process's address space, including the kernel's - and chunkfs as the
> modern UNIX private address space model.  Except in rare worst case
> models (the equivalent of a kernel bug or writing /dev/mem), the only
> way one chunk can affect another chunk is through the narrow little
> interface of the continuation inode.  This severely limits the ability
> of one chunk to corrupt another - the worst you can do is end up with
> the wrong link count on an inode pointed to from another chunk.

This leaves me a bit confused.  Imo the filesystem equivalent of a
process's address space would be permissions and quotas.  Indeed there is
no guarantee where any address space's pages may physically reside.  They
can be in any zone, node or even swap or regular files.  Otoh, each
physical page does have an rmap of some sort - enough to figure out who
currently owns this page.  Does your own analogy work against you?

Back to chunkfs, the really smart idea behind it imo is to take just a
small part of the filesystem, assume that everything else is flawless,
and check the small part under that assumption.  The assumption may be
wrong.  If that wrongness would affect the minimal fsck, it should get
detected as well.  Otherwise it doesn't matter right now.

What I never liked about chunkfs were two things.  First it splits the
filesystem into an array of chunks.  With sufficiently large devices,
either the number or the size of chunks will come close to problematic
again.  Some sort of tree arrangement intuitively makes more sense.
Secondly, the cnodes are... weird, complicated, not well understood, a
hack.  Pick a term.  Avoiding cnodes is harder than avoiding regular
fragmentation, and the recent defragment patches seem to imply we're
doing a bad job at that already.  Linked lists of cnodes - yuck.

Not directly a chunkfs problem, but still unfortunate, is that it still
cannot detect medium errors.  There are no checksums.  Checksums cost
performance, so they obviously have to be optional at the user's choice.
But not even having the option is quite 80's.

Matt's proposal is an alternative solution that can address all of my
concerns.  Instead of cnodes it has the rmap.  That is a very simple
structure I can explain to my nephews.  It allows for checksums, which is
nice as well.  And it does allow for a tree structure of tiles.

Tree structure means that each tile can have free space counters.  A
supertile (or whatever one may call it) can have a free space counter
that is the sum of all member free space counters.  And so forth upwards.
Same for dirty bits and anything else I've forgotten.  So individual
tiles can be significantly smaller than chunks in chunkfs.  Checking them
is significantly faster than checking a chunk.  There will be more dirty
tiles at any given time, but a better way to look at it is to say that
for any dirty chunk in chunkfs, tilefs has some dirty and some clean
tiles.  So the overall ratio of dirty space is never higher and almost
always lower.

Overall I almost envy Matt for having this idea.  In hindsight it should
have been obvious to me.  But then again, in hindsight the fsck problem
and using divide and conquer should have been obvious to everyone and
iirc you were the only one who seriously pursued the idea and got all
this frenzy started. :)

Jörn

--
Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.
-- M.A. Jackson
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote:
> Andrew Morton writes:
> > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> >
> > > This patch implements the fallocate() system call and adds support for
> > > i386, x86_64 and powerpc.
> > >
> > > ...
> > >
> > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> >
> > Please add a comment over this function which specifies its behaviour.
> > Really it should be enough material from which a full manpage can be
> > written.
>
> This looks like it will have the same problem on s390 as
> sys_sync_file_range. Maybe the prototype should be:
>
> asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Yes, but the trouble is that there was a contrary viewpoint preferring that
fd first be maintained as a convention like other syscalls (see the
following posts):

http://marc.info/?l=linux-fsdevel&m=117585330016809&w=2 (Andreas)
http://marc.info/?l=linux-fsdevel&m=117690157917378&w=2 (Andreas)
http://marc.info/?l=linux-fsdevel&m=117578821827323&w=2 (Randy)

So we are kind of deadlocked, aren't we?

The debates on the proposed solution for s390:

http://marc.info/?l=linux-fsdevel&m=117760995610639&w=2
http://marc.info/?l=linux-fsdevel&m=117708124913098&w=2
http://marc.info/?l=linux-fsdevel&m=117767607229807&w=2

Are there any better ideas?

Regards
Suparna

>
> Paul.

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
>
> This does mean that our time to make progress on a check is bounded at
> the top by the size of our largest file. If we have a degenerate
> filesystem filled with a single file, this will in fact take as long
> as a conventional fsck. If your filesystem has, say, 100 roughly
> equally-sized files, you're back in Chunkfs territory.

Hm, I'm not sure that everyone understands a particular subtlety of how
the fsck algorithm works in chunkfs.  A lot of people seem to think that
you need to check *all* cross-chunk links, every time an individual chunk
is checked.  That's not the case; you only need to check the links that go
into and out of the dirty chunk.  You also don't need to check the other
parts of the file outside the chunk, except for perhaps reading the byte
range info for each continuation inode and making sure no two continuation
inodes think they both have the same range, but you don't check the
indirect blocks, block bitmaps, etc.

There is one case where you do need to do a full check of all links that
cross chunks - if a continuation inode's pointers have been corrupted, you
might end up with orphan continuation inodes or dangling links in other
chunks.  I expect that to be relatively rare.

> So we should have no trouble checking an exabyte-sized filesystem on a
> 4MB box. Even if it has one exabyte-sized file! We check the first
> tile, see that it points to our file, then iterate through that file,
> checking that the forward and reverse pointers for each block match
> and all CRCs match, etc. We cache the file's inode as clean, finish
> checking anything else in the first tile, then mark it clean. When we
> get to the next tile (and the next billion after that!), we notice that
> each block points back to our cached inode and skip rechecking it.

If I understand correctly then, if you do have a one-exabyte file, and any
part of it is in a dirty tile, you will need to check the whole file?  Or
will Joern's fpos proposal fix this?

This is interesting stuff, thanks!

-VAL