Re: ChunkFS - measuring cross-chunk references
Hi,

The tool estimates the cross-chunk references from an ext2/3 file system. It treats each block group as one chunk and calculates how many block groups each file spans. So the block group size gives the estimate of the chunk size.

The file systems were aged for about 3-4 months on a developer's laptop.

I should have given the background before. Below is the explanation of the tool. Valh and others came up with this idea.

---

Chunkfs will only work if we have few cross-chunk references. We can estimate the effect of chunk size on the number of these references using an existing ext2/3 file system and treating the block groups as though they are chunks. The basic idea is that we figure out where the block group boundaries are and then find out which files and directories span two or more block groups.

Step 1:
-------
Get a real-world ext2/3 file system. A file system which has been in use is required. One from a laptop or a server of any sort will do fine.

Step 2:
-------
Figure out where the block group boundaries are on disk. Two things need to be known:

1. Which inode numbers are in which block group?
2. Which blocks are in which block group?

At the end of this step we should have a list that looks something like:

Block group 1: Inodes 11-343, blocks 1000-2
Block group 2: Inodes 344-576, blocks 2-4
[...]

Step 3:
-------
For each file, get the inode number and use the mapping from step 2 to figure out which block group it is in. Now use bmap() on each block in the file to find out its block number. Use the mapping from step 2 to figure out which block groups the file has data in. For each file, record the list of all block groups.

For each directory, get the inode number and map that to a block group. Then get the inode numbers of all entries in the directory (ignore symlinks) and map them to block groups. For each directory, record the list of all block groups.

Step 4:
-------
Count the number of cross-chunk references this file system would need. This is done by going through each directory and file, and adding up the number of block groups it uses MINUS one. So if a file was in block groups 3, 7, and 24, then you would add 2 to the total number of cross-chunk references. If a file was only in block group 2, then you would add 0 to the total.

On 4/22/07, Amit Gud [EMAIL PROTECTED] wrote:
> Karuna sagar K wrote:
> > Hi,
> >
> > The attached code contains a program to estimate the cross-chunk references for the ChunkFS file system (idea from Valh). Below are the results:
>
> Nice to see some numbers! But it would be really nice to know:
> - what the chunk size is
> - how the files were created or, more vaguely, how 'aged' the fs is
> - what the chunk allocation algorithm is
>
> Best,
> AG
> --
> May the source be with you.
> http://www.cis.ksu.edu/~gud

Thanks,
Karuna
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
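The mapping in step 2 and the counting rule in step 4 of the message above can be sketched in user-space C. This is only an illustration, not the attached tool: the function names are hypothetical, and the sketch assumes the usual ext2 layout in which inode numbers start at 1 and every group holds a fixed number of inodes and blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for block group numbers. */
static int cmp_group(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Step 2: block number -> block group (groups are numbered from 0 here). */
unsigned block_to_group(unsigned long block, unsigned long first_data_block,
                        unsigned long blocks_per_group)
{
    assert(blocks_per_group > 0 && block >= first_data_block);
    return (unsigned)((block - first_data_block) / blocks_per_group);
}

/* Step 2: inode number -> block group (ext2 inode numbers start at 1). */
unsigned inode_to_group(unsigned long ino, unsigned long inodes_per_group)
{
    assert(inodes_per_group > 0 && ino >= 1);
    return (unsigned)((ino - 1) / inodes_per_group);
}

/*
 * Step 4 for one file or directory: the number of distinct block groups
 * it touches, minus one. The groups array is sorted in place.
 */
size_t cross_chunk_refs(unsigned *groups, size_t n)
{
    size_t distinct = 0;

    if (n == 0)
        return 0;
    qsort(groups, n, sizeof(groups[0]), cmp_group);
    for (size_t i = 0; i < n; i++)
        if (i == 0 || groups[i] != groups[i - 1])
            distinct++;
    return distinct - 1;
}
```

Summing cross_chunk_refs() over every file and directory gives the total from step 4. A file recorded in groups 3, 7 and 24 contributes 2, and a file that lives only in group 2 contributes 0.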
Re: Testing framework
On Mon, 2007-04-23 at 02:16 +0530, Karuna sagar K wrote:
> Hi,
>
> For some time I have been working on this file system test framework. Now I have an implementation for the same and below is the explanation. Any comments are welcome.
>
> Introduction:
> The testing tools and benchmarks available around do not take into account the repair and recovery aspects of file systems. The test framework described here focuses on repair and recovery capabilities of file systems. Since most file systems use 'fsck' to recover from file system inconsistencies, the test framework characterizes file systems based on outcomes of running 'fsck'.

[snip]

> Higher level perspective/approach:
> In this approach the file system is viewed as a tree of nodes, where nodes are either files or directories. The metadata corresponding to some randomly chosen nodes of the tree is corrupted. Nodes which are corrupted are marked or recorded so that the corruption can be replayed later. This file system is called the source file system, while the file system on which we need to replay the corruption is called the target file system. The assumption is that the target file system contains a set of files and directories which is a superset of that in the source file system. Hence to replay the corruption we need to point out which nodes were corrupted in the source file system and corrupt the corresponding nodes in the target file system.
>
> A major disadvantage with this approach is that on-disk structures (like superblocks, block group descriptors, etc.) are not considered for corruption.
>
> Lower level perspective/approach:
> The file system is looked upon as a set of blocks (more precisely metadata blocks). We randomly choose from this set of blocks to corrupt. Hence we would be able to overcome the deficiency of the previous approach. However this approach makes it difficult to have a replayable corruption. Further thought about this approach has to be given.

Fill a test filesystem with data and save it. Corrupt it by copying a chunk of data from a random location A to another random location B. Save positions A and B so that you can reproduce the corruption.

Or corrupt random bits (ideally in metadata blocks) and maintain the list of the bit numbers for reproducing the corruption.

> We could have a blend of both the approaches in the program to compromise between corruption and replayability.
>
> Repair Phase:
> The corrupted file system is repaired and recovered with 'fsck' or any other tools; this phase considers the repair and recovery action on the file system as a black box. The time taken to repair by the tool is measured.

I see that you are running fsck just once on the test filesystem. It might be a good idea to run it twice; if the second fsck does not find the filesystem to be completely clean, that means there is a bug in fsck.

[snip]

> Summary Phase:
> This is the final phase in the model. A report file is prepared which summarizes the result of this test run. The summary contains:
> - Average time taken for recovery
> - Number of files lost at the end of each iteration
> - Number of files with metadata corruption at the end of each iteration
> - Number of files with data corruption at the end of each iteration
> - Number of files lost and found at the end of each iteration
>
> Putting it all together:
> The Corruption, Repair and Comparison phases could be repeated a number of times (each repetition is called an iteration) before the summary of that test run is prepared.
>
> TODO:
> - Account for files in the lost+found directory during the comparison phase.
> - Support for other file systems (only ext2 is supported currently).
> - The state of either file system is stored, which may be huge, time consuming and not necessary. So, we could have better ways of storing the state.

Also, people may want to test with different mount options, so something like

mount -t $fstype -o loop,$MOUNT_OPTIONS $imgname $mountpt

may be useful. Similarly it may also be useful to have MKFS_OPTIONS while formatting the filesystem.

Thanks,
Kalpak.

> Comments are welcome!!
>
> Thanks,
> Karuna
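The "copy a chunk from A to B and save the positions" suggestion above can be sketched as a small corruption log. This is a hypothetical illustration, not code from the framework: the struct and function names are made up, and it assumes the log is replayed against a byte-identical copy of the saved image, since a freshly populated file system may place different data at the same offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One recorded corruption: copy len bytes from offset src to offset dst. */
struct corruption {
    size_t src;
    size_t dst;
    size_t len;
};

/*
 * Apply one recorded corruption to an in-memory image.
 * Returns 0 on success, -1 if a range falls outside the image.
 */
int apply_corruption(unsigned char *img, size_t img_size,
                     const struct corruption *c)
{
    assert(img != NULL && c != NULL);
    if (c->len > img_size || c->src > img_size - c->len ||
        c->dst > img_size - c->len)
        return -1;
    memmove(img + c->dst, img + c->src, c->len); /* ranges may overlap */
    return 0;
}

/* Replay a whole log in order; stops at the first invalid record. */
int replay_corruptions(unsigned char *img, size_t img_size,
                       const struct corruption *log, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (apply_corruption(img, img_size, &log[i]) != 0)
            return -1;
    return 0;
}
```

Replaying the same log on two copies of the same image yields identical corrupted images, which is the reproducibility property the suggestion is after.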
Re: Testing framework
On 4/23/07, Kalpak Shah [EMAIL PROTECTED] wrote:
> On Mon, 2007-04-23 at 02:16 +0530, Karuna sagar K wrote:
> > [...]
> > The file system is looked upon as a set of blocks (more precisely metadata blocks). We randomly choose from this set of blocks to corrupt. Hence we would be able to overcome the deficiency of the previous approach. However this approach makes it difficult to have a replayable corruption. Further thought about this approach has to be given.
>
> Fill a test filesystem with data and save it. Corrupt it by copying a chunk of data from random locations A to B. Save positions A and B so that you can reproduce the corruption.

Hey, that's a nice idea :). But this wouldn't reproduce the same corruption, right? Because, say, on the first run of the tool there is metadata stored at locations A and B, and then on the second run there may be user data present there. I mean the allocation may be different.

> Or corrupt random bits (ideally in metadata blocks) and maintain the list of the bit numbers for reproducing the corruption.
>
> > [...]
> > The corrupted file system is repaired and recovered with 'fsck' or any other tools; this phase considers the repair and recovery action on the file system as a black box. The time taken to repair by the tool is measured.
>
> I see that you are running fsck just once on the test filesystem. It might be a good idea to run it twice and if second fsck does not find the filesystem to be completely clean that means it is a bug in fsck.

You are right. Will modify that.

> > [snip]
> > The state of either file system is stored, which may be huge, time consuming and not necessary. So, we could have better ways of storing the state.
>
> Also, people may want to test with different mount options, so something like
>
> mount -t $fstype -o loop,$MOUNT_OPTIONS $imgname $mountpt
>
> may be useful. Similarly it may also be useful to have MKFS_OPTIONS while formatting the filesystem.

Right. I didn't think of that. Will look into it.

> Thanks,
> Kalpak.
>
> > Comments are welcome!!
> >
> > Thanks,
> > Karuna

Thanks,
Karuna
[RFC][PATCH] ChunkFS: fs fission for faster fsck
This is an initial implementation of the ChunkFS technique, briefly discussed at: http://lwn.net/Articles/190222 and http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

This implementation is done within the ext2 driver. Every chunk is an independent ext2 file system. The knowledge about chunks is kept within ext2, and 'continuation inodes', which are used to allow files and directories to span multiple chunks, are managed within ext2.

At mount time, super blocks for all the chunks are created and linked with the global super_blocks list maintained by VFS. This allows independent behavior of individual chunks and also helps writebacks to happen seamlessly.

Apart from this, chunkfs code in ext2 effectively only provides knowledge of:

- which inode's which block number to look for, for a given file's logical block number
- in which chunk to allocate the next inode / block
- number of inodes to scan when a directory is being read

To maintain ext2's inode number uniqueness property, the 8 msb bits of the inode number are used to indicate the chunk number in which it resides.

As said, this is a preliminary implementation and lots of changes are expected before this code is even sanely usable. Some known issues and obvious optimizations are listed in the TODO file in the chunkfs patch.

http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
- one big patch
- applies to 2.6.18

Attached - ext2-chunkfs-diff.patch.gz
- since the code is a spin-off of ext2, this patch explains better what has changed from ext2.

git://cislinux.cis.ksu.edu/chunkfs-tools
- mkfs, and fsck for chunkfs.

http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
- config file used; tested mostly on UML with loopback file systems.

NOTE: No xattrs and xips yet; CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP should be "no" for a clean compile.

Please comment, suggest, criticize. Patches most welcome.

Best,
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

Attachment: ext2-chunkfs-diff.patch.gz
Description: Binary data
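The inode numbering described in the message above (the 8 most significant bits carry the chunk number) might look roughly like the following. This is a sketch only: it assumes a 32-bit inode number, and the macro and function names are hypothetical rather than taken from the chunkfs patch.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_BITS 8u                      /* from the post: 8 msb bits */
#define LOCAL_BITS (32u - CHUNK_BITS)      /* assumed 32-bit inode number */
#define LOCAL_MASK ((UINT32_C(1) << LOCAL_BITS) - 1)

/* Combine a chunk number and a chunk-local inode number. */
uint32_t make_ino(uint32_t chunk, uint32_t local_ino)
{
    assert(chunk < (UINT32_C(1) << CHUNK_BITS));
    assert(local_ino <= LOCAL_MASK);
    return (chunk << LOCAL_BITS) | local_ino;
}

/* Chunk in which the inode resides. */
uint32_t ino_chunk(uint32_t ino)
{
    return ino >> LOCAL_BITS;
}

/* Inode number as seen inside that chunk's own ext2 file system. */
uint32_t ino_local(uint32_t ino)
{
    return ino & LOCAL_MASK;
}
```

Under these assumptions the scheme allows at most 256 chunks and about 16.7 million inodes per chunk, and two chunks can reuse the same local inode number without colliding in the combined namespace.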
Re: 2.6.21-rc7 new aops patchset
Nick,

Thanks for converting fuse, and testing. Here's a minor update to fs-fuse-aops.patch.

Miklos

Convert fuse to new aops.

[mszeredi]
 - don't send zero length write requests
 - it is not legal for the filesystem to return with zero written bytes

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c	2007-04-23 12:04:10.0 +0200
+++ linux/fs/fuse/file.c	2007-04-23 13:56:48.0 +0200
@@ -443,22 +443,25 @@ static size_t fuse_send_write(struct fus
 	return outarg.size;
 }
 
-static int fuse_prepare_write(struct file *file, struct page *page,
-			      unsigned offset, unsigned to)
-{
-	/* No op */
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+	*pagep = __grab_cache_page(mapping, index);
+	if (!*pagep)
+		return -ENOMEM;
 	return 0;
 }
 
-static int fuse_commit_write(struct file *file, struct page *page,
-			     unsigned offset, unsigned to)
+static int fuse_buffered_write(struct file *file, struct inode *inode,
+			       loff_t pos, unsigned count, struct page *page)
 {
 	int err;
 	size_t nres;
-	unsigned count = to - offset;
-	struct inode *inode = page->mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	loff_t pos = page_offset(page) + offset;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
 	struct fuse_req *req;
 
 	if (is_bad_inode(inode))
@@ -474,20 +477,35 @@ static int fuse_commit_write(struct file
 	nres = fuse_send_write(req, file, inode, pos, count);
 	err = req->out.h.error;
 	fuse_put_request(fc, req);
-	if (!err && nres != count)
+	if (!err && !nres)
 		err = -EIO;
 	if (!err) {
-		pos += count;
+		pos += nres;
 		spin_lock(&fc->lock);
 		if (pos > inode->i_size)
 			i_size_write(inode, pos);
 		spin_unlock(&fc->lock);
 
-		if (offset == 0 && to == PAGE_CACHE_SIZE)
+		if (count == PAGE_CACHE_SIZE)
 			SetPageUptodate(page);
 	}
 	fuse_invalidate_attr(inode);
-	return err;
+	return err ? err : nres;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+			  loff_t pos, unsigned len, unsigned copied,
+			  struct page *page, void *fsdata)
+{
+	struct inode *inode = mapping->host;
+	int res = 0;
+
+	if (copied)
+		res = fuse_buffered_write(file, inode, pos, copied, page);
+
+	unlock_page(page);
+	page_cache_release(page);
+	return res;
 }
 
 static void fuse_release_user_pages(struct fuse_req *req, int write)
@@ -817,8 +835,8 @@ static const struct file_operations fuse
 
 static const struct address_space_operations fuse_file_aops = {
 	.readpage	= fuse_readpage,
-	.prepare_write	= fuse_prepare_write,
-	.commit_write	= fuse_commit_write,
+	.write_begin	= fuse_write_begin,
+	.write_end	= fuse_write_end,
 	.readpages	= fuse_readpages,
 	.set_page_dirty	= fuse_set_page_dirty,
 	.bmap		= fuse_bmap,
Re: Testing framework
On Mon, 2007-04-23 at 02:16 +0530, Karuna sagar K wrote:
> For some time I have been working on this file system test framework. Now I have an implementation for the same and below is the explanation. Any comments are welcome.

[snip]

You may want to check out the paper "EXPLODE: A Lightweight, General System for Finding Serious Storage System Errors" from OSDI 2006 (if you haven't already). The idea sounds very similar to me, although I haven't read all the details of your proposal.

Avishay
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote:
> This is an initial implementation of ChunkFS technique, briefly discussed at: http://lwn.net/Articles/190222 and http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
>
> [snip]
>
> http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
> - one big patch
> - applies to 2.6.18

Could you send this out as a patch to the ext2 codebase, so we can just look at the changes for chunkfs? That might also make it small enough to inline your patch in email for review.

What kind of results are you planning to gather to evaluate/optimize this?

Regards
Suparna

> Attached - ext2-chunkfs-diff.patch.gz
> - since the code is a spin-off of ext2, this patch explains better what has changed from the ext2.
>
> [snip]

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Mon, Apr 23, 2007 at 09:58:49PM +0530, Suparna Bhattacharya wrote:
> On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote:
> > [snip]
> > http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
> > - one big patch
> > - applies to 2.6.18
>
> Could you send this out as a patch to ext2 codebase, so we can just look at the changes for chunkfs ? That might also make it small enough to inline your patch in email for review.

Sorry, I missed the part about ext2-chunkfs-diff below.

Regards
Suparna

> What kind of results are you planning to gather to evaluate/optimize this ?
>
> Regards
> Suparna
>
> > Attached - ext2-chunkfs-diff.patch.gz
> > - since the code is a spin-off of ext2, this patch explains better what has changed from the ext2.
> >
> > [snip]

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
Re: ChunkFS - measuring cross-chunk references
On Apr 23, 2007 15:04 +0530, Kalpak Shah wrote:
> On Mon, 2007-04-23 at 12:49 +0530, Karuna sagar K wrote:
> > The tool estimates the cross-chunk references from an ext2/3 file system. It treats each block group as one chunk and calculates how many block groups each file spans. So the block group size gives the estimate of the chunk size. The file systems were aged for about 3-4 months on a developer's laptop.
>
> With a blocksize of 4KB, a block group would be 128 MB. In the original Chunkfs paper, Valh had mentioned 1GB chunks and I believe it will be possible to use 2GB, 4GB or 8GB chunks in the future. As the chunk size increases the number of cross-chunk references will reduce, and hence it might be a good idea to present these statistics considering different chunk sizes starting from 512MB up to 2GB.

Also, given that cross-chunk references will be more expensive to fix, I can imagine the allocation policy for chunkfs will try to avoid this if possible, further reducing the number of cross-chunk inodes. I guess it should be made clearer whether the cross-chunk references are due to inode block references, or because of e.g. directories referencing inodes in another chunk.

Also, is it considered a cross-chunk reference if a directory entry is referencing an inode in another group? Should there be a continuation inode in the local group, or is the directory entry itself enough?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
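The 128 MB figure follows from the ext2/3 layout: one block bitmap occupies a single block, so with default settings a group covers 8 * blocksize blocks. The sketch below works through that arithmetic and shows one way block groups could be merged to emulate the larger chunk sizes suggested above. The function names are hypothetical, and the default mke2fs blocks-per-group is assumed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes covered by one block group, assuming the default layout in which
 * a one-block bitmap tracks 8 * block_size blocks.
 */
uint64_t block_group_bytes(uint64_t block_size)
{
    assert(block_size > 0);
    return 8 * block_size * block_size;
}

/* Number of whole block groups that make up one chunk of chunk_bytes. */
uint64_t groups_per_chunk(uint64_t chunk_bytes, uint64_t block_size)
{
    return chunk_bytes / block_group_bytes(block_size);
}

/* Chunk index of a block group when adjacent groups are merged. */
uint64_t group_to_chunk(uint64_t group, uint64_t groups_in_chunk)
{
    assert(groups_in_chunk > 0);
    return group / groups_in_chunk;
}
```

With 4 KB blocks, a 512 MB chunk corresponds to 4 block groups and a 2 GB chunk to 16, so the same per-file group lists could be re-bucketed with group_to_chunk() before counting references.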
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Suparna Bhattacharya wrote:
> Could you send this out as a patch to ext2 codebase, so we can just look at the changes for chunkfs ? That might also make it small enough to inline your patch in email for review.
>
> What kind of results are you planning to gather to evaluate/optimize this ?

Mainly I'm trying to gather the following:

- Graph of continuation inodes vs. the file system fragmentation (or aging) factor, with varying configurations of chunk sizes
- Graph of wall clock time vs. disk size + data on the disk, with both chunkfs and native ext2, and/or other file systems

AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
Re: Testing framework
Avishay Traeger wrote:
> On Mon, 2007-04-23 at 02:16 +0530, Karuna sagar K wrote:
> > For some time I had been working on this file system test framework. Now I have a implementation for the same and below is the explanation. Any comments are welcome.
>
> [snip]
>
> You may want to check out the paper "EXPLODE: A Lightweight, General System for Finding Serious Storage System Errors" from OSDI 2006 (if you haven't already). The idea sounds very similar to me, although I haven't read all the details of your proposal.
>
> Avishay

It would also be interesting to use the disk error injection patches that Mark Lord sent out recently to introduce real sector level corruption. When your file systems are large enough and old enough, getting bad sectors and IO errors during an fsck stresses things in interesting ways ;-)

ric
Re: ChunkFS - measuring cross-chunk references
On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:
> > With a blocksize of 4KB, a block group would be 128 MB. In the original Chunkfs paper, Valh had mentioned 1GB chunks and I believe it will be possible to use 2GB, 4GB or 8GB chunks in the future. As the chunk size increases the number of cross-chunk references will reduce and hence it might be a good idea to present these statistics considering different chunk sizes starting from 512MB upto 2GB.
>
> Also, given that cross-chunk references will be more expensive to fix, I can imagine the allocation policy for chunkfs will try to avoid this if possible, further reducing the number of cross-chunk inodes. I guess it should be more clear whether the cross-chunk references are due to inode block references, or because of e.g. directories referencing inodes in another chunk.

It would also be good to distinguish between directories referencing files in another chunk, and directories referencing subdirectories in another chunk (which would be simpler to handle, given the topological restrictions on directories, as compared to files and hard links).

There may also be special things we will need to do to handle scenarios such as BackupPC, where if it looks like a directory contains a huge number of hard links to a particular chunk, we'll need to make sure that directory is either created in the right chunk (possibly with hints from the application) or migrated to the right chunk (but this might cause the inode number of the directory to change --- maybe we allow this as long as the directory has never been stat'ed, so that the inode number has never been observed).

The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow it on 64-bit systems, or we need to consider a migration so that even on 32-bit platforms, stat() behaves like stat64(), in that it uses a stat structure which returns a 64-bit ino_t.

- Ted
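The breakdown asked for in this sub-thread (directory entries pointing at files in another chunk versus at subdirectories in another chunk) could be collected with a small tally like the one below. The struct and function names are hypothetical, and the caller is assumed to have already mapped the directory and each entry's target inode to chunk numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counters for one pass over all directory entries. */
struct ref_stats {
    size_t local;      /* directory and target inode share a chunk */
    size_t to_file;    /* cross-chunk, target is not a directory */
    size_t to_subdir;  /* cross-chunk, target is a subdirectory */
};

/* Classify one directory entry and bump the matching counter. */
void tally_dirent(struct ref_stats *s, unsigned dir_chunk,
                  unsigned target_chunk, bool target_is_dir)
{
    assert(s != NULL);
    if (dir_chunk == target_chunk)
        s->local++;
    else if (target_is_dir)
        s->to_subdir++;
    else
        s->to_file++;
}
```

Reporting to_file and to_subdir separately, alongside the cross-chunk references caused by data blocks, would show which kind of reference dominates on an aged file system.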
Re: 2.6.21-rc7 new aops patchset
On Mon, Apr 23, 2007 at 02:17:55PM +0200, Miklos Szeredi wrote:
> Nick,
>
> Thanks for converting fuse, and testing. Here's a minor update to fs-fuse-aops.patch.
>
> Miklos
>
> Convert fuse to new aops.
>
> [mszeredi]
>  - don't send zero length write requests
>  - it is not legal for the filesystem to return with zero written bytes
>
> Signed-off-by: Nick Piggin [EMAIL PROTECTED]
> Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Thanks, applied.
Re: ChunkFS - measuring cross-chunk references
> The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow

does it? I'd think it needs a chunk space number and a 32 bit local inode number ;) (same for blocks)
Re: ChunkFS - measuring cross-chunk references
On Mon, 23 Apr 2007, Arjan van de Ven wrote:
> > The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow
>
> does it? I'd think it needs a chunk space number and a 32 bit local inode number ;) (same for blocks)

For inodes, yes: either a 64-bit inode number or some field for the id of the chunk in which the inode resides. But for block numbers, you don't, because individual chunks manage part of the whole file system in an independent way. They have their block bitmaps starting at an offset. Inode bitmaps, however, remain the same.

AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
Re: ChunkFS - measuring cross-chunk references
On Mon, 23 Apr 2007, Amit Gud wrote:
> On Mon, 23 Apr 2007, Arjan van de Ven wrote:
> > > The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow
> >
> > does it? I'd think it needs a chunk space number and a 32 bit local inode number ;) (same for blocks)
>
> For inodes, yes, either 64-bit inode or some field for the chunk id in which the inode is. But for block numbers, you don't. Because individual chunks manage part of the whole file system in an independent way. They have their block bitmaps starting at an offset. Inode bitmaps, however, remains same.

In that sense, we can also do without having the chunk identifier encoded into the inode number, and chunkfs would still be fine with it. But we would then lose the inode uniqueness property, which could well be OK, as it is with other file systems in which the inode number is not sufficient for unique identification of an inode.

AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
[patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1
Hi,

these patches are against 2.6.21-rc6-mm1. Aside from OCFS2, there were no major clashes between -mm and mainline diffs, which is nice.

These patches aim to solve the long standing buffered write deadlocks, and then go on to introduce a pair of new write a_op methods which allow the deadlock to be solved without taking the performance hit of the backwards compatible solutions using the old APIs.

Reiserfs (and Reiser4, in -mm) are the only filesystems left unconverted, although there are a number of less common ones still untested.

Thanks,
Nick
[patch 05/44] mm: debug write deadlocks
Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the difficult race where the page may be unmapped before calling copy_from_user. Makes the race much easier to hit.

This is useful for demonstration and testing purposes, but is removed in a subsequent patch.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1984,6 +1984,7 @@ generic_file_buffered_write(struct kiocb
 		if (maxlen > bytes)
 			maxlen = bytes;
 
+#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -1991,6 +1992,7 @@ generic_file_buffered_write(struct kiocb
 		 * up-to-date.
 		 */
 		fault_in_pages_readable(buf, maxlen);
+#endif
 
 		page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
 		if (!page) {

--
[patch 02/44] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6
From: Andrew Morton [EMAIL PROTECTED]

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we also revert.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |    9 +--------
 mm/filemap.h |    4 ++--
 2 files changed, 3 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2001,12 +2001,6 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
-		if (unlikely(bytes == 0)) {
-			status = 0;
-			copied = 0;
-			goto zero_length_segment;
-		}
-
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status)) {
 			loff_t isize = i_size_read(inode);
@@ -2036,8 +2030,7 @@ generic_file_buffered_write(struct kiocb
 			page_cache_release(page);
 			continue;
 		}
-zero_length_segment:
-		if (likely(copied >= 0)) {
+		if (likely(copied > 0)) {
 			if (!status)
 				status = copied;
 
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
 	const struct iovec *iov = *iovp;
 	size_t base = *basep;
 
-	do {
+	while (bytes) {
 		int copy = min(bytes, iov->iov_len - base);
 
 		bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
 			iov++;
 			base = 0;
 		}
-	} while (bytes);
+	}
 	*iovp = iov;
 	*basep = base;
 }

--
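The filemap.h hunk above restores a plain `while (bytes)` loop for stepping through an iovec. A user-space analogue of that loop is sketched below to show what it does with a zero-length advance. The function name advance_iovec() is hypothetical, and the caller is assumed to pass a byte count that does not run past the end of the vector.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Advance an (iovec pointer, offset) cursor by bytes, moving on to the
 * next segment whenever the current one is used up. With bytes == 0 the
 * loop body never runs and the cursor is left untouched.
 */
void advance_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)
{
    assert(iovp != NULL && *iovp != NULL && basep != NULL);

    const struct iovec *iov = *iovp;
    size_t base = *basep;

    while (bytes) {
        size_t room = iov->iov_len - base;
        size_t copy = bytes < room ? bytes : room;

        bytes -= copy;
        base += copy;
        if (iov->iov_len == base) {
            iov++;
            base = 0;
        }
    }
    *iovp = iov;
    *basep = base;
}
```

For three segments of 4, 4 and 8 bytes, advancing by 6 from the start leaves the cursor 2 bytes into the second segment, and advancing by 2 more moves it to the start of the third.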
[patch 01/44] mm: revert KERNEL_DS buffered write optimisation
Revert the patch from Neil Brown to optimise NFSD writev handling.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Cc: Neil Brown [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   32 +++++++++++++-------------------
 1 file changed, 13 insertions(+), 19 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1980,27 +1980,21 @@ generic_file_buffered_write(struct kiocb
 		/* Limit the size of the copy to the caller's write size */
 		bytes = min(bytes, count);
 
-		/* We only need to worry about prefaulting when writes are from
-		 * user-space.  NFSd uses vfs_writev with several non-aligned
-		 * segments in the vector, and limiting to one segment a time is
-		 * a noticeable performance for re-write
+		/*
+		 * Limit the size of the copy to that of the current segment,
+		 * because fault_in_pages_readable() doesn't know how to walk
+		 * segments.
 		 */
-		if (!segment_eq(get_fs(), KERNEL_DS)) {
-			/*
-			 * Limit the size of the copy to that of the current
-			 * segment, because fault_in_pages_readable() doesn't
-			 * know how to walk segments.
-			 */
-			bytes = min(bytes, cur_iov->iov_len - iov_base);
+		bytes = min(bytes, cur_iov->iov_len - iov_base);
+
+		/*
+		 * Bring in the user page that we will copy from _first_.
+		 * Otherwise there's a nasty deadlock on copying from the
+		 * same page as we're writing to, without it being marked
+		 * up-to-date.
+		 */
+		fault_in_pages_readable(buf, bytes);
 
-			/*
-			 * Bring in the user page that we will copy from
-			 * _first_.  Otherwise there's a nasty deadlock on
-			 * copying from the same page as we're writing to,
-			 * without it being marked up-to-date.
-			 */
-			fault_in_pages_readable(buf, bytes);
-		}
 		page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
 		if (!page) {
 			status = -ENOMEM;
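To make the effect of this revert concrete: after it, the per-iteration copy length is again clamped three ways (rest of the page, rest of the write, rest of the current segment) regardless of KERNEL_DS. A small userspace sketch of that arithmetic follows. `copy_len()` and `SKETCH_PAGE_SIZE` are invented for illustration, and the 4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size, illustration only */

static size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Bytes to copy in one loop iteration: never more than what is left
 * in the page, in the caller's write, or in the current iovec segment.
 */
static size_t copy_len(unsigned long long pos, size_t count,
		       size_t seg_len, size_t seg_off)
{
	size_t bytes;

	assert(seg_off <= seg_len);

	bytes = SKETCH_PAGE_SIZE - (size_t)(pos & (SKETCH_PAGE_SIZE - 1));
	bytes = min_size(bytes, count);			/* caller's write size */
	bytes = min_size(bytes, seg_len - seg_off);	/* current segment */
	return bytes;
}
```

For example, at file position 4000 with 10000 bytes still to write, a segment with 40 bytes left limits the step to 40, while a large segment lets the step run to the end of the page (96 bytes).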
[patch 04/44] mm: clean up buffered write code
From: Andrew Morton [EMAIL PROTECTED]

Rename some variables and fix some types.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   35 ++++++++++++++++++-----------------
 1 file changed, 18 insertions(+), 17 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1944,16 +1944,15 @@ generic_file_buffered_write(struct kiocb
 		size_t count, ssize_t written)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space * mapping = file->f_mapping;
+	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode	*inode = mapping->host;
 	long		status = 0;
 	struct page	*page;
 	struct page	*cached_page = NULL;
-	size_t		bytes;
 	struct pagevec	lru_pvec;
 	const struct iovec *cur_iov = iov; /* current iovec */
-	size_t		iov_base = 0;	   /* offset in the current iovec */
+	size_t		iov_offset = 0;	   /* offset in the current iovec */
 	char __user	*buf;
 
 	pagevec_init(&lru_pvec, 0);
@@ -1964,31 +1963,33 @@ generic_file_buffered_write(struct kiocb
 	if (likely(nr_segs == 1))
 		buf = iov->iov_base + written;
 	else {
-		filemap_set_next_iovec(&cur_iov, &iov_base, written);
-		buf = cur_iov->iov_base + iov_base;
+		filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+		buf = cur_iov->iov_base + iov_offset;
 	}
 
 	do {
-		unsigned long index;
-		unsigned long offset;
-		unsigned long maxlen;
-		size_t copied;
+		pgoff_t index;		/* Pagecache index for current page */
+		unsigned long offset;	/* Offset into pagecache page */
+		unsigned long maxlen;	/* Bytes remaining in current iovec */
+		size_t bytes;		/* Bytes to write to page */
+		size_t copied;		/* Bytes copied from user */
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
 
+		maxlen = cur_iov->iov_len - iov_offset;
+		if (maxlen > bytes)
+			maxlen = bytes;
+
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		maxlen = cur_iov->iov_len - iov_base;
-		if (maxlen > bytes)
-			maxlen = bytes;
 		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
@@ -2019,7 +2020,7 @@ generic_file_buffered_write(struct kiocb
 							buf, bytes);
 		else
 			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_base, bytes);
+						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (status == AOP_TRUNCATED_PAGE) {
@@ -2037,12 +2038,12 @@ generic_file_buffered_write(struct kiocb
 				buf += status;
 				if (unlikely(nr_segs > 1)) {
 					filemap_set_next_iovec(&cur_iov,
-							&iov_base, status);
+							&iov_offset, status);
 					if (count)
 						buf = cur_iov->iov_base +
-							iov_base;
+							iov_offset;
 				} else {
-					iov_base += status;
+					iov_offset += status;
 				}
 			}
 		}
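The newly commented loop variables can be tried out in userspace. The sketch below mirrors the arithmetic of the cleaned-up loop head: `bytes` is bounded by the page and the write, while `maxlen` (the prefault length) is additionally bounded by the current segment. `struct write_step`, `plan_step()` and the 12-bit page shift are assumptions for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed, illustration only */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Hypothetical bundle of the per-iteration values named in the patch. */
struct write_step {
	unsigned long index;	/* pagecache index for current page */
	unsigned long offset;	/* offset into pagecache page */
	size_t bytes;		/* bytes to write to page */
	size_t maxlen;		/* bytes to prefault from current segment */
};

static struct write_step plan_step(unsigned long long pos, size_t count,
				   size_t iov_len, size_t iov_offset)
{
	struct write_step s;

	assert(iov_offset <= iov_len);

	s.offset = (unsigned long)(pos & (SKETCH_PAGE_SIZE - 1));
	s.index = (unsigned long)(pos >> SKETCH_PAGE_SHIFT);
	s.bytes = SKETCH_PAGE_SIZE - s.offset;
	if (s.bytes > count)
		s.bytes = count;

	s.maxlen = iov_len - iov_offset;
	if (s.maxlen > s.bytes)
		s.maxlen = s.bytes;
	return s;
}
```

Note the difference from the per-segment clamp elsewhere in this series: here only the prefault length is limited to the current segment, so a single page-sized step may span several segments.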
[patch 06/44] mm: trim more holes
If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
we may have failed the write operation despite prepare_write having
instantiated blocks past i_size. Fix this, and consolidate the trimming into
one place.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   80 +--
 1 file changed, 40 insertions(+), 40 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2001,22 +2001,9 @@ generic_file_buffered_write(struct kiocb
 		}
 
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
-		if (unlikely(status)) {
-			loff_t isize = i_size_read(inode);
+		if (unlikely(status))
+			goto fs_write_aop_error;
 
-			if (status != AOP_TRUNCATED_PAGE)
-				unlock_page(page);
-			page_cache_release(page);
-			if (status == AOP_TRUNCATED_PAGE)
-				continue;
-			/*
-			 * prepare_write() may have instantiated a few blocks
-			 * outside i_size.  Trim these off again.
-			 */
-			if (pos + bytes > isize)
-				vmtruncate(inode, isize);
-			break;
-		}
 		if (likely(nr_segs == 1))
 			copied = filemap_copy_from_user(page, offset,
 							buf, bytes);
@@ -2025,40 +2012,53 @@ generic_file_buffered_write(struct kiocb
 						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
-		if (status == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
-			continue;
+		if (unlikely(status < 0))
+			goto fs_write_aop_error;
+		if (unlikely(copied != bytes)) {
+			status = -EFAULT;
+			goto fs_write_aop_error;
 		}
-		if (likely(copied > 0)) {
-			if (!status)
-				status = copied;
+		if (unlikely(status > 0)) /* filesystem did partial write */
+			copied = status;
 
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-				if (unlikely(nr_segs > 1)) {
-					filemap_set_next_iovec(&cur_iov,
-							&iov_offset, status);
-					if (count)
-						buf = cur_iov->iov_base +
-							iov_offset;
-				} else {
-					iov_offset += status;
-				}
+		if (likely(copied > 0)) {
+			written += copied;
+			count -= copied;
+			pos += copied;
+			buf += copied;
+			if (unlikely(nr_segs > 1)) {
+				filemap_set_next_iovec(&cur_iov,
+						&iov_offset, copied);
+				if (count)
+					buf = cur_iov->iov_base + iov_offset;
+			} else {
+				iov_offset += copied;
 			}
 		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);
-		if (status < 0)
-			break;
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
+		continue;
+
+fs_write_aop_error:
+		if (status != AOP_TRUNCATED_PAGE)
+			unlock_page(page);
+		page_cache_release(page);
+
+		/*
+		 * prepare_write() may have instantiated a few blocks
+		 * outside i_size.  Trim these off again. Don't need
+		 * i_size_read because we hold i_mutex.
+		 */
+		if (pos + bytes > inode->i_size)
[patch 03/44] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83
From: Andrew Morton [EMAIL PROTECTED]

This patch fixed the following bug:

  When prefaulting in the pages in generic_file_buffered_write(), we only
  faulted in the pages for the first segment of the iovec.  If the second or
  a successive segment described an mmap of the page into which we're
  write()ing, and that page is not up-to-date, the fault handler tries to lock
  the already-locked page (to bring it up to date) and deadlocks.

  An exploit for this bug is in writev-deadlock-demo.c, in
  http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

  (These demos assume blocksize < PAGE_CACHE_SIZE).

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment.  So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once. The other problem with the fix is that it does not fix all the
locking problems.

<insert numbers obtained via ext3-tools's writev-speed.c here>

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1971,21 +1971,14 @@ generic_file_buffered_write(struct kiocb
 	do {
 		unsigned long index;
 		unsigned long offset;
+		unsigned long maxlen;
 		size_t copied;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
-
-		/* Limit the size of the copy to the caller's write size */
-		bytes = min(bytes, count);
-
-		/*
-		 * Limit the size of the copy to that of the current segment,
-		 * because fault_in_pages_readable() doesn't know how to walk
-		 * segments.
-		 */
-		bytes = min(bytes, cur_iov->iov_len - iov_base);
+		if (bytes > count)
+			bytes = count;
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -1993,7 +1986,10 @@ generic_file_buffered_write(struct kiocb
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		fault_in_pages_readable(buf, bytes);
+		maxlen = cur_iov->iov_len - iov_base;
+		if (maxlen > bytes)
+			maxlen = bytes;
+		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
 		if (!page) {
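The "1024 times versus once" figure in the changelog can be reproduced with a toy model. The sketch below only counts loop iterations (one prepare_write/commit_write pair each) for a writev; it does no I/O. `count_chunks()` and the 4096-byte page size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size, illustration only */

/*
 * Count write-loop iterations for an iovec written at file offset pos.
 * With per_segment != 0 each step is also clamped to the current
 * segment (the behaviour being reverted); otherwise a step is limited
 * only by the page boundary and the bytes remaining.
 */
static unsigned long count_chunks(const struct iovec *iov,
				  unsigned long nr_segs,
				  unsigned long long pos, int per_segment)
{
	size_t total = 0, seg_off = 0;
	unsigned long seg = 0, chunks = 0, i;

	assert(iov != NULL);
	for (i = 0; i < nr_segs; i++)
		total += iov[i].iov_len;

	while (total) {
		size_t bytes = SKETCH_PAGE_SIZE -
			       (size_t)(pos & (SKETCH_PAGE_SIZE - 1));

		if (bytes > total)
			bytes = total;
		if (per_segment) {
			/* skip segments that are already used up */
			while (seg_off == iov[seg].iov_len) {
				seg++;
				seg_off = 0;
			}
			if (bytes > iov[seg].iov_len - seg_off)
				bytes = iov[seg].iov_len - seg_off;
			seg_off += bytes;
		}
		pos += bytes;
		total -= bytes;
		chunks++;
	}
	return chunks;
}
```

Feeding it 1024 segments of 4 bytes each, written at offset 0, gives 1024 iterations with the per-segment clamp and a single iteration without it, which is the worst case the changelog describes.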
[patch 07/44] mm: buffered write cleanup
Quite a bit of code is used in maintaining these cached pages that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page. Then a multi-page write into an uncached
part of the file, to make use of it.

Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).

[this gets rid of some cond_resched() calls in readahead.c and mpage.c due
 to clashes in -mm. What put them there, and why? ]

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/mpage.c     |   12
 mm/filemap.c   |  144 ++---
 mm/readahead.c |   28 +++
 3 files changed, 66 insertions(+), 118 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -689,26 +689,22 @@ EXPORT_SYMBOL(probe_page);
 struct page *find_or_create_page(struct address_space *mapping,
 		unsigned long index, gfp_t gfp_mask)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_lock_page(mapping, index);
 	if (!page) {
-		if (!cached_page) {
-			cached_page = alloc_page(gfp_mask);
-			if (!cached_page)
-				return NULL;
-		}
-		err = add_to_page_cache_lru(cached_page, mapping,
-					index, gfp_mask);
-		if (!err) {
-			page = cached_page;
-			cached_page = NULL;
-		} else if (err == -EEXIST)
-			goto repeat;
+		page = alloc_page(gfp_mask);
+		if (!page)
+			return NULL;
+		err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+		if (unlikely(err)) {
+			page_cache_release(page);
+			page = NULL;
+			if (err == -EEXIST)
+				goto repeat;
+		}
 	}
-	if (cached_page)
-		page_cache_release(cached_page);
 	return page;
 }
 EXPORT_SYMBOL(find_or_create_page);
@@ -903,11 +899,9 @@ void do_generic_mapping_read(struct addr
 	unsigned long next_index;
 	unsigned long prev_index;
 	loff_t isize;
-	struct page *cached_page;
 	int error;
 	struct file_ra_state ra = *_ra;
 
-	cached_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
 	prev_index = ra.prev_page;
@@ -1084,23 +1078,20 @@ no_cached_page:
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..
 		 */
-		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
-			if (!cached_page) {
-				desc->error = -ENOMEM;
-				goto out;
-			}
+		page = page_cache_alloc_cold(mapping);
+		if (!page) {
+			desc->error = -ENOMEM;
+			goto out;
 		}
-		error = add_to_page_cache_lru(cached_page, mapping,
+		error = add_to_page_cache_lru(page, mapping,
 						index, GFP_KERNEL);
 		if (error) {
+			page_cache_release(page);
 			if (error == -EEXIST)
 				goto find_page;
 			desc->error = error;
 			goto out;
 		}
-		page = cached_page;
-		cached_page = NULL;
 		goto readpage;
 	}
 
@@ -1110,8 +1101,6 @@ out:
 	_ra->prev_page = prev_index;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
-	if (cached_page)
-		page_cache_release(cached_page);
 	if (filp)
 		file_accessed(filp);
 }
@@ -1605,35 +1594,28 @@ static struct page *__read_cache_page(st
 				int (*filler)(void *,struct page*),
 				void *data)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
[patch 10/44] mm: buffered write iterator
Add an iterator data structure to operate over an iovec. Add usercopy
operators needed by generic_file_buffered_write, and convert that function
over.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 include/linux/fs.h |   33
 mm/filemap.c       |  144 +++--
 mm/filemap.h       |  103 -
 3 files changed, 150 insertions(+), 130 deletions(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -398,6 +398,39 @@ struct page;
 struct address_space;
 struct writeback_control;
 
+struct iov_iter {
+	const struct iovec *iov;
+	unsigned long nr_segs;
+	size_t iov_offset;
+	size_t count;
+};
+
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+		struct iov_iter *i, unsigned long offset, size_t bytes);
+size_t iov_iter_copy_from_user(struct page *page,
+		struct iov_iter *i, unsigned long offset, size_t bytes);
+void iov_iter_advance(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i);
+size_t iov_iter_single_seg_count(struct iov_iter *i);
+
+static inline void iov_iter_init(struct iov_iter *i,
+			const struct iovec *iov, unsigned long nr_segs,
+			size_t count, size_t written)
+{
+	i->iov = iov;
+	i->nr_segs = nr_segs;
+	i->iov_offset = 0;
+	i->count = count + written;
+
+	iov_iter_advance(i, written);
+}
+
+static inline size_t iov_iter_count(struct iov_iter *i)
+{
+	return i->count;
+}
+
+
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -30,7 +30,7 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/cpuset.h>
-#include "filemap.h"
+#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include "internal.h"
 
 /*
@@ -1740,8 +1740,7 @@ int remove_suid(struct dentry *dentry)
 }
 EXPORT_SYMBOL(remove_suid);
 
-size_t
-__filemap_copy_from_user_iovec_inatomic(char *vaddr,
+static size_t __iovec_copy_from_user_inatomic(char *vaddr,
 			const struct iovec *iov, size_t base, size_t bytes)
 {
 	size_t copied = 0, left = 0;
@@ -1764,6 +1763,110 @@ __filemap_copy_from_user_iovec_inatomic(
 }
 
 /*
+ * Copy as much as we can into the page and return the number of bytes which
+ * were sucessfully copied.  If a fault is encountered then return the number of
+ * bytes which were copied.
+ */
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+		struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+	char *kaddr;
+	size_t copied;
+
+	BUG_ON(!in_atomic());
+	kaddr = kmap_atomic(page, KM_USER0);
+	if (likely(i->nr_segs == 1)) {
+		int left;
+		char __user *buf = i->iov->iov_base + i->iov_offset;
+		left = __copy_from_user_inatomic_nocache(kaddr + offset,
+							buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+						i->iov, i->iov_offset, bytes);
+	}
+	kunmap_atomic(kaddr, KM_USER0);
+
+	return copied;
+}
+
+/*
+ * This has the same sideeffects and return value as
+ * iov_iter_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
+ * Page must not be locked.
+ */
+size_t iov_iter_copy_from_user(struct page *page,
+		struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+	char *kaddr;
+	size_t copied;
+
+	kaddr = kmap(page);
+	if (likely(i->nr_segs == 1)) {
+		int left;
+		char __user *buf = i->iov->iov_base + i->iov_offset;
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+						i->iov, i->iov_offset, bytes);
+	}
+	kunmap(page);
+	return copied;
+}
+
+static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
+{
+	if (likely(i->nr_segs == 1)) {
+		i->iov_offset += bytes;
+	} else {
+		const struct iovec *iov = i->iov;
+		size_t base = i->iov_offset;
+
+		while (bytes) {
+			int copy = min(bytes, iov->iov_len - base);
+
+			bytes -= copy;
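Since the diff is cut off in this archive, here is a hedged userspace analogue of the iterator for experimentation. It follows the struct layout and the iov_iter_init() logic visible in the fs.h hunk, but the advance and single-segment-count bodies are my own simplified guesses, not the patch's code. Everything prefixed `sketch_` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Userspace stand-in with the same fields as the patch's struct iov_iter. */
struct sketch_iter {
	const struct iovec *iov;	/* current segment */
	unsigned long nr_segs;		/* segment count as given at init */
	size_t iov_offset;		/* offset into the current segment */
	size_t count;			/* bytes still to be consumed */
};

/* Consume `bytes`, stepping across segment boundaries (simplified guess). */
static void sketch_iter_advance(struct sketch_iter *i, size_t bytes)
{
	assert(bytes <= i->count);
	i->count -= bytes;

	while (bytes) {
		size_t left = i->iov->iov_len - i->iov_offset;
		size_t step = bytes < left ? bytes : left;

		bytes -= step;
		i->iov_offset += step;
		if (i->iov_offset == i->iov->iov_len) {
			i->iov++;
			i->iov_offset = 0;
		}
	}
}

/* Same shape as iov_iter_init() in the hunk: start at 0, skip `written`. */
static void sketch_iter_init(struct sketch_iter *i, const struct iovec *iov,
			     unsigned long nr_segs, size_t count,
			     size_t written)
{
	i->iov = iov;
	i->nr_segs = nr_segs;
	i->iov_offset = 0;
	i->count = count + written;

	sketch_iter_advance(i, written);
}

/* Bytes available without leaving the current segment (assumed semantics). */
static size_t sketch_iter_single_seg_count(const struct sketch_iter *i)
{
	size_t left;

	if (i->count == 0)
		return 0;	/* iov may point one past the array here */
	left = i->iov->iov_len - i->iov_offset;
	return left < i->count ? left : i->count;
}
```

The point of the design is that callers no longer carry `cur_iov`, `iov_offset` and `count` as three loose variables; one object is initialised once and advanced by however many bytes were actually copied.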
[patch 18/44] ext3 convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Various fixes and improvements

Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

 fs/ext3/inode.c |  136
 1 file changed, 88 insertions(+), 48 deletions(-)

Index: linux-2.6/fs/ext3/inode.c
===================================================================
--- linux-2.6.orig/fs/ext3/inode.c
+++ linux-2.6/fs/ext3/inode.c
@@ -1147,51 +1147,68 @@ static int do_journal_get_write_access(h
 	return ext3_journal_get_write_access(handle, bh);
 }
 
-static int ext3_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
+static int ext3_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
 	handle_t *handle;
 	int retries = 0;
+	struct page *page;
+	pgoff_t index;
+	unsigned from, to;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
 
 retry:
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
+
 	handle = ext3_journal_start(inode, needed_blocks);
 	if (IS_ERR(handle)) {
+		unlock_page(page);
+		page_cache_release(page);
 		ret = PTR_ERR(handle);
 		goto out;
 	}
-	if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
-		ret = nobh_prepare_write(page, from, to, ext3_get_block);
-	else
-		ret = block_prepare_write(page, from, to, ext3_get_block);
+	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+							ext3_get_block);
 	if (ret)
-		goto prepare_write_failed;
+		goto write_begin_failed;
 
 	if (ext3_should_journal_data(inode)) {
 		ret = walk_page_buffers(handle, page_buffers(page),
 				from, to, NULL, do_journal_get_write_access);
 	}
-prepare_write_failed:
-	if (ret)
+write_begin_failed:
+	if (ret) {
 		ext3_journal_stop(handle);
+		unlock_page(page);
+		page_cache_release(page);
+	}
 	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 out:
 	return ret;
 }
 
+
 int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
 {
 	int err = journal_dirty_data(handle, bh);
 	if (err)
 		ext3_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-						bh, handle,err);
+						bh, handle, err);
 	return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
 	if (!buffer_mapped(bh) || buffer_freed(bh))
 		return 0;
@@ -1206,78 +1223,100 @@ static int commit_write_fn(handle_t *han
  * ext3 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext3_ordered_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int ext3_ordered_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file->f_mapping->host;
+	unsigned from, to;
 	int ret = 0, ret2;
 
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
+
 	ret = walk_page_buffers(handle, page_buffers(page),
 		from, to, NULL, ext3_journal_dirty_data);
 
 	if (ret == 0) {
 		/*
-		 * generic_commit_write() will run mark_inode_dirty() if i_size
+		 * generic_write_end() will run mark_inode_dirty() if i_size
 		 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 		 * into that.
 		 */
 		loff_t new_i_size;
 
-		new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+		new_i_size = pos + copied;
 		if (new_i_size > EXT3_I(inode)->i_disksize)
[patch 19/44] ext4 convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org

Convert ext4 to use write_begin()/write_end() methods.

Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

 fs/ext4/inode.c |  147 +++-
 1 file changed, 93 insertions(+), 54 deletions(-)

Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -1146,34 +1146,50 @@ static int do_journal_get_write_access(h
 	return ext4_journal_get_write_access(handle, bh);
 }
 
-static int ext4_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
+static int ext4_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
 	handle_t *handle;
 	int retries = 0;
+	struct page *page;
+	pgoff_t index;
+	unsigned from, to;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
 
 retry:
-	handle = ext4_journal_start(inode, needed_blocks);
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto out;
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
+
+	handle = ext4_journal_start(inode, needed_blocks);
+	if (IS_ERR(handle)) {
+		unlock_page(page);
+		page_cache_release(page);
+		ret = PTR_ERR(handle);
+		goto out;
 	}
-	if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
-		ret = nobh_prepare_write(page, from, to, ext4_get_block);
-	else
-		ret = block_prepare_write(page, from, to, ext4_get_block);
-	if (ret)
-		goto prepare_write_failed;
 
-	if (ext4_should_journal_data(inode)) {
+	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+							ext4_get_block);
+
+	if (!ret && ext4_should_journal_data(inode)) {
 		ret = walk_page_buffers(handle, page_buffers(page),
 				from, to, NULL, do_journal_get_write_access);
 	}
-prepare_write_failed:
-	if (ret)
+
+	if (ret) {
 		ext4_journal_stop(handle);
+		unlock_page(page);
+		page_cache_release(page);
+	}
+
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 out:
@@ -1185,12 +1201,12 @@ int ext4_journal_dirty_data(handle_t *ha
 	int err = jbd2_journal_dirty_data(handle, bh);
 	if (err)
 		ext4_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-						bh, handle,err);
+						bh, handle, err);
 	return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
 	if (!buffer_mapped(bh) || buffer_freed(bh))
 		return 0;
@@ -1205,78 +1221,100 @@ static int commit_write_fn(handle_t *han
  * ext4 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext4_ordered_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int ext4_ordered_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file->f_mapping->host;
+	unsigned from, to;
 	int ret = 0, ret2;
 
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
+
 	ret = walk_page_buffers(handle, page_buffers(page),
 		from, to, NULL, ext4_journal_dirty_data);
 
 	if (ret == 0) {
 		/*
-		 * generic_commit_write() will run mark_inode_dirty() if i_size
+		 * generic_write_end() will run mark_inode_dirty() if i_size
 		 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 		 * into that.
 		 */
 		loff_t new_i_size;
 
-		new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
[patch 17/44] ext2 convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/ext2/dir.c   |   47 +--
 fs/ext2/ext2.h  |    3 +++
 fs/ext2/inode.c |   24 +---
 3 files changed, 45 insertions(+), 29 deletions(-)

Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -726,18 +726,21 @@ ext2_readpages(struct file *file, struct
 	return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
 }
 
-static int
-ext2_prepare_write(struct file *file, struct page *page,
-			unsigned from, unsigned to)
+int __ext2_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page,from,to,ext2_get_block);
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+							ext2_get_block);
 }
 
 static int
-ext2_nobh_prepare_write(struct file *file, struct page *page,
-			unsigned from, unsigned to)
+ext2_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
 {
-	return nobh_prepare_write(page,from,to,ext2_get_block);
+	*pagep = NULL;
+	return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
 }
 
 static int ext2_nobh_writepage(struct page *page,
@@ -773,8 +776,8 @@ const struct address_space_operations ex
 	.readpages		= ext2_readpages,
 	.writepage		= ext2_writepage,
 	.sync_page		= block_sync_page,
-	.prepare_write		= ext2_prepare_write,
-	.commit_write		= generic_commit_write,
+	.write_begin		= ext2_write_begin,
+	.write_end		= generic_write_end,
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
@@ -791,8 +794,7 @@ const struct address_space_operations ex
 	.readpages		= ext2_readpages,
 	.writepage		= ext2_nobh_writepage,
 	.sync_page		= block_sync_page,
-	.prepare_write		= ext2_nobh_prepare_write,
-	.commit_write		= nobh_commit_write,
+	/* XXX: todo */
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
Index: linux-2.6/fs/ext2/dir.c
===================================================================
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -22,6 +22,7 @@
  */
 
 #include "ext2.h"
+#include <linux/buffer_head.h>
 #include <linux/pagemap.h>
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
@@ -61,12 +62,14 @@ ext2_last_byte(struct inode *inode, unsi
 	return last_byte;
 }
 
-static int ext2_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-	struct inode *dir = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
 	int err = 0;
+
 	dir->i_version++;
-	page->mapping->a_ops->commit_write(NULL, page, from, to);
+	block_write_end(NULL, mapping, pos, len, len, page, NULL);
 	if (IS_DIRSYNC(dir))
 		err = write_one_page(page, 1);
 	else
@@ -412,16 +415,18 @@ ino_t ext2_inode_by_name(struct inode *
 void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
 			struct page *page, struct inode *inode)
 {
-	unsigned from = (char *) de - (char *) page_address(page);
-	unsigned to = from + le16_to_cpu(de->rec_len);
+	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+			(char *) de - (char *) page_address(page);
+	unsigned len = le16_to_cpu(de->rec_len);
 	int err;
 
 	lock_page(page);
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __ext2_write_begin(NULL, page->mapping, pos, len,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	BUG_ON(err);
 	de->inode = cpu_to_le32(inode->i_ino);
-	ext2_set_de_type (de, inode);
-	err = ext2_commit_chunk(page, from, to);
+	ext2_set_de_type(de, inode);
+	err = ext2_commit_chunk(page, pos, len);
 	ext2_put_page(page);
 	dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
 	EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
@@ -444,7 +449,7 @@ int ext2_add_link (struct dentry *dentry
 	unsigned long npages = dir_pages(dir);
 	unsigned long n;
 	char *kaddr;
[patch 16/44] rd convert to new aops
Also clean up various little things. I've got rid of the comment from akpm, because now that make_page_uptodate is only called from 2 places, it is pretty easy to see that the buffers are in an uptodate state at the time of the call. Actually, it was OK before my patch as well, because the memset is equivalent to reading from disk of course... however it is more explicit where the updates come from now. Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] drivers/block/rd.c | 125 ++--- 1 file changed, 73 insertions(+), 52 deletions(-) Index: linux-2.6/drivers/block/rd.c === --- linux-2.6.orig/drivers/block/rd.c +++ linux-2.6/drivers/block/rd.c @@ -104,50 +104,60 @@ static void make_page_uptodate(struct pa struct buffer_head *head = bh; do { - if (!buffer_uptodate(bh)) { - memset(bh-b_data, 0, bh-b_size); - /* -* akpm: I'm totally undecided about this. The -* buffer has just been magically brought up to -* date, but nobody should want to be reading -* it anyway, because it hasn't been used for -* anything yet. It is still in a not read -* from disk yet state. -* -* But non-uptodate buffers against an uptodate -* page are against the rules. So do it anyway. 
-			 */
+		if (!buffer_uptodate(bh))
 			set_buffer_uptodate(bh);
-		}
 	} while ((bh = bh->b_this_page) != head);
-	} else {
-		memset(page_address(page), 0, PAGE_CACHE_SIZE);
 	}
-	flush_dcache_page(page);
 	SetPageUptodate(page);
 }
 
 static int ramdisk_readpage(struct file *file, struct page *page)
 {
-	if (!PageUptodate(page))
+	if (!PageUptodate(page)) {
+		memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
 		make_page_uptodate(page);
+	}
 	unlock_page(page);
 	return 0;
 }
 
-static int ramdisk_prepare_write(struct file *file, struct page *page,
-				unsigned offset, unsigned to)
-{
-	if (!PageUptodate(page))
-		make_page_uptodate(page);
+static int ramdisk_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	struct page *page;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
 	return 0;
 }
 
-static int ramdisk_commit_write(struct file *file, struct page *page,
-				unsigned offset, unsigned to)
-{
+static int ramdisk_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	if (!PageUptodate(page)) {
+		if (copied != PAGE_CACHE_SIZE) {
+			void *dst;
+			unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+			unsigned to = from + copied;
+
+			dst = kmap_atomic(page, KM_USER0);
+			memset(dst, 0, from);
+			memset(dst + to, 0, PAGE_CACHE_SIZE - to);
+			flush_dcache_page(page);
+			kunmap_atomic(dst, KM_USER0);
+		}
+		make_page_uptodate(page);
+	}
+
 	set_page_dirty(page);
-	return 0;
+	unlock_page(page);
+	page_cache_release(page);
+	return copied;
 }
 
 /*
@@ -191,8 +201,8 @@ static int ramdisk_set_page_dirty(struct
 
 static const struct address_space_operations ramdisk_aops = {
 	.readpage	= ramdisk_readpage,
-	.prepare_write	= ramdisk_prepare_write,
-	.commit_write	= ramdisk_commit_write,
+	.write_begin	= ramdisk_write_begin,
+	.write_end	= ramdisk_write_end,
 	.writepage	= ramdisk_writepage,
 	.set_page_dirty	= ramdisk_set_page_dirty,
 	.writepages	= ramdisk_writepages,
@@ -201,13 +211,14 @@ static const struct address_space_operat
 static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
 				struct address_space *mapping)
 {
-	pgoff_t index = sector >> (PAGE_CACHE_SHIFT - 9);
+	loff_t pos = sector << 9;
 	unsigned int vec_offset =
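
For readers following the ramdisk_write_end() hunk above, here is a minimal userspace sketch of its zeroing step, not kernel code. It assumes a 4 KiB page; MODEL_PAGE_SIZE and model_zero_outside_copy() are hypothetical names introduced only for this illustration. The idea is that when a page is not yet uptodate and the copy covered only part of it, everything outside the copied range is cleared before the page is marked uptodate.

```c
#include <assert.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u /* stand-in for PAGE_CACHE_SIZE; assumed 4 KiB */

/*
 * Userspace model of the zeroing done in ramdisk_write_end().
 * page_buf stands in for the kmap'ed page (MODEL_PAGE_SIZE bytes),
 * pos is the file position of the write, copied the bytes actually copied.
 * Bytes outside [from, from + copied) are cleared; the copied bytes are kept.
 */
void model_zero_outside_copy(unsigned char *page_buf,
			     unsigned long long pos, unsigned copied)
{
	/* offset of the write within its page */
	unsigned from = (unsigned)(pos & (MODEL_PAGE_SIZE - 1));
	unsigned to = from + copied;

	/* one write_begin/write_end pair never spans a page boundary */
	assert(to <= MODEL_PAGE_SIZE);

	memset(page_buf, 0, from);
	memset(page_buf + to, 0, MODEL_PAGE_SIZE - to);
}
```

Calling it with pos = 3 * 4096 + 100 and copied = 50 on a buffer filled with 0xAA leaves bytes 100..149 untouched and zeroes the rest.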
[patch 20/44] xfs convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/xfs/linux-2.6/xfs_aops.c |   19 ---
 fs/xfs/linux-2.6/xfs_lrw.c  |   35 ---
 2 files changed, 24 insertions(+), 30 deletions(-)

Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
@@ -1414,13 +1414,18 @@ xfs_vm_direct_IO(
 }
 
 STATIC int
-xfs_vm_prepare_write(
+xfs_vm_write_begin(
 	struct file		*file,
-	struct page		*page,
-	unsigned int		from,
-	unsigned int		to)
+	struct address_space	*mapping,
+	loff_t			pos,
+	unsigned		len,
+	unsigned		flags,
+	struct page		**pagep,
+	void			**fsdata)
 {
-	return block_prepare_write(page, from, to, xfs_get_blocks);
+	*pagep = NULL;
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+								xfs_get_blocks);
 }
 
 STATIC sector_t
@@ -1474,8 +1479,8 @@ const struct address_space_operations xf
 	.sync_page		= block_sync_page,
 	.releasepage		= xfs_vm_releasepage,
 	.invalidatepage		= xfs_vm_invalidatepage,
-	.prepare_write		= xfs_vm_prepare_write,
-	.commit_write		= generic_commit_write,
+	.write_begin		= xfs_vm_write_begin,
+	.write_end		= generic_write_end,
 	.bmap			= xfs_vm_bmap,
 	.direct_IO		= xfs_vm_direct_IO,
 	.migratepage		= buffer_migrate_page,
Index: linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_lrw.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
@@ -134,45 +134,34 @@ xfs_iozero(
 	loff_t			pos,	/* offset in file		*/
 	size_t			count)	/* size of data to zero		*/
 {
-	unsigned		bytes;
 	struct page		*page;
 	struct address_space	*mapping;
 	int			status;
 
 	mapping = ip->i_mapping;
 	do {
-		unsigned long index, offset;
+		unsigned offset, bytes;
+		void *fsdata;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
 
-		status = -ENOMEM;
-		page = grab_cache_page(mapping, index);
-		if (!page)
-			break;
-
-		status = mapping->a_ops->prepare_write(NULL, page, offset,
-							offset + bytes);
+		status = pagecache_write_begin(NULL, mapping, pos, bytes,
+					AOP_FLAG_UNINTERRUPTIBLE,
+					&page, &fsdata);
 		if (status)
-			goto unlock;
+			break;
 
 		memclear_highpage_flush(page, offset, bytes);
 
-		status = mapping->a_ops->commit_write(NULL, page, offset,
-							offset + bytes);
-		if (!status) {
-			pos += bytes;
-			count -= bytes;
-		}
-
-unlock:
-		unlock_page(page);
-		page_cache_release(page);
-		if (status)
-			break;
+		status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
+					page, fsdata);
+		WARN_ON(status <= 0); /* can't return less than zero! */
+		pos += bytes;
+		count -= bytes;
+		status = 0;
 	} while (count);
 
 	return (-status);

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
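
The xfs_iozero() loop in this patch walks a byte range one page at a time, now keyed on a file position rather than a page index. The following is a rough userspace sketch of that chunking arithmetic only, assuming a 4 KiB page; MODEL_PAGE_SIZE, model_chunk_bytes() and model_iozero_iterations() are hypothetical names for this illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u /* stand-in for PAGE_CACHE_SIZE; assumed 4 KiB */

/*
 * Bytes handled by one loop iteration that starts at file position pos
 * with count bytes still to process: up to the end of the current page,
 * clamped to what is left.
 */
size_t model_chunk_bytes(unsigned long long pos, size_t count)
{
	size_t offset = (size_t)(pos & (MODEL_PAGE_SIZE - 1)); /* within page */
	size_t bytes = MODEL_PAGE_SIZE - offset;

	if (bytes > count)
		bytes = count;
	return bytes;
}

/*
 * Number of write_begin/write_end pairs the loop would issue for the
 * range [pos, pos + count), mirroring the do { } while (count) shape.
 */
unsigned model_iozero_iterations(unsigned long long pos, size_t count)
{
	unsigned iterations = 0;

	assert(count > 0); /* this model does not cover an empty range */
	do {
		size_t bytes = model_chunk_bytes(pos, count);

		pos += bytes;
		count -= bytes;
		iterations++;
	} while (count);
	return iterations;
}
```

For example, a 5000-byte range starting at position 4000 is split into chunks of 96, 4096 and 808 bytes, so three iterations.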
[patch 13/44] mm: restore KERNEL_DS optimisations
Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
path.

This may be a pretty questionable gain in most cases, especially after the
legacy 2copy write path is removed, but it doesn't cost much.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   11 +-
 1 file changed, 9 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2157,7 +2157,7 @@ static ssize_t generic_perform_write_2co
 		 * cannot take a pagefault with the destination page locked.
 		 * So pin the source page to copy it.
 		 */
-		if (!PageUptodate(page)) {
+		if (!PageUptodate(page) && !segment_eq(get_fs(), KERNEL_DS)) {
 			unlock_page(page);
 
 			src_page = alloc_page(GFP_KERNEL);
@@ -2282,6 +2282,13 @@ static ssize_t generic_perform_write(str
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	long status = 0;
 	ssize_t written = 0;
+	unsigned int flags = 0;
+
+	/*
+	 * Copies from kernel address space cannot fail (NFSD is a big user).
+	 */
+	if (segment_eq(get_fs(), KERNEL_DS))
+		flags |= AOP_FLAG_UNINTERRUPTIBLE;
 
 	do {
 		struct page *page;
@@ -2313,7 +2320,7 @@ again:
 			break;
 		}
 
-		status = a_ops->write_begin(file, mapping, pos, bytes, 0,
+		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
 		if (unlikely(status))
 			break;
[patch 22/44] fat convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/fat/inode.c |   27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c
+++ linux-2.6/fs/fat/inode.c
@@ -140,19 +140,24 @@ static int fat_readpages(struct file *fi
 	return mpage_readpages(mapping, pages, nr_pages, fat_get_block);
 }
 
-static int fat_prepare_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int fat_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return cont_prepare_write(page, from, to, fat_get_block,
-				  &MSDOS_I(page->mapping->host)->mmu_private);
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				fat_get_block,
+				&MSDOS_I(mapping->host)->mmu_private);
 }
 
-static int fat_commit_write(struct file *file, struct page *page,
-			    unsigned from, unsigned to)
+static int fat_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *pagep, void *fsdata)
 {
-	struct inode *inode = page->mapping->host;
-	int err = generic_commit_write(file, page, from, to);
-	if (!err && !(MSDOS_I(inode)->i_attrs & ATTR_ARCH)) {
+	struct inode *inode = mapping->host;
+	int err;
+	err = generic_write_end(file, mapping, pos, len, copied, pagep, fsdata);
+	if (!(err < 0) && !(MSDOS_I(inode)->i_attrs & ATTR_ARCH)) {
 		inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
 		MSDOS_I(inode)->i_attrs |= ATTR_ARCH;
 		mark_inode_dirty(inode);
@@ -201,8 +206,8 @@ static const struct address_space_operat
 	.writepage	= fat_writepage,
 	.writepages	= fat_writepages,
 	.sync_page	= block_sync_page,
-	.prepare_write	= fat_prepare_write,
-	.commit_write	= fat_commit_write,
+	.write_begin	= fat_write_begin,
+	.write_end	= fat_write_end,
 	.direct_IO	= fat_direct_IO,
 	.bmap		= _fat_bmap
 };
[patch 28/44] bfs convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/bfs/file.c |   12
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/bfs/file.c
===================================================================
--- linux-2.6.orig/fs/bfs/file.c
+++ linux-2.6/fs/bfs/file.c
@@ -145,9 +145,13 @@ static int bfs_readpage(struct file *fil
 	return block_read_full_page(page, bfs_get_block);
 }
 
-static int bfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int bfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page, from, to, bfs_get_block);
+	*pagep = NULL;
+	return block_write_begin(file, mapping, pos, len, flags,
+				pagep, fsdata, bfs_get_block);
 }
 
 static sector_t bfs_bmap(struct address_space *mapping, sector_t block)
@@ -159,8 +163,8 @@ const struct address_space_operations bf
 	.readpage	= bfs_readpage,
 	.writepage	= bfs_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= bfs_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= bfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= bfs_bmap,
 };
 
[patch 26/44] hfsplus convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hfsplus/extents.c |   21 +
 fs/hfsplus/inode.c   |   20
 2 files changed, 21 insertions(+), 20 deletions(-)

Index: linux-2.6/fs/hfsplus/inode.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/inode.c
+++ linux-2.6/fs/hfsplus/inode.c
@@ -26,10 +26,14 @@ static int hfsplus_writepage(struct page
 	return block_write_full_page(page, hfsplus_get_block, wbc);
 }
 
-static int hfsplus_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	return cont_prepare_write(page, from, to, hfsplus_get_block,
-		&HFSPLUS_I(page->mapping->host).phys_size);
+static int hfsplus_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				hfsplus_get_block,
+				&HFSPLUS_I(mapping->host).phys_size);
 }
 
 static sector_t hfsplus_bmap(struct address_space *mapping, sector_t block)
@@ -113,8 +117,8 @@ const struct address_space_operations hf
 	.readpage	= hfsplus_readpage,
 	.writepage	= hfsplus_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= hfsplus_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= hfsplus_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= hfsplus_bmap,
 	.releasepage	= hfsplus_releasepage,
 };
@@ -123,8 +127,8 @@ const struct address_space_operations hf
 	.readpage	= hfsplus_readpage,
 	.writepage	= hfsplus_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= hfsplus_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= hfsplus_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= hfsplus_bmap,
 	.direct_IO	= hfsplus_direct_IO,
 	.writepages	= hfsplus_writepages,
Index: linux-2.6/fs/hfsplus/extents.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/extents.c
+++ linux-2.6/fs/hfsplus/extents.c
@@ -443,21 +443,18 @@ void hfsplus_file_truncate(struct inode 
 	if (inode->i_size > HFSPLUS_I(inode).phys_size) {
 		struct address_space *mapping = inode->i_mapping;
 		struct page *page;
-		u32 size = inode->i_size - 1;
+		void *fsdata;
+		u32 size = inode->i_size;
 		int res;
 
-		page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-		if (!page)
-			return;
-		size &= PAGE_CACHE_SIZE - 1;
-		size++;
-		res = mapping->a_ops->prepare_write(NULL, page, size, size);
-		if (!res)
-			res = mapping->a_ops->commit_write(NULL, page, size, size);
+		res = pagecache_write_begin(NULL, mapping, size, 0,
+						AOP_FLAG_UNINTERRUPTIBLE,
+						&page, &fsdata);
 		if (res)
-			inode->i_size = HFSPLUS_I(inode).phys_size;
-		unlock_page(page);
-		page_cache_release(page);
+			return;
+		res = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
+		if (res < 0)
+			return;
 		mark_inode_dirty(inode);
 		return;
 	} else if (inode->i_size == HFSPLUS_I(inode).phys_size)
[patch 29/44] qnx4 convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/qnx4/inode.c |   21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/qnx4/inode.c
===================================================================
--- linux-2.6.orig/fs/qnx4/inode.c
+++ linux-2.6/fs/qnx4/inode.c
@@ -433,16 +433,21 @@ static int qnx4_writepage(struct page *p
 {
 	return block_write_full_page(page,qnx4_get_block, wbc);
 }
+
 static int qnx4_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page,qnx4_get_block);
 }
-static int qnx4_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
-{
-	struct qnx4_inode_info *qnx4_inode = qnx4_i(page->mapping->host);
-	return cont_prepare_write(page, from, to, qnx4_get_block,
-				  &qnx4_inode->mmu_private);
+
+static int qnx4_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	struct qnx4_inode_info *qnx4_inode = qnx4_i(mapping->host);
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				qnx4_get_block,
+				&qnx4_inode->mmu_private);
 }
 static sector_t qnx4_bmap(struct address_space *mapping, sector_t block)
 {
@@ -452,8 +457,8 @@ static const struct address_space_operat
 	.readpage	= qnx4_readpage,
 	.writepage	= qnx4_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= qnx4_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= qnx4_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= qnx4_bmap
 };
 
[patch 37/44] hostfs convert to new aops
This also gets rid of a lot of useless read_file stuff. And also
optimises the full page write case by marking a !uptodate page uptodate.

Cc: Jeff Dike [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hostfs/hostfs_kern.c |   70 +++-
 1 file changed, 28 insertions(+), 42 deletions(-)

Index: linux-2.6/fs/hostfs/hostfs_kern.c
===================================================================
--- linux-2.6.orig/fs/hostfs/hostfs_kern.c
+++ linux-2.6/fs/hostfs/hostfs_kern.c
@@ -461,56 +461,42 @@ int hostfs_readpage(struct file *file, s
 	return(err);
 }
 
-int hostfs_prepare_write(struct file *file, struct page *page,
-			 unsigned int from, unsigned int to)
+int hostfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	char *buffer;
-	long long start, tmp;
-	int err;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 
-	start = (long long) page->index << PAGE_CACHE_SHIFT;
-	buffer = kmap(page);
-	if(from != 0){
-		tmp = start;
-		err = read_file(FILE_HOSTFS_I(file)->fd, &tmp, buffer,
-				from);
-		if(err < 0) goto out;
-	}
-	if(to != PAGE_CACHE_SIZE){
-		start += to;
-		err = read_file(FILE_HOSTFS_I(file)->fd, &start, buffer + to,
-				PAGE_CACHE_SIZE - to);
-		if(err < 0) goto out;
-	}
-	err = 0;
- out:
-	kunmap(page);
-	return(err);
+	*pagep = __grab_cache_page(mapping, index);
+	if (!*pagep)
+		return -ENOMEM;
+	return 0;
 }
 
-int hostfs_commit_write(struct file *file, struct page *page, unsigned from,
-		 unsigned to)
+int hostfs_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	struct address_space *mapping = page->mapping;
 	struct inode *inode = mapping->host;
-	char *buffer;
-	long long start;
-	int err = 0;
+	void *buffer;
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+	int err;
 
-	start = (((long long) page->index) << PAGE_CACHE_SHIFT) + from;
 	buffer = kmap(page);
-	err = write_file(FILE_HOSTFS_I(file)->fd, &start, buffer + from,
-			 to - from);
-	if(err > 0) err = 0;
-
-	/* Actually, if !err, write_file has added to-from to start, so, despite
-	 * the appearance, we are comparing i_size against the _last_ written
-	 * location, as we should. */
+	err = write_file(FILE_HOSTFS_I(file)->fd, &pos, buffer + from, copied);
+	kunmap(page);
+
+	if (!PageUptodate(page) && err == PAGE_CACHE_SIZE)
+		SetPageUptodate(page);
+	unlock_page(page);
+	page_cache_release(page);
 
-	if(!err && (start > inode->i_size))
-		inode->i_size = start;
+	/* If err > 0, write_file has added err to pos, so we are comparing
+	 * i_size against the last byte written.
+	 */
+	if (err > 0 && (pos > inode->i_size))
+		inode->i_size = pos;
 
-	kunmap(page);
 	return(err);
 }
 
@@ -518,8 +504,8 @@ static const struct address_space_operat
 	.writepage 	= hostfs_writepage,
 	.readpage	= hostfs_readpage,
 	.set_page_dirty = __set_page_dirty_nobuffers,
-	.prepare_write	= hostfs_prepare_write,
-	.commit_write	= hostfs_commit_write
+	.write_begin	= hostfs_write_begin,
+	.write_end	= hostfs_write_end,
 };
 
 static int init_inode(struct inode *inode, struct dentry *dentry)
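
The bookkeeping at the end of hostfs_write_end() can be modelled in userspace as below. This is only a sketch of the two decisions the patch describes (marking a fully written page uptodate, and growing i_size to the end of the written data), assuming a 4 KiB page; MODEL_PAGE_SIZE, struct model_write_result and model_hostfs_write_end() are hypothetical names invented for this illustration.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096 /* stand-in for PAGE_CACHE_SIZE; assumed 4 KiB */

/* State after the write has been accounted for. */
struct model_write_result {
	long long i_size;  /* file size after the write */
	int page_uptodate; /* non-zero if the whole page holds valid data */
};

/*
 * pos:           file position where the write started
 * written:       return value of the host write (negative on error)
 * i_size:        file size before the write
 * page_uptodate: whether the page was already uptodate
 */
struct model_write_result model_hostfs_write_end(long long pos, int written,
						 long long i_size,
						 int page_uptodate)
{
	struct model_write_result r = { i_size, page_uptodate };

	assert(written <= MODEL_PAGE_SIZE); /* at most one page per call */

	/* A full-page write makes a !uptodate page uptodate. */
	if (!page_uptodate && written == MODEL_PAGE_SIZE)
		r.page_uptodate = 1;

	/* On success the position has advanced by the bytes written. */
	if (written > 0) {
		pos += written;
		if (pos > i_size)
			r.i_size = pos;
	}
	return r;
}
```

A 50-byte write at position 100 into a 120-byte file therefore yields a size of 150, while a failed write leaves both fields unchanged.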
[patch 39/44] cifs convert to new aops
Convert to new aops, and fix security hole where page is set uptodate
before contents are uptodate.

Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/cifs/file.c |   89 -
 1 file changed, 51 insertions(+), 38 deletions(-)

Index: linux-2.6/fs/cifs/file.c
===================================================================
--- linux-2.6.orig/fs/cifs/file.c
+++ linux-2.6/fs/cifs/file.c
@@ -103,7 +103,7 @@ static inline int cifs_open_inode_helper
 
 	/* want handles we can use to read with first
 	   in the list so we do not have to walk the
-	   list to search for one in prepare_write */
+	   list to search for one in write_begin */
 	if ((file->f_flags & O_ACCMODE) == O_WRONLY) {
 		list_add_tail(&pCifsFile->flist,
 			      &pCifsInode->openFileList);
@@ -1358,40 +1358,37 @@ static int cifs_writepage(struct page* p
 	return rc;
 }
 
-static int cifs_commit_write(struct file *file, struct page *page,
-	unsigned offset, unsigned to)
+static int cifs_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
 	int xid;
 	int rc = 0;
-	struct inode *inode = page->mapping->host;
-	loff_t position = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	struct inode *inode = mapping->host;
+	loff_t position = pos + copied;
 	char *page_data;
 
 	xid = GetXid();
-	cFYI(1, ("commit write for page %p up to position %lld for %d",
-		 page, position, to));
+	cFYI(1, ("write end for page %p at pos %lld, copied %d",
+		 page, pos, copied));
 	spin_lock(&inode->i_lock);
 	if (position > inode->i_size) {
 		i_size_write(inode, position);
 	}
 	spin_unlock(&inode->i_lock);
+	if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
+		SetPageUptodate(page);
+
 	if (!PageUptodate(page)) {
-		position = ((loff_t)page->index << PAGE_CACHE_SHIFT) + offset;
-		/* can not rely on (or let) writepage write this data */
-		if (to < offset) {
-			cFYI(1, ("Illegal offsets, can not copy from %d to %d",
-				offset, to));
-			FreeXid(xid);
-			return rc;
-		}
+		unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);
+
 		/* this is probably better than directly calling
 		   partialpage_write since in this function the file handle is
 		   known which we might as well leverage */
 		/* BB check if anything else missing out of ppw
 		   such as updating last write time */
 		page_data = kmap(page);
-		rc = cifs_write(file, page_data + offset, to-offset,
-				&position);
+		rc = cifs_write(file, page_data + offset, copied, &pos);
 		if (rc > 0)
 			rc = 0;
 		/* else if (rc < 0) should we set writebehind rc? */
@@ -1399,9 +1396,12 @@ static int cifs_commit_write(struct file
 	} else {
 		set_page_dirty(page);
 	}
-
 	FreeXid(xid);
-	return rc;
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return rc < 0 ? rc : copied;
 }
 
 int cifs_fsync(struct file *file, struct dentry *dentry, int datasync)
@@ -1928,34 +1928,47 @@ int is_size_safe_to_change(struct cifsIn
 	return 1;
 }
 
-static int cifs_prepare_write(struct file *file, struct page *page,
-	unsigned from, unsigned to)
+static int cifs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
 	int rc = 0;
 	loff_t i_size;
 	loff_t offset;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	struct page *page;
+
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
 
-	cFYI(1, ("prepare write for page %p from %d to %d",page,from,to));
+	cFYI(1, ("write begin for page %p at pos %lld, length %d",
+		 page, pos, len));
 	if (PageUptodate(page))
 		return 0;
 
-	/* If we are writing a full page it will be up to date,
-	   no need to read from the server */
-	if ((to == PAGE_CACHE_SIZE) && (from == 0)) {
-		SetPageUptodate(page);
+	/* If we are writing a full page it will become up to date,
+	   no need to read from the server (although we may encounter a
+	   short copy, so write_end has to handle this) */
+	if (len
[patch 43/44] minix convert to new aops
Cc: Andries Brouwer [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/minix/dir.c   |   43 +--
 fs/minix/inode.c |   23 +++
 2 files changed, 44 insertions(+), 22 deletions(-)

Index: linux-2.6/fs/minix/inode.c
===================================================================
--- linux-2.6.orig/fs/minix/inode.c
+++ linux-2.6/fs/minix/inode.c
@@ -348,24 +348,39 @@ static int minix_writepage(struct page *
 {
 	return block_write_full_page(page, minix_get_block, wbc);
 }
+
 static int minix_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page,minix_get_block);
 }
-static int minix_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+int __minix_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page,from,to,minix_get_block);
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				minix_get_block);
 }
+
+static int minix_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return __minix_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t minix_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,minix_get_block);
 }
+
 static const struct address_space_operations minix_aops = {
 	.readpage = minix_readpage,
 	.writepage = minix_writepage,
 	.sync_page = block_sync_page,
-	.prepare_write = minix_prepare_write,
-	.commit_write = generic_commit_write,
+	.write_begin = minix_write_begin,
+	.write_end = generic_write_end,
 	.bmap = minix_bmap
 };
 
Index: linux-2.6/fs/minix/dir.c
===================================================================
--- linux-2.6.orig/fs/minix/dir.c
+++ linux-2.6/fs/minix/dir.c
@@ -9,6 +9,7 @@
  */
 
 #include "minix.h"
+#include <linux/buffer_head.h>
 #include <linux/highmem.h>
 #include <linux/smp_lock.h>
 
@@ -48,11 +49,12 @@ static inline unsigned long dir_pages(st
 	return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
 }
 
-static int dir_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-	struct inode *dir = (struct inode *)page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
 	int err = 0;
-	page->mapping->a_ops->commit_write(NULL, page, from, to);
+	block_write_end(NULL, mapping, pos, len, len, page, NULL);
 	if (IS_DIRSYNC(dir))
 		err = write_one_page(page, 1);
 	else
@@ -220,7 +222,7 @@ int minix_add_link(struct dentry *dentry
 	char *kaddr, *p;
 	minix_dirent *de;
 	minix3_dirent *de3;
-	unsigned from, to;
+	loff_t pos;
 	int err;
 	char *namx = NULL;
 	__u32 inumber;
@@ -272,9 +274,9 @@ int minix_add_link(struct dentry *dentry
 	return -EINVAL;
 
 got_it:
-	from = p - (char*)page_address(page);
-	to = from + sbi->s_dirsize;
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+	pos = (page->index << PAGE_CACHE_SHIFT) + p - (char*)page_address(page);
+	err = __minix_write_begin(NULL, page->mapping, pos, sbi->s_dirsize,
+					AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	if (err)
 		goto out_unlock;
 	memcpy (namx, name, namelen);
@@ -285,7 +287,7 @@ got_it:
 		memset (namx + namelen, 0, sbi->s_dirsize - namelen - 2);
 		de->inode = inode->i_ino;
 	}
-	err = dir_commit_chunk(page, from, to);
+	err = dir_commit_chunk(page, pos, sbi->s_dirsize);
 	dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(dir);
 out_put:
@@ -302,15 +304,16 @@ int minix_delete_entry(struct minix_dir_
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = (struct inode*)mapping->host;
 	char *kaddr = page_address(page);
-	unsigned from = (char*)de - kaddr;
-	unsigned to = from + minix_sb(inode->i_sb)->s_dirsize;
+	loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char*)de - kaddr;
+	unsigned len = minix_sb(inode->i_sb)->s_dirsize;
 	int err;
 
 	lock_page(page);
-	err = mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __minix_write_begin(NULL, mapping, pos, len,
+					AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	if (err == 0) {
 		de->inode = 0;
-
[patch 24/44] affs convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/affs/file.c |  106 +++--
 1 file changed, 58 insertions(+), 48 deletions(-)

Index: linux-2.6/fs/affs/file.c
===================================================================
--- linux-2.6.orig/fs/affs/file.c
+++ linux-2.6/fs/affs/file.c
@@ -395,25 +395,33 @@ static int affs_writepage(struct page *p
 {
 	return block_write_full_page(page, affs_get_block, wbc);
 }
+
 static int affs_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page, affs_get_block);
 }
-static int affs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	return cont_prepare_write(page, from, to, affs_get_block,
-		&AFFS_I(page->mapping->host)->mmu_private);
+
+static int affs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				affs_get_block,
+				&AFFS_I(mapping->host)->mmu_private);
 }
+
 static sector_t _affs_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,affs_get_block);
 }
+
 const struct address_space_operations affs_aops = {
 	.readpage = affs_readpage,
 	.writepage = affs_writepage,
 	.sync_page = block_sync_page,
-	.prepare_write = affs_prepare_write,
-	.commit_write = generic_commit_write,
+	.write_begin = affs_write_begin,
+	.write_end = generic_write_end,
 	.bmap = _affs_bmap
 };
 
@@ -603,58 +611,65 @@ affs_readpage_ofs(struct file *file, str
 	return err;
 }
 
-static int affs_prepare_write_ofs(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	struct inode *inode = page->mapping->host;
-	u32 size, offset;
-	u32 tmp;
+static int affs_write_begin_ofs(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
+{
+	struct inode *inode = mapping->host;
+	struct page *page;
+	pgoff_t index;
 	int err = 0;
 
-	pr_debug("AFFS: prepare_write(%u, %ld, %d, %d)\n", (u32)inode->i_ino, page->index, from, to);
-	offset = page->index << PAGE_CACHE_SHIFT;
-	if (offset + from > AFFS_I(inode)->mmu_private) {
-		err = affs_extent_file_ofs(inode, offset + from);
+	pr_debug("AFFS: write_begin(%u, %llu, %llu)\n", (u32)inode->i_ino, (unsigned long long)pos, (unsigned long long)pos + len);
+	if (pos > AFFS_I(inode)->mmu_private) {
+		/* XXX: this probably leaves a too-big i_size in case of
+		 * failure. Should really be updating i_size at write_end time
+		 */
+		err = affs_extent_file_ofs(inode, pos);
 		if (err)
 			return err;
 	}
-	size = inode->i_size;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
 
 	if (PageUptodate(page))
 		return 0;
 
-	if (from) {
-		err = affs_do_readpage_ofs(file, page, 0, from);
-		if (err)
-			return err;
-	}
-	if (to < PAGE_CACHE_SIZE) {
-		char *kaddr = kmap_atomic(page, KM_USER0);
-
-		memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-		flush_dcache_page(page);
-		kunmap_atomic(kaddr, KM_USER0);
-		if (size > offset + to) {
-			if (size < offset + PAGE_CACHE_SIZE)
-				tmp = size & ~PAGE_CACHE_MASK;
-			else
-				tmp = PAGE_CACHE_SIZE;
-			err = affs_do_readpage_ofs(file, page, to, tmp);
-		}
+	/* XXX: inefficient but safe in the face of short writes */
+	err = affs_do_readpage_ofs(file, page, 0, PAGE_CACHE_SIZE);
+	if (err) {
+		unlock_page(page);
+		page_cache_release(page);
 	}
 	return err;
 }
 
-static int affs_commit_write_ofs(struct file *file, struct page *page, unsigned from, unsigned to)
+static int affs_write_end_ofs(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	struct super_block *sb = inode->i_sb;
 	struct buffer_head *bh, *prev_bh;
[patch 25/44] hfs convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hfs/extent.c |   19 ---
 fs/hfs/inode.c  |   20
 2 files changed, 20 insertions(+), 19 deletions(-)

Index: linux-2.6/fs/hfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hfs/inode.c
+++ linux-2.6/fs/hfs/inode.c
@@ -34,10 +34,14 @@ static int hfs_readpage(struct file *fil
 	return block_read_full_page(page, hfs_get_block);
 }
 
-static int hfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	return cont_prepare_write(page, from, to, hfs_get_block,
-				  &HFS_I(page->mapping->host)->phys_size);
+static int hfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				hfs_get_block,
+				&HFS_I(mapping->host)->phys_size);
 }
 
 static sector_t hfs_bmap(struct address_space *mapping, sector_t block)
@@ -118,8 +122,8 @@ const struct address_space_operations hf
 	.readpage	= hfs_readpage,
 	.writepage	= hfs_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= hfs_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= hfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= hfs_bmap,
 	.releasepage	= hfs_releasepage,
 };
@@ -128,8 +132,8 @@ const struct address_space_operations hf
 	.readpage	= hfs_readpage,
 	.writepage	= hfs_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= hfs_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= hfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= hfs_bmap,
 	.direct_IO	= hfs_direct_IO,
 	.writepages	= hfs_writepages,
Index: linux-2.6/fs/hfs/extent.c
===================================================================
--- linux-2.6.orig/fs/hfs/extent.c
+++ linux-2.6/fs/hfs/extent.c
@@ -464,23 +464,20 @@ void hfs_file_truncate(struct inode *ino
 	       (long long)HFS_I(inode)->phys_size, inode->i_size);
 	if (inode->i_size > HFS_I(inode)->phys_size) {
 		struct address_space *mapping = inode->i_mapping;
+		void *fsdata;
 		struct page *page;
 		int res;
 
+		/* XXX: Can use generic_cont_expand? */
 		size = inode->i_size - 1;
-		page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-		if (!page)
-			return;
-		size &= PAGE_CACHE_SIZE - 1;
-		size++;
-		res = mapping->a_ops->prepare_write(NULL, page, size, size);
-		if (!res)
-			res = mapping->a_ops->commit_write(NULL, page, size, size);
+		res = pagecache_write_begin(NULL, mapping, size+1, 0,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+		if (!res) {
+			res = pagecache_write_end(NULL, mapping, size+1, 0, 0,
+					page, fsdata);
+		}
 		if (res)
 			inode->i_size = HFS_I(inode)->phys_size;
-		unlock_page(page);
-		page_cache_release(page);
-		mark_inode_dirty(inode);
 		return;
 	} else if (inode->i_size == HFS_I(inode)->phys_size)
 		return;
[patch 30/44] nfs convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/nfs/file.c | 49
 1 file changed, 36 insertions(+), 13 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -282,27 +282,50 @@ nfs_fsync(struct file *file, struct dent
 }
 
 /*
- * This does the real work of the write. The generic routine has
- * allocated the page, locked it, done all the page alignment stuff
- * calculations etc. Now we should just copy the data from user
- * space and write it back to the real medium..
+ * This does the real work of the write. We must allocate and lock the
+ * page to be sent back to the generic routine, which then copies the
+ * data from user space.
  *
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
  */
-static int nfs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
-{
-	return nfs_flush_incompatible(file, page);
+static int nfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	int ret;
+	pgoff_t index;
+	struct page *page;
+	index = pos >> PAGE_CACHE_SHIFT;
+
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
+
+	ret = nfs_flush_incompatible(file, page);
+	if (ret) {
+		unlock_page(page);
+		page_cache_release(page);
+	}
+	return ret;
 }
 
-static int nfs_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int nfs_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	long status;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+	int status;
 
 	lock_kernel();
-	status = nfs_updatepage(file, page, offset, to-offset);
+	status = nfs_updatepage(file, page, offset, copied);
 	unlock_kernel();
-	return status;
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return status < 0 ? status : copied;
 }
 
 static void nfs_invalidate_page(struct page *page, unsigned long offset)
@@ -330,8 +353,8 @@ const struct address_space_operations nf
 	.set_page_dirty = nfs_set_page_dirty,
 	.writepage = nfs_writepage,
 	.writepages = nfs_writepages,
-	.prepare_write = nfs_prepare_write,
-	.commit_write = nfs_commit_write,
+	.write_begin = nfs_write_begin,
+	.write_end = nfs_write_end,
 	.invalidatepage = nfs_invalidate_page,
 	.releasepage = nfs_release_page,
 #ifdef CONFIG_NFS_DIRECTIO

--
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
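For readers following the conversion, the arithmetic that replaces the old (page, offset, to) arguments can be sketched in user space. This is only an illustration, not kernel code: the page shift of 12 is an assumed value, and pos_to_index, pos_to_offset and write_end_result are hypothetical helper names that mirror the expressions in nfs_write_begin and nfs_write_end.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration; the kernel defines these per architecture. */
#define PAGE_CACHE_SHIFT 12
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)

/* Index of the page that holds byte position pos (what write_begin looks up). */
static uint64_t pos_to_index(int64_t pos)
{
	assert(pos >= 0);
	return (uint64_t)pos >> PAGE_CACHE_SHIFT;
}

/* Byte offset of pos inside its page (what write_end hands to the fs). */
static unsigned pos_to_offset(int64_t pos)
{
	return (unsigned)(pos & (PAGE_CACHE_SIZE - 1));
}

/*
 * Return convention used by nfs_write_end in the patch: a negative
 * status is passed through, otherwise the number of bytes copied.
 */
static int write_end_result(int status, unsigned copied)
{
	return status < 0 ? status : (int)copied;
}
```

With a 4096-byte page, position 8197 falls in page 2 at offset 5, which is the pair the old prepare_write interface received pre-computed.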
[patch 23/44] adfs convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/adfs/inode.c | 14
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6/fs/adfs/inode.c
===================================================================
--- linux-2.6.orig/fs/adfs/inode.c
+++ linux-2.6/fs/adfs/inode.c
@@ -61,10 +61,14 @@ static int adfs_readpage(struct file *fi
 	return block_read_full_page(page, adfs_get_block);
 }
 
-static int adfs_prepare_write(struct file *file, struct page *page, unsigned int from, unsigned int to)
+static int adfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return cont_prepare_write(page, from, to, adfs_get_block,
-		&ADFS_I(page->mapping->host)->mmu_private);
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				adfs_get_block,
+				&ADFS_I(mapping->host)->mmu_private);
 }
 
 static sector_t _adfs_bmap(struct address_space *mapping, sector_t block)
@@ -76,8 +80,8 @@ static const struct address_space_operat
 	.readpage	= adfs_readpage,
 	.writepage	= adfs_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= adfs_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= adfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= _adfs_bmap
 };
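The cont_* helpers exist for filesystems that cannot store holes, so a write past the initialized part of a file has to zero the gap first. A toy user-space model of that idea follows. It is a sketch under assumptions, not the kernel algorithm: struct toy_file and toy_cont_write are made-up names, and the initialized field only plays the role that mmu_private plays for adfs.

```c
#include <assert.h>
#include <string.h>

#define TOY_CAPACITY 64

/* Toy stand-in for a file on a filesystem that cannot represent holes. */
struct toy_file {
	unsigned char data[TOY_CAPACITY];
	size_t initialized;	/* bytes known to be valid, like mmu_private */
};

/*
 * Write len bytes at pos. Any gap between the initialized prefix and
 * pos is zeroed first, so no byte below the new end stays undefined.
 * Returns 0 on success, -1 if the write does not fit.
 */
static int toy_cont_write(struct toy_file *f, size_t pos,
			  const void *buf, size_t len)
{
	assert(f->initialized <= TOY_CAPACITY);
	if (pos > TOY_CAPACITY || len > TOY_CAPACITY - pos)
		return -1;
	if (pos > f->initialized)
		memset(f->data + f->initialized, 0, pos - f->initialized);
	memcpy(f->data + pos, buf, len);
	if (pos + len > f->initialized)
		f->initialized = pos + len;
	return 0;
}
```

The real helper works a page at a time through write_begin/write_end; the toy version collapses that into a single memset to keep the invariant visible.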
[patch 41/44] udf convert to new aops
Convert udf to new aops. Also seem to have fixed pagecache corruption in
udf_adinicb_commit_write -- page was marked uptodate when it is not. Also,
fixed the silly setup where prepare_write was doing a kmap to be used in
commit_write: just do kmap_atomic in write_end. Use libfs helpers to make
this easier.

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/udf/file.c  | 32
 fs/udf/inode.c | 11
 2 files changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/udf/file.c
===================================================================
--- linux-2.6.orig/fs/udf/file.c
+++ linux-2.6/fs/udf/file.c
@@ -73,34 +73,28 @@ static int udf_adinicb_writepage(struct
 	return 0;
 }
 
-static int udf_adinicb_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int udf_adinicb_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	kmap(page);
-	return 0;
-}
-
-static int udf_adinicb_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
-{
-	struct inode *inode = page->mapping->host;
-	char *kaddr = page_address(page);
+	struct inode *inode = mapping->host;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+	char *kaddr;
 
+	kaddr = kmap_atomic(page, KM_USER0);
 	memcpy(UDF_I_DATA(inode) + UDF_I_LENEATTR(inode) + offset,
-		kaddr + offset, to - offset);
-	mark_inode_dirty(inode);
-	SetPageUptodate(page);
-	kunmap(page);
-	/* only one page here */
-	if (to > inode->i_size)
-		inode->i_size = to;
-	return 0;
+		kaddr + offset, copied);
+	kunmap_atomic(kaddr, KM_USER0);
+
+	return simple_write_end(file, mapping, pos, len, copied, page, fsdata);
 }
 
 const struct address_space_operations udf_adinicb_aops = {
 	.readpage	= udf_adinicb_readpage,
 	.writepage	= udf_adinicb_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= udf_adinicb_prepare_write,
-	.commit_write	= udf_adinicb_commit_write,
+	.write_begin	= simple_write_begin,
+	.write_end	= udf_adinicb_write_end,
 };
 
 static ssize_t udf_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
Index: linux-2.6/fs/udf/inode.c
===================================================================
--- linux-2.6.orig/fs/udf/inode.c
+++ linux-2.6/fs/udf/inode.c
@@ -122,9 +122,12 @@ static int udf_readpage(struct file *fil
 	return block_read_full_page(page, udf_get_block);
 }
 
-static int udf_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int udf_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page, from, to, udf_get_block);
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				udf_get_block);
 }
 
 static sector_t udf_bmap(struct address_space *mapping, sector_t block)
@@ -136,8 +139,8 @@ const struct address_space_operations ud
 	.readpage	= udf_readpage,
 	.writepage	= udf_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= udf_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= udf_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= udf_bmap,
 };
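The change in udf_adinicb_write_end is that it mirrors only the bytes that were actually copied into the page, rather than the whole requested range. A rough user-space sketch of that step, assuming a 4096-byte page and using a hypothetical helper name (inline_write_end) and a plain byte array in place of the in-inode data area:

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size */

/*
 * Mirror the bytes that really landed in the page into the inode's
 * inline data area. copied may be smaller than the length originally
 * requested, so it is the only count used here.
 * Returns the number of bytes mirrored, or -1 if they would not fit.
 */
static int inline_write_end(unsigned char *inline_data, size_t inline_cap,
			    const unsigned char *page,
			    unsigned long long pos, unsigned copied)
{
	unsigned offset = (unsigned)(pos & (TOY_PAGE_SIZE - 1));

	if (offset > inline_cap || copied > inline_cap - offset)
		return -1;
	memcpy(inline_data + offset, page + offset, copied);
	return (int)copied;
}
```

In the patch itself, size updates and page state are left to simple_write_end; the sketch covers only the memcpy step.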
[patch 35/44] ecryptfs convert to new aops
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Cc: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Signed-off-by: Nick Piggin [EMAIL PROTECTED] fs/ecryptfs/crypto.c | 32 +++--- fs/ecryptfs/ecryptfs_kernel.h |4 fs/ecryptfs/mmap.c| 213 +++--- 3 files changed, 119 insertions(+), 130 deletions(-) Index: linux-2.6/fs/ecryptfs/mmap.c === --- linux-2.6.orig/fs/ecryptfs/mmap.c +++ linux-2.6/fs/ecryptfs/mmap.c @@ -36,26 +36,6 @@ struct kmem_cache *ecryptfs_lower_page_cache; -/** - * ecryptfs_get1page - * - * Get one page from cache or lower f/s, return error otherwise. - * - * Returns unlocked and up-to-date page (if ok), with increased - * refcnt. - */ -static struct page *ecryptfs_get1page(struct file *file, int index) -{ - struct dentry *dentry; - struct inode *inode; - struct address_space *mapping; - - dentry = file-f_path.dentry; - inode = dentry-d_inode; - mapping = inode-i_mapping; - return read_mapping_page(mapping, index, (void *)file); -} - static int write_zeros(struct file *file, pgoff_t index, int start, int num_zeros); @@ -360,17 +340,14 @@ out: /** * Called with lower inode mutex held. 
*/ -static int fill_zeros_to_end_of_page(struct page *page, unsigned int to) +static int fill_zeros_to_end_of_page(struct page *page, loff_t new_isize) { - struct inode *inode = page-mapping-host; int end_byte_in_page; char *page_virt; - if ((i_size_read(inode) / PAGE_CACHE_SIZE) != page-index) + if ((new_isize PAGE_CACHE_SHIFT) != page-index) goto out; - end_byte_in_page = i_size_read(inode) % PAGE_CACHE_SIZE; - if (to end_byte_in_page) - end_byte_in_page = to; + end_byte_in_page = new_isize % PAGE_CACHE_SIZE; page_virt = kmap_atomic(page, KM_USER0); memset((page_virt + end_byte_in_page), 0, (PAGE_CACHE_SIZE - end_byte_in_page)); @@ -380,16 +357,35 @@ out: return 0; } -static int ecryptfs_prepare_write(struct file *file, struct page *page, - unsigned from, unsigned to) +static int ecryptfs_write_begin(struct file *file,struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { + struct page *page; + pgoff_t index; int rc = 0; - if (from == 0 to == PAGE_CACHE_SIZE) - goto out; /* If we are writing a full page, it will be - up to date. 
*/ - if (!PageUptodate(page)) - rc = ecryptfs_do_readpage(file, page, page-index); + index = pos PAGE_CACHE_SHIFT; + page = __grab_cache_page(mapping, index); + if (!page) { + rc = -ENOMEM; + goto out; + } + + /* +* If we are writing a full page (with no possibility of a short +* write), it will be guaranteed to end up being uptodate at +* write_end-time +*/ + if (flags AOP_FLAG_UNINTERRUPTIBLE len == PAGE_CACHE_SIZE) + goto out; + if (!PageUptodate(page)) { + rc = ecryptfs_do_readpage(file, page, index); + if (rc) { + unlock_page(page); + page_cache_release(page); + } + } out: return rc; } @@ -412,12 +408,6 @@ out: return rc; } -static void ecryptfs_release_lower_page(struct page *lower_page) -{ - unlock_page(lower_page); - page_cache_release(lower_page); -} - /** * ecryptfs_write_inode_size_to_header * @@ -431,23 +421,17 @@ static int ecryptfs_write_inode_size_to_ { int rc = 0; struct page *header_page; + void *fsdata; char *header_virt; - const struct address_space_operations *lower_a_ops; + struct address_space *lower_mapping = lower_inode-i_mapping; u64 file_size; - header_page = grab_cache_page(lower_inode-i_mapping, 0); - if (!header_page) { - ecryptfs_printk(KERN_ERR, grab_cache_page for - lower_page_index 0 failed\n); - rc = -EINVAL; - goto out; - } - lower_a_ops = lower_inode-i_mapping-a_ops; - rc = lower_a_ops-prepare_write(lower_file, header_page, 0, 8); - if (rc) { - ecryptfs_release_lower_page(header_page); + rc = pagecache_write_begin(lower_file, lower_mapping, 0, sizeof(u64), + AOP_FLAG_UNINTERRUPTIBLE, + header_page, fsdata); + if (rc) goto out; - } + file_size = (u64)i_size_read(inode); ecryptfs_printk(KERN_DEBUG, Writing size: [0x%.16x]\n,
[patch 32/44] ocfs2: convert to new aops
From: Mark Fasheh [EMAIL PROTECTED]

Fix up ocfs2 to use ->write_begin and ->write_end. This lets us dump a large
amount of code which was implementing our own write path while preserving the
nice locking rules that were gained by moving away from ->prepare_write. It
makes use of the context back pointer to store information related to the
write which the vfs normally doesn't know about. Most importantly this is an
array of zero'd pages which might have to be written out for an allocating
write. Of note is that I also stick the journal handle on there. Ocfs2 could
use current->journal_info for that, but I think it's much cleaner to just
pass that around as a file system specific context.

I tested this on a couple nodes and things seem to be running smoothly.

A couple of notes:

* The ocfs2 write context is probably a bit big. I'm much more concerned with
  readability though as Ocfs2 has much more baggage to carry around than
  other file systems.

* A ton of code was deleted :) The patch adds a bunch too, but that's mostly
  getting the old stuff to flow with ->write_begin. Some assumptions about
  the size of the write that were made with my previous implementation were
  no longer true (this is good).

* I could probably clean this up some more, but I'd be fine if the patch went
  upstream as-is. Diff seems to have mangled this patch file enough that it's
  probably much easier to read once applied.

* This doesn't use ->perform_write (yet), so stuff is still being copied one
  page at a time. I _think_ things are pretty reasonably set up to allow
  larger writes though...
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Mark Fasheh [EMAIL PROTECTED] fs/ocfs2/aops.c | 779 +++- fs/ocfs2/aops.h | 52 --- fs/ocfs2/file.c | 246 - 3 files changed, 453 insertions(+), 624 deletions(-) Index: linux-2.6/fs/ocfs2/aops.c === --- linux-2.6.orig/fs/ocfs2/aops.c +++ linux-2.6/fs/ocfs2/aops.c @@ -677,6 +677,8 @@ int ocfs2_map_page_blocks(struct page *p bh = bh-b_this_page, block_start += bsize) { block_end = block_start + bsize; + clear_buffer_new(bh); + /* * Ignore blocks outside of our i/o range - * they may belong to unallocated clusters. @@ -691,9 +693,8 @@ int ocfs2_map_page_blocks(struct page *p * For an allocating write with cluster size = page * size, we always write the entire page. */ - - if (buffer_new(bh)) - clear_buffer_new(bh); + if (new) + set_buffer_new(bh); if (!buffer_mapped(bh)) { map_bh(bh, inode-i_sb, *p_blkno); @@ -754,217 +755,187 @@ next_bh: return ret; } +#if (PAGE_CACHE_SIZE = OCFS2_MAX_CLUSTERSIZE) +#define OCFS2_MAX_CTXT_PAGES 1 +#else +#define OCFS2_MAX_CTXT_PAGES (OCFS2_MAX_CLUSTERSIZE / PAGE_CACHE_SIZE) +#endif + +#define OCFS2_MAX_CLUSTERS_PER_PAGE(PAGE_CACHE_SIZE / OCFS2_MIN_CLUSTERSIZE) + /* - * This will copy user data from the buffer page in the splice - * context. - * - * For now, we ignore SPLICE_F_MOVE as that would require some extra - * communication out all the way to ocfs2_write(). + * Describe the state of a single cluster to be written to. */ -int ocfs2_map_and_write_splice_data(struct inode *inode, - struct ocfs2_write_ctxt *wc, u64 *p_blkno, - unsigned int *ret_from, unsigned int *ret_to) -{ - int ret; - unsigned int to, from, cluster_start, cluster_end; - char *src, *dst; - struct ocfs2_splice_write_priv *sp = wc-w_private; - struct pipe_buffer *buf = sp-s_buf; - unsigned long bytes, src_from; - struct ocfs2_super *osb = OCFS2_SB(inode-i_sb); +struct ocfs2_write_cluster_desc { + u32 c_cpos; + u32 c_phys; + /* +* Give this a unique field because c_phys eventually gets +* filled. 
+*/ + unsignedc_new; +}; - ocfs2_figure_cluster_boundaries(osb, wc-w_cpos, cluster_start, - cluster_end); +struct ocfs2_write_ctxt { + /* Logical cluster position / len of write */ + u32 w_cpos; + u32 w_clen; - from = sp-s_offset; - src_from = sp-s_buf_offset; - bytes = wc-w_count; + struct ocfs2_write_cluster_desc w_desc[OCFS2_MAX_CLUSTERS_PER_PAGE]; - if (wc-w_large_pages) { - /* -* For cluster size page size, we have to -* calculate pos within the cluster and obey -* the rightmost boundary. -*/ - bytes = min(bytes, (unsigned
[patch 21/44] fs: new cont helpers
Rework the generic block cont routines to handle the new aops. Supporting cont_prepare_write would take quite a lot of code to support, so remove it instead (and we later convert all filesystems to use it). write_begin gets passed AOP_FLAG_CONT_EXPAND when called from generic_cont_expand, so filesystems can avoid the old hacks they used. Cc: [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] fs/buffer.c | 204 +--- include/linux/buffer_head.h |5 - include/linux/fs.h |1 mm/filemap.c|5 + 4 files changed, 110 insertions(+), 105 deletions(-) Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c +++ linux-2.6/fs/buffer.c @@ -2027,6 +2027,7 @@ int generic_write_end(struct file *file, loff_t pos, unsigned len, unsigned copied, struct page *page, void *fsdata) { + struct inode *inode = mapping-host; copied = block_write_end(file, mapping, pos, len, copied, page, fsdata); unlock_page(page); @@ -2041,6 +2042,8 @@ int generic_write_end(struct file *file, i_size_write(inode, pos+copied); mark_inode_dirty(inode); } + + return copied; } EXPORT_SYMBOL(generic_write_end); @@ -2142,14 +2145,14 @@ int block_read_full_page(struct page *pa } /* utility function for filesystems that need to do work on expanding - * truncates. Uses prepare/commit_write to allow the filesystem to + * truncates. Uses filesystem pagecache writes to allow the filesystem to * deal with the hole. 
*/ -static int __generic_cont_expand(struct inode *inode, loff_t size, -pgoff_t index, unsigned int offset) +int generic_cont_expand_simple(struct inode *inode, loff_t size) { struct address_space *mapping = inode-i_mapping; struct page *page; + void *fsdata; unsigned long limit; int err; @@ -2162,146 +2165,141 @@ static int __generic_cont_expand(struct if (size inode-i_sb-s_maxbytes) goto out; - err = -ENOMEM; - page = grab_cache_page(mapping, index); - if (!page) - goto out; - err = mapping-a_ops-prepare_write(NULL, page, offset, offset); - if (err) { - /* -* -prepare_write() may have instantiated a few blocks -* outside i_size. Trim these off again. -*/ - unlock_page(page); - page_cache_release(page); - vmtruncate(inode, inode-i_size); + err = pagecache_write_begin(NULL, mapping, size, 0, + AOP_FLAG_UNINTERRUPTIBLE|AOP_FLAG_CONT_EXPAND, + page, fsdata); + if (err) goto out; - } - err = mapping-a_ops-commit_write(NULL, page, offset, offset); + err = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata); + BUG_ON(err 0); - unlock_page(page); - page_cache_release(page); - if (err 0) - err = 0; out: return err; } int generic_cont_expand(struct inode *inode, loff_t size) { - pgoff_t index; unsigned int offset; offset = (size (PAGE_CACHE_SIZE - 1)); /* Within page */ /* ugh. in prepare/commit_write, if from==to==start of block, we - ** skip the prepare. make sure we never send an offset for the start - ** of a block - */ +* skip the prepare. make sure we never send an offset for the start +* of a block. +* XXX: actually, this should be handled in those filesystems by +* checking for the AOP_FLAG_CONT_EXPAND flag. +*/ if ((offset (inode-i_sb-s_blocksize - 1)) == 0) { /* caller must handle this extra byte. 
*/ - offset++; + size++; } - index = size PAGE_CACHE_SHIFT; - - return __generic_cont_expand(inode, size, index, offset); + return generic_cont_expand_simple(inode, size); } -int generic_cont_expand_simple(struct inode *inode, loff_t size) +int cont_expand_zero(struct file *file, struct address_space *mapping, + loff_t pos, loff_t *bytes) { - loff_t pos = size - 1; - pgoff_t index = pos PAGE_CACHE_SHIFT; - unsigned int offset = (pos (PAGE_CACHE_SIZE - 1)) + 1; - - /* prepare/commit_write can handle even if from==to==start of block. */ - return __generic_cont_expand(inode, size, index, offset); -} - -/* - * For moronic filesystems that do not allow holes in file. - * We may have to extend the file. - */ - -int cont_prepare_write(struct page *page, unsigned offset, - unsigned to, get_block_t *get_block, loff_t *bytes) -{ - struct
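The size adjustment in generic_cont_expand above can be tried out in isolation. The following is a user-space sketch only: cont_expand_target is a hypothetical name, the page shift of 12 is assumed, and blocksize is taken to be a power of two no larger than the page size.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_CACHE_SHIFT 12	/* assumed for illustration */
#define PAGE_CACHE_SIZE  (1u << PAGE_CACHE_SHIFT)

/*
 * Size handed on to generic_cont_expand_simple: if the requested size
 * ends exactly on a block boundary, ask for one extra byte so the
 * write path does not skip the prepare step (the caller is expected
 * to deal with that extra byte).
 */
static int64_t cont_expand_target(int64_t size, unsigned blocksize)
{
	unsigned offset = (unsigned)(size & (PAGE_CACHE_SIZE - 1));

	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	if ((offset & (blocksize - 1)) == 0)
		size++;
	return size;
}
```

As the XXX comment in the patch says, filesystems could instead check AOP_FLAG_CONT_EXPAND themselves, which would make this bump unnecessary.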
[patch 40/44] ufs convert to new aops
Cc: [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] fs/ufs/dir.c | 50 +++--- fs/ufs/inode.c | 23 +++ 2 files changed, 50 insertions(+), 23 deletions(-) Index: linux-2.6/fs/ufs/inode.c === --- linux-2.6.orig/fs/ufs/inode.c +++ linux-2.6/fs/ufs/inode.c @@ -558,24 +558,39 @@ static int ufs_writepage(struct page *pa { return block_write_full_page(page,ufs_getfrag_block,wbc); } + static int ufs_readpage(struct file *file, struct page *page) { return block_read_full_page(page,ufs_getfrag_block); } -static int ufs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) + +int __ufs_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { - return block_prepare_write(page,from,to,ufs_getfrag_block); + return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata, + ufs_getfrag_block); } + +static int ufs_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + *pagep = NULL; + return __ufs_write_begin(file, mapping, pos, len, flags, pagep, fsdata); +} + static sector_t ufs_bmap(struct address_space *mapping, sector_t block) { return generic_block_bmap(mapping,block,ufs_getfrag_block); } + const struct address_space_operations ufs_aops = { .readpage = ufs_readpage, .writepage = ufs_writepage, .sync_page = block_sync_page, - .prepare_write = ufs_prepare_write, - .commit_write = generic_commit_write, + .write_begin = ufs_write_begin, + .write_end = generic_write_end, .bmap = ufs_bmap }; Index: linux-2.6/fs/ufs/dir.c === --- linux-2.6.orig/fs/ufs/dir.c +++ linux-2.6/fs/ufs/dir.c @@ -38,12 +38,14 @@ static inline int ufs_match(struct super return !memcmp(name, de-d_name, len); } -static int ufs_commit_chunk(struct page *page, unsigned from, unsigned to) +static int ufs_commit_chunk(struct page *page, 
loff_t pos, unsigned len) { - struct inode *dir = page-mapping-host; + struct address_space *mapping = page-mapping; + struct inode *dir = mapping-host; int err = 0; + dir-i_version++; - page-mapping-a_ops-commit_write(NULL, page, from, to); + block_write_end(NULL, mapping, pos, len, len, page, NULL); if (IS_DIRSYNC(dir)) err = write_one_page(page, 1); else @@ -81,16 +83,20 @@ ino_t ufs_inode_by_name(struct inode *di void ufs_set_link(struct inode *dir, struct ufs_dir_entry *de, struct page *page, struct inode *inode) { - unsigned from = (char *) de - (char *) page_address(page); - unsigned to = from + fs16_to_cpu(dir-i_sb, de-d_reclen); + loff_t pos = (page-index PAGE_CACHE_SHIFT) + + (char *) de - (char *) page_address(page); + unsigned len = fs16_to_cpu(dir-i_sb, de-d_reclen); int err; lock_page(page); - err = page-mapping-a_ops-prepare_write(NULL, page, from, to); + err = __ufs_write_begin(NULL, page-mapping, pos, len, + AOP_FLAG_UNINTERRUPTIBLE, page, NULL); BUG_ON(err); + de-d_ino = cpu_to_fs32(dir-i_sb, inode-i_ino); ufs_set_de_type(dir-i_sb, de, inode-i_mode); - err = ufs_commit_chunk(page, from, to); + + err = ufs_commit_chunk(page, pos, len); ufs_put_page(page); dir-i_mtime = dir-i_ctime = CURRENT_TIME_SEC; mark_inode_dirty(dir); @@ -312,7 +318,7 @@ int ufs_add_link(struct dentry *dentry, unsigned long npages = ufs_dir_pages(dir); unsigned long n; char *kaddr; - unsigned from, to; + loff_t pos; int err; UFSD(ENTER, name %s, namelen %u\n, name, namelen); @@ -367,9 +373,10 @@ int ufs_add_link(struct dentry *dentry, return -EINVAL; got_it: - from = (char*)de - (char*)page_address(page); - to = from + rec_len; - err = page-mapping-a_ops-prepare_write(NULL, page, from, to); + pos = (page-index PAGE_CACHE_SHIFT) + + (char*)de - (char*)page_address(page); + err = __ufs_write_begin(NULL, page-mapping, pos, rec_len, + AOP_FLAG_UNINTERRUPTIBLE, page, NULL); if (err) goto out_unlock; if (de-d_ino) { @@ -386,7 +393,7 @@ got_it: de-d_ino = cpu_to_fs32(sb, 
inode-i_ino); ufs_set_de_type(sb, de, inode-i_mode); - err = ufs_commit_chunk(page, from,
[patch 42/44] sysv convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/sysv/dir.c   | 45
 fs/sysv/itree.c | 23
 2 files changed, 44 insertions(+), 24 deletions(-)

Index: linux-2.6/fs/sysv/itree.c
===================================================================
--- linux-2.6.orig/fs/sysv/itree.c
+++ linux-2.6/fs/sysv/itree.c
@@ -453,23 +453,38 @@ static int sysv_writepage(struct page *p
 {
 	return block_write_full_page(page,get_block,wbc);
 }
+
 static int sysv_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page,get_block);
 }
-static int sysv_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+int __sysv_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page,from,to,get_block);
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				get_block);
 }
+
+static int sysv_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return __sysv_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t sysv_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,get_block);
 }
+
 const struct address_space_operations sysv_aops = {
 	.readpage = sysv_readpage,
 	.writepage = sysv_writepage,
 	.sync_page = block_sync_page,
-	.prepare_write = sysv_prepare_write,
-	.commit_write = generic_commit_write,
+	.write_begin = sysv_write_begin,
+	.write_end = generic_write_end,
 	.bmap = sysv_bmap
 };
Index: linux-2.6/fs/sysv/dir.c
===================================================================
--- linux-2.6.orig/fs/sysv/dir.c
+++ linux-2.6/fs/sysv/dir.c
@@ -37,12 +37,13 @@ static inline unsigned long dir_pages(st
 	return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
 }
 
-static int dir_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-	struct inode *dir = (struct inode *)page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
 	int err = 0;
 
-	page->mapping->a_ops->commit_write(NULL, page, from, to);
+	block_write_end(NULL, mapping, pos, len, len, page, NULL);
 	if (IS_DIRSYNC(dir))
 		err = write_one_page(page, 1);
 	else
@@ -186,7 +187,7 @@ int sysv_add_link(struct dentry *dentry,
 	unsigned long npages = dir_pages(dir);
 	unsigned long n;
 	char *kaddr;
-	unsigned from, to;
+	loff_t pos;
 	int err;
 
 	/* We take care of directory expansion in the same loop */
@@ -212,16 +213,17 @@ int sysv_add_link(struct dentry *dentry,
 	return -EINVAL;
 
 got_it:
-	from = (char*)de - (char*)page_address(page);
-	to = from + SYSV_DIRSIZE;
+	pos = (page->index << PAGE_CACHE_SHIFT) +
+			(char*)de - (char*)page_address(page);
 	lock_page(page);
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __sysv_write_begin(NULL, page->mapping, pos, SYSV_DIRSIZE,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	if (err)
 		goto out_unlock;
 	memcpy (de->name, name, namelen);
 	memset (de->name + namelen, 0, SYSV_DIRSIZE - namelen - 2);
 	de->inode = cpu_to_fs16(SYSV_SB(inode->i_sb), inode->i_ino);
-	err = dir_commit_chunk(page, from, to);
+	err = dir_commit_chunk(page, pos, SYSV_DIRSIZE);
 	dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(dir);
 out_page:
@@ -238,15 +240,15 @@ int sysv_delete_entry(struct sysv_dir_en
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = (struct inode*)mapping->host;
 	char *kaddr = (char*)page_address(page);
-	unsigned from = (char*)de - kaddr;
-	unsigned to = from + SYSV_DIRSIZE;
+	loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char *)de - kaddr;
 	int err;
 
 	lock_page(page);
-	err = mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __sysv_write_begin(NULL, mapping, pos, SYSV_DIRSIZE,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	BUG_ON(err);
 	de->inode = 0;
-	err = dir_commit_chunk(page, from, to);
+	err = dir_commit_chunk(page, pos, SYSV_DIRSIZE);
 	dir_put_page(page);
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;
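The directory code now has to turn "entry pointer inside a mapped page" back into a file position before calling write_begin. A small user-space sketch of that computation is below; struct toy_page and entry_pos are hypothetical names and the shift of 12 is assumed. The sketch widens the index before shifting, as the removed gfs2 code did with its (loff_t) cast; the sysv patch omits the cast, which only matters for files far larger than a directory normally gets.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_CACHE_SHIFT 12	/* assumed for illustration */

/* Minimal stand-in for the two struct page facts the calculation needs. */
struct toy_page {
	uint32_t index;		/* page number within the file */
	unsigned char *addr;	/* where the page contents are mapped */
};

/*
 * File position of a directory entry that lives inside a mapped page:
 * start of the page plus the entry's distance from the page start.
 */
static int64_t entry_pos(const struct toy_page *page, const void *entry)
{
	int64_t in_page = (const unsigned char *)entry - page->addr;

	assert(in_page >= 0 && in_page < ((int64_t)1 << PAGE_CACHE_SHIFT));
	/* Widen before shifting so a 32-bit index cannot overflow. */
	return ((int64_t)page->index << PAGE_CACHE_SHIFT) + in_page;
}
```

The length that used to be encoded as to - from is now passed separately (SYSV_DIRSIZE in the patch).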
[patch 34/44] fs: no AOP_TRUNCATED_PAGE for writes
prepare/commit_write no longer returns AOP_TRUNCATED_PAGE since OCFS2 and GFS2 were converted to the new aops, so we can make some simplifications for that. Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] Documentation/filesystems/vfs.txt |6 - fs/ecryptfs/mmap.c| 39 +- include/linux/fs.h|2 - mm/filemap.c | 21 +++- 4 files changed, 20 insertions(+), 48 deletions(-) Index: linux-2.6/Documentation/filesystems/vfs.txt === --- linux-2.6.orig/Documentation/filesystems/vfs.txt +++ linux-2.6/Documentation/filesystems/vfs.txt @@ -619,11 +619,7 @@ struct address_space_operations { any basic-blocks on storage, then those blocks should be pre-read (if they haven't been read already) so that the updated blocks can be written out properly. - The page will be locked. If prepare_write wants to unlock the - page it, like readpage, may do so and return - AOP_TRUNCATED_PAGE. - In this case the prepare_write will be retried one the lock is - regained. + The page will be locked. Note: the page _must not_ be marked uptodate in this function (or anywhere else) unless it actually is uptodate right now. 
As Index: linux-2.6/fs/ecryptfs/mmap.c === --- linux-2.6.orig/fs/ecryptfs/mmap.c +++ linux-2.6/fs/ecryptfs/mmap.c @@ -412,11 +412,9 @@ out: return rc; } -static -void ecryptfs_release_lower_page(struct page *lower_page, int page_locked) +static void ecryptfs_release_lower_page(struct page *lower_page) { - if (page_locked) - unlock_page(lower_page); + unlock_page(lower_page); page_cache_release(lower_page); } @@ -437,7 +435,6 @@ static int ecryptfs_write_inode_size_to_ const struct address_space_operations *lower_a_ops; u64 file_size; -retry: header_page = grab_cache_page(lower_inode-i_mapping, 0); if (!header_page) { ecryptfs_printk(KERN_ERR, grab_cache_page for @@ -448,11 +445,7 @@ retry: lower_a_ops = lower_inode-i_mapping-a_ops; rc = lower_a_ops-prepare_write(lower_file, header_page, 0, 8); if (rc) { - if (rc == AOP_TRUNCATED_PAGE) { - ecryptfs_release_lower_page(header_page, 0); - goto retry; - } else - ecryptfs_release_lower_page(header_page, 1); + ecryptfs_release_lower_page(header_page); goto out; } file_size = (u64)i_size_read(inode); @@ -466,11 +459,7 @@ retry: if (rc 0) ecryptfs_printk(KERN_ERR, Error commiting header page write\n); - if (rc == AOP_TRUNCATED_PAGE) { - ecryptfs_release_lower_page(header_page, 0); - goto retry; - } else - ecryptfs_release_lower_page(header_page, 1); + ecryptfs_release_lower_page(header_page); lower_inode-i_mtime = lower_inode-i_ctime = CURRENT_TIME; mark_inode_dirty_sync(inode); out: @@ -573,16 +562,11 @@ retry: byte_offset, region_bytes); if (rc) { - if (rc == AOP_TRUNCATED_PAGE) { - ecryptfs_release_lower_page(*lower_page, 0); - goto retry; - } else { - ecryptfs_printk(KERN_ERR, prepare_write for - lower_page_index = [0x%.16x] failed; rc = - [%d]\n, lower_page_index, rc); - ecryptfs_release_lower_page(*lower_page, 1); - (*lower_page) = NULL; - } + ecryptfs_printk(KERN_ERR, prepare_write for + lower_page_index = [0x%.16x] failed; rc = + [%d]\n, lower_page_index, rc); + ecryptfs_release_lower_page(*lower_page); + 
(*lower_page) = NULL; } out: return rc; @@ -598,19 +582,16 @@ ecryptfs_commit_lower_page(struct page * struct file *lower_file, int byte_offset, int region_size) { - int page_locked = 1; int rc = 0; rc = lower_inode-i_mapping-a_ops-commit_write( lower_file, lower_page, byte_offset, region_size); - if (rc == AOP_TRUNCATED_PAGE) - page_locked = 0; if (rc 0) { ecryptfs_printk(KERN_ERR, Error committing write; rc = [%d]\n, rc); } else rc = 0; - ecryptfs_release_lower_page(lower_page,
[patch 33/44] gfs2 convert to new aops
From: Steven Whitehouse [EMAIL PROTECTED]

(needs a SOB)

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>

 fs/gfs2/ops_address.c | 209 +-
 1 file changed, 125 insertions(+), 84 deletions(-)

Index: linux-2.6/fs/gfs2/ops_address.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_address.c
+++ linux-2.6/fs/gfs2/ops_address.c
@@ -17,6 +17,7 @@
 #include <linux/mpage.h>
 #include <linux/fs.h>
 #include <linux/writeback.h>
+#include <linux/swap.h>
 #include <linux/gfs2_ondisk.h>
 #include <linux/lm_interface.h>
 
@@ -337,45 +338,49 @@ out_unlock:
 }
 
 /**
- * gfs2_prepare_write - Prepare to write a page to a file
+ * gfs2_write_begin - Begin to write to a file
  * @file: The file to write to
- * @page: The page which is to be prepared for writing
- * @from: From (byte range within page)
- * @to: To (byte range within page)
+ * @mapping: The mapping in which to write
+ * @pos: The file offset at which to start writing
+ * @len: Length of the write
+ * @flags: Various flags
+ * @pagep: Pointer to return the page
+ * @fsdata: Pointer to return fs data (unused by GFS2)
  *
  * Returns: errno
  */
 
-static int gfs2_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
+static int gfs2_write_begin(struct file *file, struct address_space *mapping,
+			    loff_t pos, unsigned len, unsigned flags,
+			    struct page **pagep, void **fsdata)
 {
-	struct gfs2_inode *ip = GFS2_I(page->mapping->host);
-	struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
+	struct gfs2_inode *ip = GFS2_I(mapping->host);
+	struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
 	unsigned int data_blocks, ind_blocks, rblocks;
 	int alloc_required;
 	int error = 0;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + from;
-	loff_t end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 	struct gfs2_alloc *al;
-	unsigned int write_len = to - from;
-
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+	unsigned to = from + len;
+	struct page *page;
 
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_ATIME|LM_FLAG_TRY_1CB, &ip->i_gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_ATIME, &ip->i_gh);
 	error = gfs2_glock_nq_atime(&ip->i_gh);
-	if (unlikely(error)) {
-		if (error == GLR_TRYFAILED) {
-			unlock_page(page);
-			error = AOP_TRUNCATED_PAGE;
-			yield();
-		}
+	if (unlikely(error))
 		goto out_uninit;
-	}
 
-	gfs2_write_calc_reserv(ip, write_len, &data_blocks, &ind_blocks);
+	error = -ENOMEM;
+	page = __grab_cache_page(mapping, index);
+	*pagep = page;
+	if (!page)
+		goto out_unlock;
+
+	gfs2_write_calc_reserv(ip, len, &data_blocks, &ind_blocks);
 
-	error = gfs2_write_alloc_required(ip, pos, write_len, &alloc_required);
+	error = gfs2_write_alloc_required(ip, pos, len, &alloc_required);
 	if (error)
-		goto out_unlock;
+		goto out_putpage;
 
 
 	ip->i_alloc.al_requested = 0;
@@ -407,7 +412,7 @@ static int gfs2_prepare_write(struct fil
 		goto out;
 
 	if (gfs2_is_stuffed(ip)) {
-		if (end > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
+		if (pos + len > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
 			error = gfs2_unstuff_dinode(ip, page);
 			if (error == 0)
 				goto prepare_write;
@@ -429,6 +434,10 @@ out_qunlock:
 out_alloc_put:
 		gfs2_alloc_put(ip);
 	}
+out_putpage:
+	page_cache_release(page);
+	if (pos + len > ip->i_inode.i_size)
+		vmtruncate(&ip->i_inode, ip->i_inode.i_size);
 out_unlock:
 	gfs2_glock_dq_m(1, &ip->i_gh);
 out_uninit:
@@ -439,96 +448,128 @@ out_uninit:
 }
 
 /**
- * gfs2_commit_write - Commit write to a file
+ * gfs2_stuffed_write_end - Write end for stuffed files
+ * @inode: The inode
+ * @dibh: The buffer_head containing the on-disk inode
+ * @pos: The file position
+ * @len: The length of the write
+ * @copied: How much was actually copied by the VFS
+ * @page: The page
+ *
+ * This copies the data from the page into the inode block after
+ * the inode data structure itself.
+ *
+ * Returns: errno
+ */
+static int gfs2_stuffed_write_end(struct inode *inode, struct buffer_head *dibh,
+				  loff_t pos, unsigned len, unsigned copied,
+				  struct page *page)
+{
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_sbd *sdp = GFS2_SB(inode);
+	u64 to = pos + copied;
+	void *kaddr;
+	unsigned char
[patch 36/44] fuse convert to new aops
[mszeredi]
 - don't send zero length write requests
 - it is not legal for the filesystem to return with zero written bytes

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

 fs/fuse/file.c | 48 +---
 1 file changed, 33 insertions(+), 15 deletions(-)

Index: linux-2.6/fs/fuse/file.c
===================================================================
--- linux-2.6.orig/fs/fuse/file.c
+++ linux-2.6/fs/fuse/file.c
@@ -443,22 +443,25 @@ static size_t fuse_send_write(struct fus
 	return outarg.size;
 }
 
-static int fuse_prepare_write(struct file *file, struct page *page,
-			      unsigned offset, unsigned to)
-{
-	/* No op */
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+	*pagep = __grab_cache_page(mapping, index);
+	if (!*pagep)
+		return -ENOMEM;
 	return 0;
 }
 
-static int fuse_commit_write(struct file *file, struct page *page,
-			     unsigned offset, unsigned to)
+static int fuse_buffered_write(struct file *file, struct inode *inode,
+			       loff_t pos, unsigned count, struct page *page)
 {
 	int err;
 	size_t nres;
-	unsigned count = to - offset;
-	struct inode *inode = page->mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	loff_t pos = page_offset(page) + offset;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
 	struct fuse_req *req;
 
 	if (is_bad_inode(inode))
@@ -474,20 +477,35 @@ static int fuse_commit_write(struct file
 	nres = fuse_send_write(req, file, inode, pos, count);
 	err = req->out.h.error;
 	fuse_put_request(fc, req);
-	if (!err && nres != count)
+	if (!err && !nres)
 		err = -EIO;
 	if (!err) {
-		pos += count;
+		pos += nres;
 		spin_lock(&fc->lock);
 		if (pos > inode->i_size)
 			i_size_write(inode, pos);
 		spin_unlock(&fc->lock);
-		if (offset == 0 && to == PAGE_CACHE_SIZE)
+		if (count == PAGE_CACHE_SIZE)
 			SetPageUptodate(page);
 	}
 	fuse_invalidate_attr(inode);
-	return err;
+	return err ? err : nres;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	struct inode *inode = mapping->host;
+	int res = 0;
+
+	if (copied)
+		res = fuse_buffered_write(file, inode, pos, copied, page);
+
+	unlock_page(page);
+	page_cache_release(page);
+	return res;
 }
 
 static void fuse_release_user_pages(struct fuse_req *req, int write)
@@ -816,8 +834,8 @@ static const struct file_operations fuse
 
 static const struct address_space_operations fuse_file_aops = {
 	.readpage	= fuse_readpage,
-	.prepare_write	= fuse_prepare_write,
-	.commit_write	= fuse_commit_write,
+	.write_begin	= fuse_write_begin,
+	.write_end	= fuse_write_end,
 	.readpages	= fuse_readpages,
 	.set_page_dirty	= fuse_set_page_dirty,
 	.bmap		= fuse_bmap,

--
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
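The write_begin conversions in this series all turn the byte position pos into a pagecache index and an offset within that page. Here is a minimal userspace sketch of that arithmetic. The 4 KiB page size, the SKETCH_* constants, struct page_span and split_pos() are illustrative assumptions, not kernel definitions, and the one-page-per-call limit reflects how the callers shown in these patches appear to behave.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for PAGE_CACHE_SHIFT / PAGE_CACHE_SIZE; a 4 KiB page is assumed. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Hypothetical helper type: where a write lands inside the pagecache. */
struct page_span {
	uint64_t index;	/* which pagecache page the write starts in */
	unsigned from;	/* first byte written within that page */
	unsigned to;	/* one past the last byte written within that page */
};

/* Split (pos, len) the way the write_begin implementations above do. */
static struct page_span split_pos(uint64_t pos, unsigned len)
{
	struct page_span s;

	s.index = pos >> SKETCH_PAGE_SHIFT;
	s.from = (unsigned)(pos & (SKETCH_PAGE_SIZE - 1));
	/* Assumption: one call never crosses a page boundary. */
	assert(len <= SKETCH_PAGE_SIZE - s.from);
	s.to = s.from + len;
	return s;
}
```

For example, a 200-byte write at position 4196 lands in page 1 and covers bytes 100 to 299 of that page.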
[patch 31/44] smb convert to new aops
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/smbfs/file.c | 34 +-
 1 file changed, 25 insertions(+), 9 deletions(-)

Index: linux-2.6/fs/smbfs/file.c
===================================================================
--- linux-2.6.orig/fs/smbfs/file.c
+++ linux-2.6/fs/smbfs/file.c
@@ -290,29 +290,45 @@ out:
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
  */
-static int smb_prepare_write(struct file *file, struct page *page,
-			     unsigned offset, unsigned to)
-{
+static int smb_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	*pagep = __grab_cache_page(mapping, index);
+	if (!*pagep)
+		return -ENOMEM;
 	return 0;
 }
 
-static int smb_commit_write(struct file *file, struct page *page,
-			    unsigned offset, unsigned to)
+static int smb_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
 	int status;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
 
-	status = -EFAULT;
 	lock_kernel();
-	status = smb_updatepage(file, page, offset, to-offset);
+	status = smb_updatepage(file, page, offset, copied);
 	unlock_kernel();
+
+	if (!status) {
+		if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
+			SetPageUptodate(page);
+		status = copied;
+	}
+
+	unlock_page(page);
+	page_cache_release(page);
+
 	return status;
 }
 
 const struct address_space_operations smb_file_aops = {
 	.readpage = smb_readpage,
 	.writepage = smb_writepage,
-	.prepare_write = smb_prepare_write,
-	.commit_write = smb_commit_write
+	.write_begin = smb_write_begin,
+	.write_end = smb_write_end,
 };
 
 /*
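The begin/end pairing in the smb conversion can be modelled in userspace. The toy below only tracks the lock, reference count and uptodate state of one page. struct toy_page, toy_write_begin(), toy_write_end() and TOY_PAGE_SIZE are hypothetical names made up for this sketch. It assumes that write_begin hands back a locked page with a reference held, and that write_end drops both and returns the number of bytes accepted or a negative errno.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size, not the kernel constant */

/* Hypothetical stand-in for struct page: only the state this sketch needs. */
struct toy_page {
	int locked;
	int refcount;
	int uptodate;
};

/* Like a write_begin built on __grab_cache_page(): return the page locked,
 * with an extra reference held for the caller. */
static int toy_write_begin(struct toy_page *cached, struct toy_page **pagep)
{
	assert(cached != NULL && pagep != NULL);
	cached->refcount++;
	cached->locked = 1;
	*pagep = cached;
	return 0;
}

/* Modelled on smb_write_end(): 'status' is the result of the filesystem's
 * own update step (0 on success, negative errno on failure). */
static int toy_write_end(struct toy_page *page, unsigned copied, int status)
{
	assert(page != NULL && page->locked);
	if (!status) {
		/* Only a write covering the whole page makes it uptodate. */
		if (!page->uptodate && copied == TOY_PAGE_SIZE)
			page->uptodate = 1;
		status = (int)copied;
	}
	/* The unlock and release happen on both success and failure. */
	page->locked = 0;
	page->refcount--;
	return status;
}
```

The thing to check in a real conversion is that every exit from write_end leaves the page unlocked and released, since the caller no longer does that itself.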
[patch 27/44] hpfs convert to new aops
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hpfs/file.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

Index: linux-2.6/fs/hpfs/file.c
===================================================================
--- linux-2.6.orig/fs/hpfs/file.c
+++ linux-2.6/fs/hpfs/file.c
@@ -86,25 +86,33 @@ static int hpfs_writepage(struct page *p
 {
 	return block_write_full_page(page,hpfs_get_block, wbc);
 }
+
 static int hpfs_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page,hpfs_get_block);
 }
-static int hpfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	return cont_prepare_write(page,from,to,hpfs_get_block,
-		&hpfs_i(page->mapping->host)->mmu_private);
+
+static int hpfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				hpfs_get_block,
+				&hpfs_i(mapping->host)->mmu_private);
 }
+
 static sector_t _hpfs_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,hpfs_get_block);
 }
+
 const struct address_space_operations hpfs_aops = {
 	.readpage = hpfs_readpage,
 	.writepage = hpfs_writepage,
 	.sync_page = block_sync_page,
-	.prepare_write = hpfs_prepare_write,
-	.commit_write = generic_commit_write,
+	.write_begin = hpfs_write_begin,
+	.write_end = generic_write_end,
 	.bmap = _hpfs_bmap
 };
[patch 11/44] fs: fix data-loss on error
New buffers against uptodate pages are simply marked uptodate, while the
buffer_new bit remains set. This causes error-case code to zero out parts of
those buffers because it thinks they contain stale data: wrong, they are
actually uptodate, so this is a data loss situation.

Fix this by actually clearing buffer_new and marking the buffer dirty. It makes
sense to always clear buffer_new before setting a buffer uptodate.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/buffer.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1800,7 +1800,9 @@ static int __block_prepare_write(struct
 			unmap_underlying_metadata(bh->b_bdev,
 							bh->b_blocknr);
 			if (PageUptodate(page)) {
+				clear_buffer_new(bh);
 				set_buffer_uptodate(bh);
+				mark_buffer_dirty(bh);
 				continue;
 			}
 			if (block_end > to || block_start < from) {
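The failure mode described in this patch can be reproduced with a toy model in userspace. This is only a sketch under simplifying assumptions: struct toy_buffer, the BH_* bits, toy_prepare() and toy_error_cleanup() are hypothetical stand-ins for the buffer_head state and the error path. The real code zeroes only part of a buffer, while the toy clears all of it.

```c
#include <assert.h>
#include <string.h>

/* Toy flag bits standing in for the buffer_head state bits. */
enum { BH_NEW = 1u << 0, BH_UPTODATE = 1u << 1, BH_DIRTY = 1u << 2 };

/* Hypothetical stand-in for a buffer_head and the data it maps. */
struct toy_buffer {
	unsigned flags;
	char data[8];
};

/* Prepare step for a freshly mapped buffer on an already-uptodate page.
 * With apply_fix == 0 this mimics the old behaviour. */
static void toy_prepare(struct toy_buffer *bh, int apply_fix)
{
	assert(bh != NULL);
	bh->flags |= BH_NEW;			/* block was just allocated */
	if (apply_fix)
		bh->flags &= ~BH_NEW;		/* clear_buffer_new() */
	bh->flags |= BH_UPTODATE;		/* set_buffer_uptodate() */
	if (apply_fix)
		bh->flags |= BH_DIRTY;		/* mark_buffer_dirty() */
}

/* Error path: anything still flagged new is assumed to hold stale data
 * and gets zeroed. */
static void toy_error_cleanup(struct toy_buffer *bh)
{
	assert(bh != NULL);
	if (bh->flags & BH_NEW) {
		memset(bh->data, 0, sizeof(bh->data));
		bh->flags &= ~BH_NEW;
	}
}
```

Running toy_prepare() without the fix and then toy_error_cleanup() wipes data that was valid. With the fix the data survives and the buffer is left dirty, so it will be written back.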
[patch 09/44] mm: fix pagecache write deadlocks
Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8      - recursive deadlock if page is same
2,8;2,8  - ABBA deadlock if page is different
2,6;b,f  - ABBA deadlock if page is same

The solution is as follows:

1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish non-present
    pages in a readable mapping from the lack of a readable mapping.

2.  If we find the destination page is non-uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

This increases the CPU/memory copy cost by almost 50% on the affected
workloads. That will be solved by introducing a new set of pagecache write
aops in a subsequent patch.
Cc: Linux Memory Management [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] include/linux/pagemap.h | 11 +++- mm/filemap.c| 114 2 files changed, 104 insertions(+), 21 deletions(-) Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1933,11 +1933,12 @@ generic_file_buffered_write(struct kiocb filemap_set_next_iovec(cur_iov, nr_segs, iov_offset, written); do { + struct page *src_page; struct page *page; pgoff_t index; /* Pagecache index for current page */ unsigned long offset; /* Offset into pagecache page */ - unsigned long maxlen; /* Bytes remaining in current iovec */ - size_t bytes; /* Bytes to write to page */ + unsigned long seglen; /* Bytes remaining in current iovec */ + unsigned long bytes;/* Bytes to write to page */ size_t copied; /* Bytes copied from user */ buf = cur_iov-iov_base + iov_offset; @@ -1947,20 +1948,30 @@ generic_file_buffered_write(struct kiocb if (bytes count) bytes = count; - maxlen = cur_iov-iov_len - iov_offset; - if (maxlen bytes) - maxlen = bytes; + /* +* a non-NULL src_page indicates that we're doing the +* copy via get_user_pages and kmap. +*/ + src_page = NULL; + + seglen = cur_iov-iov_len - iov_offset; + if (seglen bytes) + seglen = bytes; -#ifndef CONFIG_DEBUG_VM /* * Bring in the user page that we will copy from _first_. * Otherwise there's a nasty deadlock on copying from the * same page as we're writing to, without it being marked * up-to-date. +* +* Not only is this an optimisation, but it is also required +* to check that the address is actually valid, when atomic +* usercopies are used, below. */ - fault_in_pages_readable(buf, maxlen); -#endif - + if (unlikely(fault_in_pages_readable(buf, seglen))) { + status = -EFAULT; + break; + } page = __grab_cache_page(mapping, index); if (!page) { @@ -1968,32 +1979,104 @@ generic_file_buffered_write(struct kiocb break; } + /* +* non-uptodate pages
[patch 14/44] implement simple fs aops
Implement new aops for some of the simpler filesystems.

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/configfs/inode.c   |    4 ++--
 fs/hugetlbfs/inode.c  |   16 ++--
 fs/ramfs/file-mmu.c   |    4 ++--
 fs/ramfs/file-nommu.c |    4 ++--
 fs/sysfs/inode.c      |    4 ++--
 mm/shmem.c            |   35 ---
 6 files changed, 46 insertions(+), 21 deletions(-)

Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1109,7 +1109,7 @@ static int shmem_getpage(struct inode *i
 	 * Normally, filepage is NULL on entry, and either found
 	 * uptodate immediately, or allocated and zeroed, or read
 	 * in under swappage, which is then assigned to filepage.
-	 * But shmem_prepare_write passes in a locked filepage,
+	 * But shmem_write_begin passes in a locked filepage,
 	 * which may be found not uptodate by other callers too,
 	 * and may need to be copied from the swappage read in.
 	 */
@@ -1454,14 +1454,35 @@ static const struct inode_operations shm
 static const struct inode_operations shmem_symlink_inline_operations;
 
 /*
- * Normally tmpfs makes no use of shmem_prepare_write, but it
+ * Normally tmpfs makes no use of shmem_write_begin, but it
  * lets a tmpfs file be used read-write below the loop driver.
  */
 static int
-shmem_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+shmem_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	struct inode *inode = mapping->host;
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	*pagep = NULL;
+	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+}
+
+static int
+shmem_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	struct inode *inode = page->mapping->host;
-	return shmem_getpage(inode, page->index, &page, SGP_WRITE, NULL);
+	struct inode *inode = mapping->host;
+
+	set_page_dirty(page);
+	mark_page_accessed(page);
+	page_cache_release(page);
+
+	if (pos+copied > inode->i_size)
+		i_size_write(inode, pos+copied);
+
+	return copied;
 }
 
 static ssize_t
@@ -2358,8 +2379,8 @@ static const struct address_space_operat
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
 #ifdef CONFIG_TMPFS
-	.prepare_write	= shmem_prepare_write,
-	.commit_write	= simple_commit_write,
+	.write_begin	= shmem_write_begin,
+	.write_end	= shmem_write_end,
 #endif
 	.migratepage	= migrate_page,
 };
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -40,8 +40,8 @@ extern struct super_block * configfs_sb;
 
 static const struct address_space_operations configfs_aops = {
 	.readpage	= simple_readpage,
-	.prepare_write	= simple_prepare_write,
-	.commit_write	= simple_commit_write
+	.write_begin	= simple_write_begin,
+	.write_end	= simple_write_end,
 };
 
 static struct backing_dev_info configfs_backing_dev_info = {
Index: linux-2.6/fs/sysfs/inode.c
===================================================================
--- linux-2.6.orig/fs/sysfs/inode.c
+++ linux-2.6/fs/sysfs/inode.c
@@ -20,8 +20,8 @@ extern struct super_block * sysfs_sb;
 
 static const struct address_space_operations sysfs_aops = {
 	.readpage	= simple_readpage,
-	.prepare_write	= simple_prepare_write,
-	.commit_write	= simple_commit_write
+	.write_begin	= simple_write_begin,
+	.write_end	= simple_write_end,
 };
 
 static struct backing_dev_info sysfs_backing_dev_info = {
Index: linux-2.6/fs/ramfs/file-mmu.c
===================================================================
--- linux-2.6.orig/fs/ramfs/file-mmu.c
+++ linux-2.6/fs/ramfs/file-mmu.c
@@ -29,8 +29,8 @@
 
 const struct address_space_operations ramfs_aops = {
 	.readpage	= simple_readpage,
-	.prepare_write	= simple_prepare_write,
-	.commit_write	= simple_commit_write,
+	.write_begin	= simple_write_begin,
+	.write_end	= simple_write_end,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
 };
 
Index: linux-2.6/fs/ramfs/file-nommu.c
===================================================================
--- linux-2.6.orig/fs/ramfs/file-nommu.c
+++ linux-2.6/fs/ramfs/file-nommu.c
@@ -29,8 +29,8 @@ static int ramfs_nommu_setattr(struct de
 
 const struct address_space_operations
[patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

API design contributions, code review and fixes.

Signed-off-by: Mark Fasheh [EMAIL PROTECTED]

 Documentation/filesystems/Locking |    9 -
 Documentation/filesystems/vfs.txt |   48 +++
 drivers/block/loop.c              |   77
 fs/buffer.c                       |  203 +++--
 fs/libfs.c                        |   44 +++
 fs/namei.c                        |   47 +--
 fs/splice.c                       |   70 +--
 include/linux/buffer_head.h       |   10 +
 include/linux/fs.h                |   28
 include/linux/pagemap.h           |    2
 mm/filemap.c                      |  233 ++
 11 files changed, 561 insertions(+), 210 deletions(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -391,6 +391,8 @@ enum positive_aop_returns {
 	AOP_TRUNCATED_PAGE	= 0x80001,
 };
 
+#define AOP_FLAG_UNINTERRUPTIBLE	0x0001 /* will not do a short write */
+
 /*
  * oh the beauties of C type declarations.
  */
@@ -451,6 +453,14 @@ struct address_space_operations {
 	 */
 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
 	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+
+	int (*write_begin)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+	int (*write_end)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
+
 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
 	sector_t (*bmap)(struct address_space *, sector_t);
 	void (*invalidatepage) (struct page *, unsigned long);
@@ -465,6 +475,18 @@ struct address_space_operations {
 	int (*launder_page) (struct page *);
 };
 
+/*
+ * pagecache_write_begin/pagecache_write_end must be used by general code
+ * to write into the pagecache.
+ */
+int pagecache_write_begin(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+
+int pagecache_write_end(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
+
 struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
@@ -1969,6 +1991,12 @@ extern int simple_prepare_write(struct f
 			unsigned offset, unsigned to);
 extern int simple_commit_write(struct file *file, struct page *page,
 				unsigned offset, unsigned to);
+extern int simple_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata);
+extern int simple_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata);
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct nameidata *);
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1950,6 +1950,93 @@ inline int generic_write_checks(struct f
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+int pagecache_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
+{
+	const struct address_space_operations *aops = mapping->a_ops;
+
+	if (aops->write_begin) {
+		return aops->write_begin(file, mapping, pos, len, flags,
+							pagep, fsdata);
+	} else {
+		int ret;
+		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+		unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+		struct inode *inode = mapping->host;
+		struct page *page;
+again:
+		page
[patch 15/44] block_dev convert to new aops
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/block_dev.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -378,14 +378,26 @@ static int blkdev_readpage(struct file *
 	return block_read_full_page(page, blkdev_get_block);
 }
 
-static int blkdev_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	return block_prepare_write(page, from, to, blkdev_get_block);
+static int blkdev_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				blkdev_get_block);
 }
 
-static int blkdev_commit_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int blkdev_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	return block_commit_write(page, from, to);
+	int ret;
+	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return ret;
 }
 
 /*
@@ -1333,8 +1345,8 @@ const struct address_space_operations de
 	.readpage	= blkdev_readpage,
 	.writepage	= blkdev_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= blkdev_prepare_write,
-	.commit_write	= blkdev_commit_write,
+	.write_begin	= blkdev_write_begin,
+	.write_end	= blkdev_write_end,
 	.writepages	= generic_writepages,
 	.direct_IO	= blkdev_direct_IO,
 };