Re: it seems Evolution remove the Tabs
Andrew,

Thanks for your information :-)

Coly

On Fri, 2007-05-25 at 16:33 +1000, andrew hendry wrote:
> Select your whole mail and use pre-format in Evolution, or change it to
> pre-format and do Insert -> Text File. This should send it with the tabs
> intact. The confusing bit, I think, is that if you test it by sending it
> to yourself, you still can't see the tabs. Read it in another mailer,
> something like Sylpheed or mutt, to see if the tabs are really there.
>
> On 5/25/07, coly [EMAIL PROTECTED] wrote:
> > Hi,
> >
> > I tested again; it seems Evolution replaces the tabs with blanks. How
> > can I resolve this issue in Evolution? I am trying :-)
> >
> > Coly
> >
> > On Fri, 2007-05-25 at 07:52 +0200, Jan Engelhardt wrote:
> > > On May 25 2007 09:30, WANG Cong wrote:
> > > > Yes, I found all tabs gone when I received the mail. When I post
> > > > the next version of the patch, I will test by sending it to
> > > > myself first :-) Thanks for your information.
> > > >
> > > > > Blame Gmail.
> > > >
> > > > I am using gmail too.
> > >
> > > That's not gmail's fault. Then it is one of these:
> > >
> > >  - gmail's default settings for web input suck, or
> > >  - the web browser reformats it (not so much -- pastebin.ca suffers
> > >    from something similar, but *not the same*, in that it translates
> > >    all tabs into spaces, but at least it keeps the width), or
> > >  - you are using your own client and directly SMTPing gmail's
> > >    servers, in which case unwanted reformatting by broken MTAs can
> > >    be bypassed.
> > >
> > > I think your email client sucks. So which email client are you
> > > using, coly? I recommend mutt to you. ;)
> > >
> > > > X-Mailer: Evolution 2.6.0
> > >
> > > Hm, this looks like another of these Thunderbird cases. (Meaning
> > > Thunderbird users also get their patches wrapped and mangled unless
> > > they set some option that is not on by default.)
Jan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH] AFS: Implement file locking
David Howells wrote:
> Implement file locking for AFS.
>
> Signed-off-by: David Howells [EMAIL PROTECTED]
> ---
>
>  fs/afs/Makefile    |    1
>  fs/afs/afs.h       |    8 +
>  fs/afs/afs_fs.h    |    3
>  fs/afs/callback.c  |    3
>  fs/afs/dir.c       |    1
>  fs/afs/file.c      |    2
>  fs/afs/flock.c     |  558
>  fs/afs/fsclient.c  |  155 ++
>  fs/afs/internal.h  |   30 +++
>  fs/afs/main.c      |    1
>  fs/afs/misc.c      |    1
>  fs/afs/super.c     |    3
>  fs/afs/vnode.c     |  130 +++-
>  include/linux/fs.h |    4
>  14 files changed, 885 insertions(+), 15 deletions(-)
>
> diff --git a/fs/afs/Makefile b/fs/afs/Makefile
> index 73ce561..a666710 100644
> --- a/fs/afs/Makefile
> +++ b/fs/afs/Makefile
> @@ -8,6 +8,7 @@ kafs-objs := \
>  	cmservice.o \
>  	dir.o \
>  	file.o \
> +	flock.o \
>  	fsclient.o \
>  	inode.o \
>  	main.o \
>
> diff --git a/fs/afs/afs.h b/fs/afs/afs.h
> index 2452579..c548aa3 100644
> --- a/fs/afs/afs.h
> +++ b/fs/afs/afs.h
> @@ -37,6 +37,13 @@ typedef enum {
>  	AFS_FTYPE_SYMLINK	= 3,
>  } afs_file_type_t;
>
> +typedef enum {
> +	AFS_LOCK_READ	= 0,	/* read lock request */
> +	AFS_LOCK_WRITE	= 1,	/* write lock request */
> +} afs_lock_type_t;

Why typedef?

regards,
--
Jiri Slaby		(http://www.fi.muni.cz/~xslaby/)
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E
[RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
This mail is about an issue that has been of concern to me for quite a
while, and I think it is (well past) time to air it more widely and try to
come to a resolution.

The issue is how write barriers (the block-device kind, not the
memory-barrier kind) should be handled by the various layers. The
following is my understanding, which could well be wrong in various
specifics. Corrections and other comments are more than welcome.

What are barriers?
==================

Barriers (as generated by requests with BIO_RW_BARRIER) are intended to
ensure that the data in the barrier request is not visible until all
writes submitted earlier are safe on the media, and that the data is safe
on the media before any subsequently submitted requests are visible on
the device.

This is achieved by tagging requests in the elevator (or any other request
queue) so that no re-ordering is performed around a BIO_RW_BARRIER
request, and by sending appropriate commands to the device so that any
write-behind caching is defeated by the barrier request.

Alongside BIO_RW_BARRIER is blkdev_issue_flush, which calls
q->issue_flush_fn. This can be used to achieve similar effects.

There is no guarantee that a device can support BIO_RW_BARRIER - it is
always possible that a request will fail with EOPNOTSUPP. Conversely,
blkdev_issue_flush must be supported on any device that uses write-behind
caching (if it cannot be supported, then write-behind caching should be
turned off, at least by default).

We can think of there being three types of devices:

1/ SAFE. With a SAFE device, there is no write-behind cache, or if there
   is, it is non-volatile. Once a write completes it is completely safe.
   Such a device does not require barriers or ->issue_flush_fn, and can
   respond to them either with a no-op or with -EOPNOTSUPP (the former is
   preferred).

2/ FLUSHABLE. A FLUSHABLE device may have a volatile write-behind cache.
   This cache can be flushed with a call to blkdev_issue_flush. It may
   not support barrier requests.

3/ BARRIER.
A BARRIER device supports both blkdev_issue_flush and BIO_RW_BARRIER.
Either may be used to synchronise any write-behind cache to non-volatile
storage (media).

Handling of SAFE and FLUSHABLE devices is essentially the same and will
also work on a BARRIER device. A BARRIER device has the option of more
efficient handling.

How does a filesystem use this?
===============================

A filesystem will often have a concept of a 'commit' block which makes an
assertion about the correctness of other blocks in the filesystem. In the
most gross sense, this could be the writing of the superblock of an ext2
filesystem, with the dirty bit clear. This write commits all other writes
to the filesystem that precede it.

More subtle/useful is the commit block in a journal, as with ext3 and
others. This write commits some number of preceding writes in the journal
or elsewhere.

The filesystem will want to ensure that all preceding writes are safe
before writing the barrier block. There are two ways to achieve this.

1/ Issue all 'preceding writes', wait for them to complete (bi_endio
   called), call blkdev_issue_flush, issue the commit write, wait for it
   to complete, then call blkdev_issue_flush a second time. (This is
   needed for FLUSHABLE.)

2/ Set the BIO_RW_BARRIER bit in the write request for the commit block.
   (This is more efficient on BARRIER.)

The second, while much easier, can fail. So a filesystem should be
prepared to deal with that failure by falling back to the first option.
Thus the general sequence might be:

 a/ issue all preceding writes
 b/ issue the commit write with BIO_RW_BARRIER
 c/ wait for the commit to complete.
    If it was successful - done.
    If it failed other than with EOPNOTSUPP, abort,
    else continue.
 d/ wait for all 'preceding writes' to complete
 e/ call blkdev_issue_flush
 f/ issue the commit write without BIO_RW_BARRIER
 g/ wait for the commit write to complete;
    if it failed, abort
 h/ call blkdev_issue
 DONE

Steps b and c can be left out if it is known that the device does not
support barriers.
The only way to discover this is to try it and see if it fails.

I don't think any filesystem follows all these steps.

ext3 has the right structure, but it doesn't include steps e and h.
reiserfs is similar. It does have a call to blkdev_issue_flush, but that
is only on the fsync path, so it isn't really protecting general journal
commits. XFS - I'm less sure. I think it does 'a' then 'd', then 'b' or
'f' depending on whether it thinks the device handles barriers, and
finally 'g'. I haven't looked at other filesystems.

So for devices that support BIO_RW_BARRIER, and for devices that don't
need any flush, they work OK, but for devices that need flushing and
don't support BIO_RW_BARRIER, none of them work. This should be easy to
fix.
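The a-h sequence above can be sketched as a small userspace model. This is
not kernel code: the `struct dev` device model and all the `submit_*`/
`*_sim` helpers are hypothetical stand-ins for bio submission and
blkdev_issue_flush, used only to show the barrier-then-fallback control
flow.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical device model: counts I/Os and may lack barrier support. */
struct dev { int supports_barrier; int flushes; int writes; };

/* Stand-in for submitting the commit write with BIO_RW_BARRIER set. */
static int submit_barrier_write(struct dev *d)
{
	if (!d->supports_barrier)
		return -EOPNOTSUPP;
	d->writes++;
	return 0;
}

static int blkdev_issue_flush_sim(struct dev *d) { d->flushes++; return 0; }
static int submit_plain_write(struct dev *d)     { d->writes++;  return 0; }

/* Steps a-h from the text: try the barrier commit first, and fall back
 * to flush, plain write, flush if the device returns -EOPNOTSUPP. */
static int commit(struct dev *d)
{
	/* a: preceding writes already issued; b/c: barrier commit write */
	int err = submit_barrier_write(d);
	if (err != -EOPNOTSUPP)
		return err;
	/* d: wait for preceding writes (a no-op in this model); e: flush */
	blkdev_issue_flush_sim(d);
	/* f/g: reissue the commit write without the barrier flag */
	err = submit_plain_write(d);
	if (err)
		return err;
	/* h: flush again so the commit itself reaches the media */
	return blkdev_issue_flush_sim(d);
}
```

On a BARRIER device the commit is a single tagged write; on a FLUSHABLE
device the same call degrades to the flush-write-flush sequence, which is
exactly the fallback the text asks filesystems to implement.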
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
Hi,

2007/5/24, James Morris [EMAIL PROTECTED]:
> I can restate my question and ask why you'd want a security policy like:
>
>   Subject 'sysadmin' has:
>     read access to /etc/shadow
>     read/write access to /views/sysadmin/etc/shadow
>
> where the objects referenced by the paths are identical and visible to
> the subject along both paths, in keeping with your description of
> "policy may allow access to some locations but not to others"?

If I understand correctly, the original issue was whether or not to allow
passing the vfsmount to the inode_create LSM hook. That is independent of
AppArmor or pathname-based MAC, I think. It has been proven that Linux
can be used without that change, but it is also clear that the current
LSM causes the ambiguities the AppArmor people have explained. Clearing
up those ambiguities is an obvious gain for Linux and will benefit
auditing as well as pathname-based MAC.

So here's my opinion: unless somebody can explain a clear reason (or
need) to keep these ambiguities unresolved, we should consider merging
the proposal.

Thanks.
--
Toshiharu Harada
[EMAIL PROTECTED]
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Fri, May 25, 2007 at 05:58:25PM +1000, Neil Brown wrote:
> We can think of there being three types of devices:
>
> 1/ SAFE. With a SAFE device, there is no write-behind cache, or if
>    there is it is non-volatile. Once a write completes it is completely
>    safe. Such a device does not require barriers or ->issue_flush_fn,
>    and can respond to them either by a no-op or with -EOPNOTSUPP (the
>    former is preferred).
>
> 2/ FLUSHABLE. A FLUSHABLE device may have a volatile write-behind
>    cache. This cache can be flushed with a call to blkdev_issue_flush.
>    It may not support barrier requests.

So it returns -EOPNOTSUPP to any barrier request?

> 3/ BARRIER. A BARRIER device supports both blkdev_issue_flush and
>    BIO_RW_BARRIER. Either may be used to synchronise any write-behind
>    cache to non-volatile storage (media).
>
> Handling of SAFE and FLUSHABLE devices is essentially the same and can
> work on a BARRIER device. The BARRIER device has the option of more
> efficient handling.
>
> How does a filesystem use this?
> ===============================
>
> The filesystem will want to ensure that all preceding writes are safe
> before writing the barrier block. There are two ways to achieve this.

Three, actually.

> 1/ Issue all 'preceding writes', wait for them to complete (bi_endio
>    called), call blkdev_issue_flush, issue the commit write, wait for
>    it to complete, call blkdev_issue_flush a second time. (This is
>    needed for FLUSHABLE)

*nod*

> 2/ Set the BIO_RW_BARRIER bit in the write request for the commit
>    block. (This is more efficient on BARRIER).

*nod*

3/ Use a SAFE device.

> The second, while much easier, can fail.

So we do a test I/O to see if the device supports them before enabling
that mode. But, as we've recently discovered, this is not sufficient to
detect *correctly functioning* barrier support.

> So a filesystem should be prepared to deal with that failure by falling
> back to the first option.

I don't buy that argument.

> Thus the general sequence might be:
>
> a/ issue all preceding writes.
> b/ issue the commit write with BIO_RW_BARRIER

At this point, the filesystem has done everything it needs to ensure that
the block layer has been informed of the I/O ordering requirements. Why
should the filesystem now have to detect block layer breakage, and then
use a different block layer API to issue the same I/O under the same
constraints?

> c/ wait for the commit to complete.
>    If it was successful - done.
>    If it failed other than with EOPNOTSUPP, abort
>    else continue
> d/ wait for all 'preceding writes' to complete
> e/ call blkdev_issue_flush
> f/ issue commit write without BIO_RW_BARRIER
> g/ wait for commit write to complete
>    if it failed, abort
> h/ call blkdev_issue

_flush?

> DONE
>
> steps b and c can be left out if it is known that the device does not
> support barriers. The only way to discover this to try and see if it
> fails.

That's a very linear, single-threaded way of looking at it... ;)

> I don't think any filesystem follows all these steps.
>
> ext3 has the right structure, but it doesn't include steps e and h.
> reiserfs is similar. It does have a call to blkdev_issue_flush, but
> that is only on the fsync path, so it isn't really protecting general
> journal commits.
> XFS - I'm less sure. I think it does 'a' then 'd', then 'b' or 'f'
> depending on whether it thinks the device handles barriers, and
> finally 'g'.

That's right, except for the 'g' (or 'c') bit - commit writes are async
and nothing waits for them - the I/O completion wakes anything waiting on
its completion. (Yes, all XFS barrier I/Os are issued async, which is why
having to handle an -EOPNOTSUPP error is a real pain. The fix I currently
have is to reissue the I/O from the completion handler, which is ugly,
ugly, ugly.)

> So for devices that support BIO_RW_BARRIER, and for devices that don't
> need any flush, they work OK, but for devices that need flushing, but
> don't support BIO_RW_BARRIER, none of them work. This should be easy
> to fix.
Right - XFS as it stands was designed to work on SAFE devices, and we've
modified it to work on BARRIER devices. We don't support FLUSHABLE
devices at all.

But if the filesystem supports BARRIER devices, I don't see any reason
why a filesystem needs to be modified to support FLUSHABLE devices - the
key point being that by the time the filesystem has issued the commit
write it has already waited for all its dependent I/O, and so all the
block device needs to do is issue flushes on either side of the commit
write.

> HOW DO MD or DM USE THIS
>
> 1/ striping devices. This includes md/raid0, md/linear, dm-linear,
>    dm-stripe and probably others. These devices can easily support
>    blkdev_issue_flush by simply calling blkdev_issue_flush on all
>    component
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Fri, May 25 2007, David Chinner wrote:
> > The second, while much easier, can fail.
>
> So we do a test I/O to see if the device supports them before enabling
> that mode. But, as we've recently discovered, this is not sufficient
> to detect *correctly functioning* barrier support.

Right, those are two different things. But paranoia aside, will this
ever be a real-life problem? I've always been of the opinion to just
nicely ignore them. We can't easily detect it and tell the user his hw
is crap.

> > So a filesystem should be prepared to deal with that failure by
> > falling back to the first option.
>
> I don't buy that argument.

The problem with Neil's reasoning there is that blkdev_issue_flush() may
use the same method as the barrier to ensure data is on platter. A
barrier write will include a flush, but it may also use the FUA bit to
ensure data is on platter. So the only situation where a fallback from a
barrier to a flush would be valid is if the device lied and told you it
could do FUA but it could not, and that is the reason why the barrier
write failed. If that is the case, the block layer should stop using FUA
and fall back to flush-write-flush. And if it does that, then there's
never a valid reason to switch from using barrier writes to
blkdev_issue_flush(), since both methods would either both work or both
fail.

> > Thus the general sequence might be:
> >
> > a/ issue all preceding writes.
> > b/ issue the commit write with BIO_RW_BARRIER
>
> At this point, the filesystem has done everything it needs to ensure
> that the block layer has been informed of the I/O ordering
> requirements. Why should the filesystem now have to detect block layer
> breakage, and then use a different block layer API to issue the same
> I/O under the same constraints?

It's not block layer breakage, it's a device issue.

> > 2/ Mirror devices. This includes md/raid1 and dm-raid1.
> > ..
> > Hopefully this is unlikely to happen. What device would work
> > correctly with barriers once, and then not the next time? The answer
> > is md/raid1.
> > If you remove a failed device and add a new device that doesn't
> > support barriers, md/raid1 will notice and stop supporting barriers.
>
> In case you hadn't already guessed, I don't like this behaviour at
> all. It makes async I/O completion of barrier I/O an ugly, messy
> business, and every place you do sync I/O completion you need to put
> special error handling.

That's unfortunately very true. It's an artifact of the sometimes
problematic device capability discovery.

> If this happens to md/raid1, then why can't it simply do a
> blkdev_issue_flush, write, blkdev_issue_flush sequence to the device
> that doesn't support barriers, and then the md device *never changes
> behaviour*? Next time the filesystem is mounted, it will turn off
> barriers because they won't be supported.

Because if it doesn't support barriers, blkdev_issue_flush() wouldn't
work either. At least that is the case for SATA/IDE; SCSI is somewhat
different (and has somewhat other issues).

> > - Should the various filesystems be fixed as suggested above? Is
> >   someone willing to do that?
>
> Alternate viewpoint - should the block layer be fixed so that the
> filesystems only need to use one barrier API that provides static
> behaviour for the life of the mount?

blkdev_issue_flush() isn't part of the barrier API, and using it as a
work-around for a device that has barrier issues is wrong for the
reasons listed above. The DRAIN_FUA -> DRAIN_FLUSH automatic downgrade I
mentioned above should be added, in which case blkdev_issue_flush()
would never be needed (unless you want to do a data-less barrier, and we
should probably add that specific functionality with an empty bio
instead of providing an alternate way of doing it).

--
Jens Axboe
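The downgrade Jens describes can be modelled in a few lines of userspace
C. This is only a sketch of the idea, not the block layer: `struct dev`,
`write_fua`, and the other helpers are hypothetical stand-ins. The point
is that when a FUA write fails because the device lied about supporting
FUA, the "block layer" falls back to flush-write-flush internally, so the
caller (the filesystem) never sees the barrier fail.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical device: may claim FUA support it does not actually have. */
struct dev { int claims_fua; int fua_works; int flushes; int writes; };

static int write_fua(struct dev *d)
{
	if (!d->claims_fua)
		return -EOPNOTSUPP;
	if (!d->fua_works)
		return -EIO;	/* the device lied about FUA */
	d->writes++;
	return 0;
}

static int flush(struct dev *d)       { d->flushes++; return 0; }
static int write_plain(struct dev *d) { d->writes++;  return 0; }

/* Sketch of a barrier write with automatic downgrade: drain/flush
 * preceding writes, try pre-flush + FUA write, and on failure fall back
 * to pre-flush + plain write + post-flush instead of bouncing the error
 * up to the filesystem. */
static int barrier_write(struct dev *d)
{
	flush(d);			/* drain preceding writes */
	if (write_fua(d) == 0)
		return 0;		/* FUA write: data on platter */
	write_plain(d);			/* downgrade: plain write... */
	return flush(d);		/* ...followed by a post-flush */
}
```

Either way the caller gets a successful, durable barrier write, which is
why a filesystem-level fallback to blkdev_issue_flush() would be
redundant under this scheme.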
[EMAIL PROTECTED]: Re: [patch 00/41] Buffered write deadlock fix and new aops for 2.6.21-mm2]
I actually forgot to cc linux-fsdevel on this one. Vladimir found a
corner-case bug with a faulting source address, which has since been
fixed, but it might be interesting to anyone else following
development...

----- Forwarded message from Nick Piggin [EMAIL PROTECTED] -----

Date: Wed, 16 May 2007 09:14:06 +0200
From: Nick Piggin [EMAIL PROTECTED]
To: Vladimir V. Saveliev [EMAIL PROTECTED]
Cc: Andrew Morton [EMAIL PROTECTED]
Subject: Re: [patch 00/41] Buffered write deadlock fix and new aops for 2.6.21-mm2
In-Reply-To: [EMAIL PROTECTED]
User-Agent: Mutt/1.5.9i

cc'ed linux-fsdevel again...

On Tue, May 15, 2007 at 10:00:38PM +0400, Vladimir V. Saveliev wrote:
> Hello
>
> On Tuesday 15 May 2007 02:42, Nick Piggin wrote:
> > On Mon, May 14, 2007 at 10:28:45PM +0400, Vladimir V. Saveliev wrote:
> > > Hello
> > >
> > > There is a problem with the new write path. If you expand an empty
> > > file with truncate and then write so that in one write the file
> > > tail is overwritten and something is appended to the file - the
> > > write loops forever writing to the page containing the file tail.
> > > Something wrong happens writing to the uptodate last page of a
> > > file, I guess. I can send a simple program if necessary.
> >
> > Is this reiserfs with your reiserfs patch?
>
> No, this is a common problem. I get it easily on ext3 and ext2.
>
> > Yes, please send a simple program and I'll have a look.

Thanks, that was really helpful!

What the program does is to write a non-faulted page into pagecache,
which means we're relying on fault_in_pages_readable to bring it in for
us. Usually it does; however, your specific write pattern required that
2 pages be brought in, and _also_ that the total number of bytes to
write went past that 2nd page. This caused the 1st and an Nth (>2) page
to be faulted in, but the atomic copy_from_user really needed the 2nd
page :)

This fixes it for me.
---
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2007-05-16 16:58:41.0 +1000
+++ linux-2.6/include/linux/fs.h	2007-05-16 16:58:47.0 +1000
@@ -419,7 +419,7 @@
 size_t iov_iter_copy_from_user(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i);
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(struct iov_iter *i);
 
 static inline void iov_iter_init(struct iov_iter *i,
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-05-16 14:11:18.0 +1000
+++ linux-2.6/mm/filemap.c	2007-05-16 17:03:29.0 +1000
@@ -1806,11 +1806,10 @@
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
-int iov_iter_fault_in_readable(struct iov_iter *i)
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 {
-	size_t seglen = min(i->iov->iov_len - i->iov_offset, i->count);
 	char __user *buf = i->iov->iov_base + i->iov_offset;
-	return fault_in_pages_readable(buf, seglen);
+	return fault_in_pages_readable(buf, bytes);
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
@@ -2102,7 +2101,7 @@
 	 * to check that the address is actually valid, when atomic
 	 * usercopies are used, below.
 	 */
-	if (unlikely(iov_iter_fault_in_readable(i))) {
+	if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
 		status = -EFAULT;
 		break;
 	}
@@ -2276,7 +2275,7 @@
 	 * to check that the address is actually valid, when atomic
 	 * usercopies are used, below.
 	 */
-	if (unlikely(iov_iter_fault_in_readable(i))) {
+	if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
		status = -EFAULT;
		break;
	}

----- End forwarded message -----
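The bug Nick describes can be demonstrated with a small userspace model.
This is a hedged sketch, not kernel code: `fault_in` models the behaviour
of fault_in_pages_readable, which touches only the first and last byte of
the given range (so at most two pages), and `copy_ok` models whether an
atomic usercopy of `bytes` would find every page it touches already
faulted in.

```c
#include <assert.h>

#define PAGE_SIZE 4096ul

/* Model of fault_in_pages_readable(): touch the first and last byte of
 * the range, i.e. fault in at most two pages. */
static void fault_in(unsigned long buf, unsigned long size,
		     int faulted[], unsigned long npages)
{
	if (size == 0)
		return;
	unsigned long first = buf / PAGE_SIZE;
	unsigned long last  = (buf + size - 1) / PAGE_SIZE;
	if (first < npages)
		faulted[first] = 1;
	if (last < npages)
		faulted[last] = 1;
}

/* Can an atomic copy of `bytes` from `buf` succeed, i.e. is every page
 * it touches already faulted in? */
static int copy_ok(unsigned long buf, unsigned long bytes,
		   const int faulted[])
{
	for (unsigned long p = buf / PAGE_SIZE;
	     p <= (buf + bytes - 1) / PAGE_SIZE; p++)
		if (!faulted[p])
			return 0;
	return 1;
}
```

With the pre-fix code, the range passed to `fault_in` was the whole
segment length: a three-page segment faults pages 0 and 2, while a copy
that only crosses into page 1 never finds it faulted and retries forever.
Passing exactly `bytes`, as the patch does, makes the faulted range match
the copied range.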
[patch 00/41] Buffered write deadlock fix and new aops for 2.6.22-rc2-mm1
Hi,

This is a resync of the new aops patches against 2.6.22-rc2-mm1.

Only one more conversion is broken this time, so we're doing OK. The AFFS
compile is broken due to cont_prepare_write disappearing and me not
bringing the conversion patch up to date (which I won't do again until
something happens with this patchset -- it's only AFFS!). Reiser4 is
broken because it lost filemap_copy_from_user (it's deadlocky as well,
yay!).

Still unfortunately missing the OCFS2 and GFS2 conversions, which allowed
us to remove a lot of code -- I won't ask the maintainers to redo them
either until the patchset gets somewhere.

The highlight of this release is the reiserfs conversion, and the removal
of the reiserfs-specific generic_cont_expand helper. Also fixed a bug in
my pagecache directory conversions.

Please merge?

--
[patch 01/41] mm: revert KERNEL_DS buffered write optimisation
Revert the patch from Neil Brown to optimise NFSD writev handling.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Cc: Neil Brown [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c | 32 +---
 1 file changed, 13 insertions(+), 19 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1948,27 +1948,21 @@ generic_file_buffered_write(struct kiocb
 		/* Limit the size of the copy to the caller's write size */
 		bytes = min(bytes, count);
 
-		/* We only need to worry about prefaulting when writes are from
-		 * user-space.  NFSd uses vfs_writev with several non-aligned
-		 * segments in the vector, and limiting to one segment a time is
-		 * a noticeable performance for re-write
+		/*
+		 * Limit the size of the copy to that of the current segment,
+		 * because fault_in_pages_readable() doesn't know how to walk
+		 * segments.
 		 */
-		if (!segment_eq(get_fs(), KERNEL_DS)) {
-			/*
-			 * Limit the size of the copy to that of the current
-			 * segment, because fault_in_pages_readable() doesn't
-			 * know how to walk segments.
-			 */
-			bytes = min(bytes, cur_iov->iov_len - iov_base);
+		bytes = min(bytes, cur_iov->iov_len - iov_base);
+
+		/*
+		 * Bring in the user page that we will copy from _first_.
+		 * Otherwise there's a nasty deadlock on copying from the
+		 * same page as we're writing to, without it being marked
+		 * up-to-date.
+		 */
+		fault_in_pages_readable(buf, bytes);
 
-			/*
-			 * Bring in the user page that we will copy from
-			 * _first_. Otherwise there's a nasty deadlock on
-			 * copying from the same page as we're writing to,
-			 * without it being marked up-to-date.
-			 */
-			fault_in_pages_readable(buf, bytes);
-		}
 		page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
 		if (!page) {
 			status = -ENOMEM;

--
[patch 13/41] mm: restore KERNEL_DS optimisations
Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
path. This may be a pretty questionable gain in most cases, especially
after the legacy 2copy write path is removed, but it doesn't cost much.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2123,7 +2123,7 @@ static ssize_t generic_perform_write_2co
 		 * cannot take a pagefault with the destination page locked.
 		 * So pin the source page to copy it.
 		 */
-		if (!PageUptodate(page)) {
+		if (!PageUptodate(page) && !segment_eq(get_fs(), KERNEL_DS)) {
 			unlock_page(page);
 
 			src_page = alloc_page(GFP_KERNEL);
@@ -2248,6 +2248,13 @@ static ssize_t generic_perform_write(str
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	long status = 0;
 	ssize_t written = 0;
+	unsigned int flags = 0;
+
+	/*
+	 * Copies from kernel address space cannot fail (NFSD is a big user).
+	 */
+	if (segment_eq(get_fs(), KERNEL_DS))
+		flags |= AOP_FLAG_UNINTERRUPTIBLE;
 
 	do {
 		struct page *page;
@@ -2279,7 +2286,7 @@ again:
 			break;
 		}
 
-		status = a_ops->write_begin(file, mapping, pos, bytes, 0,
+		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
 		if (unlikely(status))
 			break;

--
[patch 19/41] xfs convert to new aops.
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/xfs/linux-2.6/xfs_aops.c | 19 ---
 fs/xfs/linux-2.6/xfs_lrw.c  | 35 ---
 2 files changed, 24 insertions(+), 30 deletions(-)

Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
@@ -1479,13 +1479,18 @@ xfs_vm_direct_IO(
 }
 
 STATIC int
-xfs_vm_prepare_write(
+xfs_vm_write_begin(
 	struct file		*file,
-	struct page		*page,
-	unsigned int		from,
-	unsigned int		to)
+	struct address_space	*mapping,
+	loff_t			pos,
+	unsigned		len,
+	unsigned		flags,
+	struct page		**pagep,
+	void			**fsdata)
 {
-	return block_prepare_write(page, from, to, xfs_get_blocks);
+	*pagep = NULL;
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				xfs_get_blocks);
 }
 
 STATIC sector_t
@@ -1539,8 +1544,8 @@ const struct address_space_operations xf
 	.sync_page		= block_sync_page,
 	.releasepage		= xfs_vm_releasepage,
 	.invalidatepage		= xfs_vm_invalidatepage,
-	.prepare_write		= xfs_vm_prepare_write,
-	.commit_write		= generic_commit_write,
+	.write_begin		= xfs_vm_write_begin,
+	.write_end		= generic_write_end,
 	.bmap			= xfs_vm_bmap,
 	.direct_IO		= xfs_vm_direct_IO,
 	.migratepage		= buffer_migrate_page,
Index: linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_lrw.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
@@ -134,45 +134,34 @@ xfs_iozero(
 	loff_t			pos,	/* offset in file */
 	size_t			count)	/* size of data to zero */
 {
-	unsigned		bytes;
 	struct page		*page;
 	struct address_space	*mapping;
 	int			status;
 
 	mapping = ip->i_mapping;
 	do {
-		unsigned long index, offset;
+		unsigned offset, bytes;
+		void *fsdata;
 
 		offset = (pos & (PAGE_CACHE_SIZE - 1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
 
-		status = -ENOMEM;
-		page = grab_cache_page(mapping, index);
-		if (!page)
-			break;
-
-		status = mapping->a_ops->prepare_write(NULL, page, offset,
-							offset + bytes);
+		status = pagecache_write_begin(NULL, mapping, pos, bytes,
+					AOP_FLAG_UNINTERRUPTIBLE,
+					&page, &fsdata);
 		if (status)
-			goto unlock;
+			break;
 
 		zero_user_page(page, offset, bytes, KM_USER0);
 
-		status = mapping->a_ops->commit_write(NULL, page, offset,
-							offset + bytes);
-		if (!status) {
-			pos += bytes;
-			count -= bytes;
-		}
-
-unlock:
-		unlock_page(page);
-		page_cache_release(page);
-		if (status)
-			break;
+		status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
+					page, fsdata);
+		WARN_ON(status <= 0); /* can't return less than zero! */
+		pos += bytes;
+		count -= bytes;
+		status = 0;
 	} while (count);
 
 	return (-status);

--
[patch 16/41] ext2 convert to new aops.
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/ext2/dir.c   | 56 ++--
 fs/ext2/ext2.h  |  3 +++
 fs/ext2/inode.c | 24 +---
 3 files changed, 54 insertions(+), 29 deletions(-)

Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -726,18 +726,21 @@ ext2_readpages(struct file *file, struct
 	return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
 }
 
-static int
-ext2_prepare_write(struct file *file, struct page *page,
-			unsigned from, unsigned to)
+int __ext2_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page,from,to,ext2_get_block);
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+					ext2_get_block);
 }
 
 static int
-ext2_nobh_prepare_write(struct file *file, struct page *page,
-			unsigned from, unsigned to)
+ext2_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
 {
-	return nobh_prepare_write(page,from,to,ext2_get_block);
+	*pagep = NULL;
+	return __ext2_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
 }
 
 static int ext2_nobh_writepage(struct page *page,
@@ -773,8 +776,8 @@ const struct address_space_operations ex
 	.readpages		= ext2_readpages,
 	.writepage		= ext2_writepage,
 	.sync_page		= block_sync_page,
-	.prepare_write		= ext2_prepare_write,
-	.commit_write		= generic_commit_write,
+	.write_begin		= ext2_write_begin,
+	.write_end		= generic_write_end,
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
@@ -791,8 +794,7 @@ const struct address_space_operations ex
 	.readpages		= ext2_readpages,
 	.writepage		= ext2_nobh_writepage,
 	.sync_page		= block_sync_page,
-	.prepare_write		= ext2_nobh_prepare_write,
-	.commit_write		= nobh_commit_write,
+	/* XXX: todo */
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
Index: linux-2.6/fs/ext2/dir.c
===================================================================
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -22,7 +22,9 @@
  */
 
 #include "ext2.h"
+#include <linux/buffer_head.h>
 #include <linux/pagemap.h>
+#include <linux/swap.h>
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
 
@@ -61,16 +63,26 @@ ext2_last_byte(struct inode *inode, unsi
 	return last_byte;
 }
 
-static int ext2_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-	struct inode *dir = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
 	int err = 0;
+
 	dir->i_version++;
-	page->mapping->a_ops->commit_write(NULL, page, from, to);
+	block_write_end(NULL, mapping, pos, len, len, page, NULL);
+
+	if (pos+len > dir->i_size) {
+		i_size_write(dir, pos+len);
+		mark_inode_dirty(dir);
+	}
+
 	if (IS_DIRSYNC(dir))
 		err = write_one_page(page, 1);
 	else
 		unlock_page(page);
+	mark_page_accessed(page);
+
 	return err;
 }
 
@@ -412,16 +424,18 @@ ino_t ext2_inode_by_name(struct inode *
 void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
 			struct page *page, struct inode *inode)
 {
-	unsigned from = (char *) de - (char *) page_address(page);
-	unsigned to = from + le16_to_cpu(de->rec_len);
+	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+			(char *) de - (char *) page_address(page);
+	unsigned len = le16_to_cpu(de->rec_len);
 	int err;
 
 	lock_page(page);
-	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+	err = __ext2_write_begin(NULL, page->mapping, pos, len,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
 	BUG_ON(err);
 	de->inode = cpu_to_le32(inode->i_ino);
-	ext2_set_de_type (de, inode);
-	err = ext2_commit_chunk(page, from, to);
+	ext2_set_de_type(de, inode);
+	err = ext2_commit_chunk(page, pos, len);
 	ext2_put_page(page);
[patch 30/41] reiserfs use generic_cont_expand_simple
From: Vladimir Saveliev [EMAIL PROTECTED]

This patch makes reiserfs use AOP_FLAG_CONT_EXPAND in order to get rid of
the special generic_cont_expand routine.

Signed-off-by: Vladimir Saveliev [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---

 fs/reiserfs/inode.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c
+++ linux-2.6/fs/reiserfs/inode.c
@@ -2562,13 +2562,20 @@ static int reiserfs_write_begin(struct f
 	int ret;
 	int old_ref = 0;
 
+	inode = mapping->host;
+	*fsdata = 0;
+	if (flags & AOP_FLAG_CONT_EXPAND &&
+	    (pos & (inode->i_sb->s_blocksize - 1)) == 0) {
+		pos ++;
+		*fsdata = (void *)(unsigned long)flags;
+	}
+
 	index = pos >> PAGE_CACHE_SHIFT;
 	page = __grab_cache_page(mapping, index);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
 
-	inode = mapping->host;
 	reiserfs_wait_on_write_block(inode->i_sb);
 	fix_tail_page_for_writing(page);
 	if (reiserfs_transaction_running(inode->i_sb)) {
@@ -2678,6 +2685,8 @@ static int reiserfs_write_end(struct fil
 	struct reiserfs_transaction_handle *th;
 	unsigned start;
 
+	if ((unsigned long)fsdata & AOP_FLAG_CONT_EXPAND)
+		pos ++;
 	reiserfs_wait_on_write_block(inode->i_sb);
 
 	if (reiserfs_transaction_running(inode->i_sb))
@@ -3066,7 +3075,7 @@ int reiserfs_setattr(struct dentry *dent
 	}
 
 	/* fill in hole pointers in the expanding truncate case. */
 	if (attr->ia_size > inode->i_size) {
-		error = generic_cont_expand(inode, attr->ia_size);
+		error = generic_cont_expand_simple(inode, attr->ia_size);
 		if (REISERFS_I(inode)->i_prealloc_count > 0) {
 			int err;
 			struct reiserfs_transaction_handle th;
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[patch 34/41] fuse convert to new aops.
[mszeredi] - don't send zero length write requests - it is not legal for the filesystem to return with zero written bytes Signed-off-by: Nick Piggin [EMAIL PROTECTED] Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] fs/fuse/file.c | 48 +--- 1 file changed, 33 insertions(+), 15 deletions(-) Index: linux-2.6/fs/fuse/file.c === --- linux-2.6.orig/fs/fuse/file.c +++ linux-2.6/fs/fuse/file.c @@ -444,22 +444,25 @@ static size_t fuse_send_write(struct fus return outarg.size; } -static int fuse_prepare_write(struct file *file, struct page *page, - unsigned offset, unsigned to) -{ - /* No op */ +static int fuse_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + pgoff_t index = pos PAGE_CACHE_SHIFT; + + *pagep = __grab_cache_page(mapping, index); + if (!*pagep) + return -ENOMEM; return 0; } -static int fuse_commit_write(struct file *file, struct page *page, -unsigned offset, unsigned to) +static int fuse_buffered_write(struct file *file, struct inode *inode, + loff_t pos, unsigned count, struct page *page) { int err; size_t nres; - unsigned count = to - offset; - struct inode *inode = page-mapping-host; struct fuse_conn *fc = get_fuse_conn(inode); - loff_t pos = page_offset(page) + offset; + unsigned offset = pos (PAGE_CACHE_SIZE - 1); struct fuse_req *req; if (is_bad_inode(inode)) @@ -475,20 +478,35 @@ static int fuse_commit_write(struct file nres = fuse_send_write(req, file, inode, pos, count); err = req-out.h.error; fuse_put_request(fc, req); - if (!err nres != count) + if (!err !nres) err = -EIO; if (!err) { - pos += count; + pos += nres; spin_lock(fc-lock); if (pos inode-i_size) i_size_write(inode, pos); spin_unlock(fc-lock); - if (offset == 0 to == PAGE_CACHE_SIZE) + if (count == PAGE_CACHE_SIZE) SetPageUptodate(page); } fuse_invalidate_attr(inode); - return err; + return err ? 
err : nres; +} + +static int fuse_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) +{ + struct inode *inode = mapping-host; + int res = 0; + + if (copied) + res = fuse_buffered_write(file, inode, pos, copied, page); + + unlock_page(page); + page_cache_release(page); + return res; } static void fuse_release_user_pages(struct fuse_req *req, int write) @@ -819,8 +837,8 @@ static const struct file_operations fuse static const struct address_space_operations fuse_file_aops = { .readpage = fuse_readpage, - .prepare_write = fuse_prepare_write, - .commit_write = fuse_commit_write, + .write_begin= fuse_write_begin, + .write_end = fuse_write_end, .readpages = fuse_readpages, .set_page_dirty = fuse_set_page_dirty, .bmap = fuse_bmap, -- - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 14/41] implement simple fs aops
Implement new aops for some of the simpler filesystems. Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] fs/configfs/inode.c |4 ++-- fs/hugetlbfs/inode.c | 16 ++-- fs/ramfs/file-mmu.c |4 ++-- fs/ramfs/file-nommu.c |4 ++-- fs/sysfs/inode.c |4 ++-- mm/shmem.c| 35 --- 6 files changed, 46 insertions(+), 21 deletions(-) Index: linux-2.6/mm/shmem.c === --- linux-2.6.orig/mm/shmem.c +++ linux-2.6/mm/shmem.c @@ -1107,7 +1107,7 @@ static int shmem_getpage(struct inode *i * Normally, filepage is NULL on entry, and either found * uptodate immediately, or allocated and zeroed, or read * in under swappage, which is then assigned to filepage. -* But shmem_prepare_write passes in a locked filepage, +* But shmem_write_begin passes in a locked filepage, * which may be found not uptodate by other callers too, * and may need to be copied from the swappage read in. */ @@ -1452,14 +1452,35 @@ static const struct inode_operations shm static const struct inode_operations shmem_symlink_inline_operations; /* - * Normally tmpfs makes no use of shmem_prepare_write, but it + * Normally tmpfs makes no use of shmem_write_begin, but it * lets a tmpfs file be used read-write below the loop driver. 
*/ static int -shmem_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to) +shmem_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + struct inode *inode = mapping-host; + pgoff_t index = pos PAGE_CACHE_SHIFT; + *pagep = NULL; + return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL); +} + +static int +shmem_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) { - struct inode *inode = page-mapping-host; - return shmem_getpage(inode, page-index, page, SGP_WRITE, NULL); + struct inode *inode = mapping-host; + + set_page_dirty(page); + mark_page_accessed(page); + page_cache_release(page); + + if (pos+copied inode-i_size) + i_size_write(inode, pos+copied); + + return copied; } static ssize_t @@ -2353,8 +2374,8 @@ static const struct address_space_operat .writepage = shmem_writepage, .set_page_dirty = __set_page_dirty_no_writeback, #ifdef CONFIG_TMPFS - .prepare_write = shmem_prepare_write, - .commit_write = simple_commit_write, + .write_begin= shmem_write_begin, + .write_end = shmem_write_end, #endif .migratepage= migrate_page, }; Index: linux-2.6/fs/configfs/inode.c === --- linux-2.6.orig/fs/configfs/inode.c +++ linux-2.6/fs/configfs/inode.c @@ -41,8 +41,8 @@ extern struct super_block * configfs_sb; static const struct address_space_operations configfs_aops = { .readpage = simple_readpage, - .prepare_write = simple_prepare_write, - .commit_write = simple_commit_write + .write_begin= simple_write_begin, + .write_end = simple_write_end, }; static struct backing_dev_info configfs_backing_dev_info = { Index: linux-2.6/fs/sysfs/inode.c === --- linux-2.6.orig/fs/sysfs/inode.c +++ linux-2.6/fs/sysfs/inode.c @@ -21,8 +21,8 @@ extern struct super_block * sysfs_sb; static const struct address_space_operations sysfs_aops = { .readpage = simple_readpage, - 
.prepare_write = simple_prepare_write, - .commit_write = simple_commit_write + .write_begin= simple_write_begin, + .write_end = simple_write_end, }; static struct backing_dev_info sysfs_backing_dev_info = { Index: linux-2.6/fs/ramfs/file-mmu.c === --- linux-2.6.orig/fs/ramfs/file-mmu.c +++ linux-2.6/fs/ramfs/file-mmu.c @@ -29,8 +29,8 @@ const struct address_space_operations ramfs_aops = { .readpage = simple_readpage, - .prepare_write = simple_prepare_write, - .commit_write = simple_commit_write, + .write_begin= simple_write_begin, + .write_end = simple_write_end, .set_page_dirty = __set_page_dirty_no_writeback, }; Index: linux-2.6/fs/ramfs/file-nommu.c === --- linux-2.6.orig/fs/ramfs/file-nommu.c +++ linux-2.6/fs/ramfs/file-nommu.c @@ -29,8 +29,8 @@ static int ramfs_nommu_setattr(struct de const struct address_space_operations
[patch 32/41] nfs convert to new aops.
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Acked-by: Trond Myklebust [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---

 fs/nfs/file.c | 49 ++++++++++++++++++++++++++++++++-----------------
 1 file changed, 36 insertions(+), 13 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -283,27 +283,50 @@ nfs_fsync(struct file *file, struct dent
 }
 
 /*
- * This does the real work of the write. The generic routine has
- * allocated the page, locked it, done all the page alignment stuff
- * calculations etc. Now we should just copy the data from user
- * space and write it back to the real medium..
+ * This does the real work of the write. We must allocate and lock the
+ * page to be sent back to the generic routine, which then copies the
+ * data from user space.
  *
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
  */
-static int nfs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
-{
-	return nfs_flush_incompatible(file, page);
+static int nfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	int ret;
+	pgoff_t index;
+	struct page *page;
+	index = pos >> PAGE_CACHE_SHIFT;
+
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
+
+	ret = nfs_flush_incompatible(file, page);
+	if (ret) {
+		unlock_page(page);
+		page_cache_release(page);
+	}
+	return ret;
 }
 
-static int nfs_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int nfs_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	long status;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+	int status;
 
 	lock_kernel();
-	status = nfs_updatepage(file, page, offset, to-offset);
+	status = nfs_updatepage(file, page, offset, copied);
 	unlock_kernel();
 
-	return status;
+	unlock_page(page);
+	page_cache_release(page);
+
+	return status < 0 ? status : copied;
 }
 
 static void nfs_invalidate_page(struct page *page, unsigned long offset)
@@ -331,8 +354,8 @@ const struct address_space_operations nf
 	.set_page_dirty = nfs_set_page_dirty,
 	.writepage = nfs_writepage,
 	.writepages = nfs_writepages,
-	.prepare_write = nfs_prepare_write,
-	.commit_write = nfs_commit_write,
+	.write_begin = nfs_write_begin,
+	.write_end = nfs_write_end,
 	.invalidatepage = nfs_invalidate_page,
 	.releasepage = nfs_release_page,
 #ifdef CONFIG_NFS_DIRECTIO
[patch 15/41] block_dev convert to new aops.
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---

 fs/block_dev.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -380,14 +380,26 @@ static int blkdev_readpage(struct file *
 	return block_read_full_page(page, blkdev_get_block);
 }
 
-static int blkdev_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-	return block_prepare_write(page, from, to, blkdev_get_block);
+static int blkdev_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				blkdev_get_block);
 }
 
-static int blkdev_commit_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int blkdev_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	return block_commit_write(page, from, to);
+	int ret;
+	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return ret;
 }
 
 /*
@@ -1331,8 +1343,8 @@ const struct address_space_operations de
 	.readpage	= blkdev_readpage,
 	.writepage	= blkdev_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= blkdev_prepare_write,
-	.commit_write	= blkdev_commit_write,
+	.write_begin	= blkdev_write_begin,
+	.write_end	= blkdev_write_end,
 	.writepages	= generic_writepages,
 	.direct_IO	= blkdev_direct_IO,
};
[patch 17/41] ext3 convert to new aops.
Cc: [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] Various fixes and improvements Signed-off-by: Badari Pulavarty [EMAIL PROTECTED] fs/ext3/inode.c | 136 1 file changed, 88 insertions(+), 48 deletions(-) Index: linux-2.6/fs/ext3/inode.c === --- linux-2.6.orig/fs/ext3/inode.c +++ linux-2.6/fs/ext3/inode.c @@ -1147,51 +1147,68 @@ static int do_journal_get_write_access(h return ext3_journal_get_write_access(handle, bh); } -static int ext3_prepare_write(struct file *file, struct page *page, - unsigned from, unsigned to) +static int ext3_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { - struct inode *inode = page-mapping-host; + struct inode *inode = mapping-host; int ret, needed_blocks = ext3_writepage_trans_blocks(inode); handle_t *handle; int retries = 0; + struct page *page; + pgoff_t index; + unsigned from, to; + + index = pos PAGE_CACHE_SHIFT; + from = pos (PAGE_CACHE_SIZE - 1); + to = from + len; retry: + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + *pagep = page; + handle = ext3_journal_start(inode, needed_blocks); if (IS_ERR(handle)) { + unlock_page(page); + page_cache_release(page); ret = PTR_ERR(handle); goto out; } - if (test_opt(inode-i_sb, NOBH) ext3_should_writeback_data(inode)) - ret = nobh_prepare_write(page, from, to, ext3_get_block); - else - ret = block_prepare_write(page, from, to, ext3_get_block); + ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata, + ext3_get_block); if (ret) - goto prepare_write_failed; + goto write_begin_failed; if (ext3_should_journal_data(inode)) { ret = walk_page_buffers(handle, page_buffers(page), from, to, NULL, do_journal_get_write_access); } -prepare_write_failed: - if (ret) +write_begin_failed: + if (ret) { ext3_journal_stop(handle); + unlock_page(page); + page_cache_release(page); + } if (ret == -ENOSPC 
ext3_should_retry_alloc(inode-i_sb, retries)) goto retry; out: return ret; } + int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh) { int err = journal_dirty_data(handle, bh); if (err) ext3_journal_abort_handle(__FUNCTION__, __FUNCTION__, - bh, handle,err); + bh, handle, err); return err; } -/* For commit_write() in data=journal mode */ -static int commit_write_fn(handle_t *handle, struct buffer_head *bh) +/* For write_end() in data=journal mode */ +static int write_end_fn(handle_t *handle, struct buffer_head *bh) { if (!buffer_mapped(bh) || buffer_freed(bh)) return 0; @@ -1206,78 +1223,100 @@ static int commit_write_fn(handle_t *han * ext3 never places buffers on inode-i_mapping-private_list. metadata * buffers are managed internally. */ -static int ext3_ordered_commit_write(struct file *file, struct page *page, -unsigned from, unsigned to) +static int ext3_ordered_write_end(struct file *file, + struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) { handle_t *handle = ext3_journal_current_handle(); - struct inode *inode = page-mapping-host; + struct inode *inode = file-f_mapping-host; + unsigned from, to; int ret = 0, ret2; + from = pos (PAGE_CACHE_SIZE - 1); + to = from + len; + ret = walk_page_buffers(handle, page_buffers(page), from, to, NULL, ext3_journal_dirty_data); if (ret == 0) { /* -* generic_commit_write() will run mark_inode_dirty() if i_size +* generic_write_end() will run mark_inode_dirty() if i_size * changes. So let's piggyback the i_disksize mark_inode_dirty * into that. */ loff_t new_i_size; - new_i_size = ((loff_t)page-index PAGE_CACHE_SHIFT) + to; + new_i_size = pos + copied; if (new_i_size EXT3_I(inode)-i_disksize)
[patch 26/41] bfs convert to new aops.
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---

 fs/bfs/file.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/bfs/file.c
===================================================================
--- linux-2.6.orig/fs/bfs/file.c
+++ linux-2.6/fs/bfs/file.c
@@ -145,9 +145,13 @@ static int bfs_readpage(struct file *fil
 	return block_read_full_page(page, bfs_get_block);
 }
 
-static int bfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int bfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page, from, to, bfs_get_block);
+	*pagep = NULL;
+	return block_write_begin(file, mapping, pos, len, flags,
+				pagep, fsdata, bfs_get_block);
 }
 
 static sector_t bfs_bmap(struct address_space *mapping, sector_t block)
@@ -159,8 +163,8 @@ const struct address_space_operations bf
 	.readpage	= bfs_readpage,
 	.writepage	= bfs_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= bfs_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= bfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= bfs_bmap,
};
[patch 38/41] udf convert to new aops.
Convert udf to new aops. This also appears to fix pagecache corruption in
udf_adinicb_commit_write -- the page was marked uptodate when it was not.
It also fixes the silly setup where prepare_write did a kmap() to be used
later in commit_write(): just do kmap_atomic() in write_end(). Use libfs
helpers to make this easier.

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---

 fs/udf/file.c  | 32 +++++++++++++-------------------
 fs/udf/inode.c | 11 +++++++----
 2 files changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/udf/file.c
===================================================================
--- linux-2.6.orig/fs/udf/file.c
+++ linux-2.6/fs/udf/file.c
@@ -74,34 +74,28 @@ static int udf_adinicb_writepage(struct
 	return 0;
 }
 
-static int udf_adinicb_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int udf_adinicb_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
-	kmap(page);
-	return 0;
-}
-
-static int udf_adinicb_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
-{
-	struct inode *inode = page->mapping->host;
-	char *kaddr = page_address(page);
+	struct inode *inode = mapping->host;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+	char *kaddr;
 
+	kaddr = kmap_atomic(page, KM_USER0);
 	memcpy(UDF_I_DATA(inode) + UDF_I_LENEATTR(inode) + offset,
-		kaddr + offset, to - offset);
-	mark_inode_dirty(inode);
-	SetPageUptodate(page);
-	kunmap(page);
-	/* only one page here */
-	if (to > inode->i_size)
-		inode->i_size = to;
-	return 0;
+		kaddr + offset, copied);
+	kunmap_atomic(kaddr, KM_USER0);
+
+	return simple_write_end(file, mapping, pos, len, copied, page, fsdata);
 }
 
 const struct address_space_operations udf_adinicb_aops = {
 	.readpage	= udf_adinicb_readpage,
 	.writepage	= udf_adinicb_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= udf_adinicb_prepare_write,
-	.commit_write	= udf_adinicb_commit_write,
+	.write_begin	= simple_write_begin,
+	.write_end	= udf_adinicb_write_end,
 };
 
 static ssize_t udf_file_aio_write(struct kiocb *iocb, const struct iovec *iov,

Index: linux-2.6/fs/udf/inode.c
===================================================================
--- linux-2.6.orig/fs/udf/inode.c
+++ linux-2.6/fs/udf/inode.c
@@ -123,9 +123,12 @@ static int udf_readpage(struct file *fil
 	return block_read_full_page(page, udf_get_block);
 }
 
-static int udf_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int udf_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return block_prepare_write(page, from, to, udf_get_block);
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				udf_get_block);
 }
 
 static sector_t udf_bmap(struct address_space *mapping, sector_t block)
@@ -137,8 +140,8 @@ const struct address_space_operations ud
 	.readpage	= udf_readpage,
 	.writepage	= udf_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= udf_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= udf_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= udf_bmap,
};
[patch 37/41] ufs convert to new aops.
Cc: [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] fs/ufs/dir.c | 56 +--- fs/ufs/inode.c | 23 +++ 2 files changed, 56 insertions(+), 23 deletions(-) Index: linux-2.6/fs/ufs/inode.c === --- linux-2.6.orig/fs/ufs/inode.c +++ linux-2.6/fs/ufs/inode.c @@ -558,24 +558,39 @@ static int ufs_writepage(struct page *pa { return block_write_full_page(page,ufs_getfrag_block,wbc); } + static int ufs_readpage(struct file *file, struct page *page) { return block_read_full_page(page,ufs_getfrag_block); } -static int ufs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) + +int __ufs_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { - return block_prepare_write(page,from,to,ufs_getfrag_block); + return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata, + ufs_getfrag_block); } + +static int ufs_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + *pagep = NULL; + return __ufs_write_begin(file, mapping, pos, len, flags, pagep, fsdata); +} + static sector_t ufs_bmap(struct address_space *mapping, sector_t block) { return generic_block_bmap(mapping,block,ufs_getfrag_block); } + const struct address_space_operations ufs_aops = { .readpage = ufs_readpage, .writepage = ufs_writepage, .sync_page = block_sync_page, - .prepare_write = ufs_prepare_write, - .commit_write = generic_commit_write, + .write_begin = ufs_write_begin, + .write_end = generic_write_end, .bmap = ufs_bmap }; Index: linux-2.6/fs/ufs/dir.c === --- linux-2.6.orig/fs/ufs/dir.c +++ linux-2.6/fs/ufs/dir.c @@ -19,6 +19,7 @@ #include linux/time.h #include linux/fs.h #include linux/ufs_fs.h +#include linux/swap.h #include swab.h #include util.h @@ -38,16 +39,23 @@ static inline int ufs_match(struct super return !memcmp(name, 
de-d_name, len); } -static int ufs_commit_chunk(struct page *page, unsigned from, unsigned to) +static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len) { - struct inode *dir = page-mapping-host; + struct address_space *mapping = page-mapping; + struct inode *dir = mapping-host; int err = 0; + dir-i_version++; - page-mapping-a_ops-commit_write(NULL, page, from, to); + block_write_end(NULL, mapping, pos, len, len, page, NULL); + if (pos+len dir-i_size) { + i_size_write(dir, pos+len); + mark_inode_dirty(dir); + } if (IS_DIRSYNC(dir)) err = write_one_page(page, 1); else unlock_page(page); + mark_page_accessed(page); return err; } @@ -81,16 +89,20 @@ ino_t ufs_inode_by_name(struct inode *di void ufs_set_link(struct inode *dir, struct ufs_dir_entry *de, struct page *page, struct inode *inode) { - unsigned from = (char *) de - (char *) page_address(page); - unsigned to = from + fs16_to_cpu(dir-i_sb, de-d_reclen); + loff_t pos = (page-index PAGE_CACHE_SHIFT) + + (char *) de - (char *) page_address(page); + unsigned len = fs16_to_cpu(dir-i_sb, de-d_reclen); int err; lock_page(page); - err = page-mapping-a_ops-prepare_write(NULL, page, from, to); + err = __ufs_write_begin(NULL, page-mapping, pos, len, + AOP_FLAG_UNINTERRUPTIBLE, page, NULL); BUG_ON(err); + de-d_ino = cpu_to_fs32(dir-i_sb, inode-i_ino); ufs_set_de_type(dir-i_sb, de, inode-i_mode); - err = ufs_commit_chunk(page, from, to); + + err = ufs_commit_chunk(page, pos, len); ufs_put_page(page); dir-i_mtime = dir-i_ctime = CURRENT_TIME_SEC; mark_inode_dirty(dir); @@ -312,7 +324,7 @@ int ufs_add_link(struct dentry *dentry, unsigned long npages = ufs_dir_pages(dir); unsigned long n; char *kaddr; - unsigned from, to; + loff_t pos; int err; UFSD(ENTER, name %s, namelen %u\n, name, namelen); @@ -367,9 +379,10 @@ int ufs_add_link(struct dentry *dentry, return -EINVAL; got_it: - from = (char*)de - (char*)page_address(page); - to = from + rec_len; - err = page-mapping-a_ops-prepare_write(NULL, page, from, to); 
+ pos = (page-index PAGE_CACHE_SHIFT) + + (char*)de - (char*)page_address(page); +
[patch 18/41] ext4 convert to new aops.
Cc: [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Convert ext4 to use write_begin()/write_end() methods. Signed-off-by: Badari Pulavarty [EMAIL PROTECTED] fs/ext4/inode.c | 147 +++- 1 file changed, 93 insertions(+), 54 deletions(-) Index: linux-2.6/fs/ext4/inode.c === --- linux-2.6.orig/fs/ext4/inode.c +++ linux-2.6/fs/ext4/inode.c @@ -1146,34 +1146,50 @@ static int do_journal_get_write_access(h return ext4_journal_get_write_access(handle, bh); } -static int ext4_prepare_write(struct file *file, struct page *page, - unsigned from, unsigned to) +static int ext4_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { - struct inode *inode = page-mapping-host; + struct inode *inode = mapping-host; int ret, needed_blocks = ext4_writepage_trans_blocks(inode); handle_t *handle; int retries = 0; + struct page *page; + pgoff_t index; + unsigned from, to; + + index = pos PAGE_CACHE_SHIFT; + from = pos (PAGE_CACHE_SIZE - 1); + to = from + len; retry: - handle = ext4_journal_start(inode, needed_blocks); - if (IS_ERR(handle)) { - ret = PTR_ERR(handle); - goto out; + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + *pagep = page; + + handle = ext4_journal_start(inode, needed_blocks); + if (IS_ERR(handle)) { + unlock_page(page); + page_cache_release(page); + ret = PTR_ERR(handle); + goto out; } - if (test_opt(inode-i_sb, NOBH) ext4_should_writeback_data(inode)) - ret = nobh_prepare_write(page, from, to, ext4_get_block); - else - ret = block_prepare_write(page, from, to, ext4_get_block); - if (ret) - goto prepare_write_failed; - if (ext4_should_journal_data(inode)) { + ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata, + ext4_get_block); + + if (!ret ext4_should_journal_data(inode)) { ret = walk_page_buffers(handle, page_buffers(page), from, to, NULL, do_journal_get_write_access); } -prepare_write_failed: - if (ret) + + if 
(ret) { ext4_journal_stop(handle); + unlock_page(page); + page_cache_release(page); + } + if (ret == -ENOSPC ext4_should_retry_alloc(inode-i_sb, retries)) goto retry; out: @@ -1185,12 +1201,12 @@ int ext4_journal_dirty_data(handle_t *ha int err = jbd2_journal_dirty_data(handle, bh); if (err) ext4_journal_abort_handle(__FUNCTION__, __FUNCTION__, - bh, handle,err); + bh, handle, err); return err; } -/* For commit_write() in data=journal mode */ -static int commit_write_fn(handle_t *handle, struct buffer_head *bh) +/* For write_end() in data=journal mode */ +static int write_end_fn(handle_t *handle, struct buffer_head *bh) { if (!buffer_mapped(bh) || buffer_freed(bh)) return 0; @@ -1205,78 +1221,100 @@ static int commit_write_fn(handle_t *han * ext4 never places buffers on inode-i_mapping-private_list. metadata * buffers are managed internally. */ -static int ext4_ordered_commit_write(struct file *file, struct page *page, -unsigned from, unsigned to) +static int ext4_ordered_write_end(struct file *file, + struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) { handle_t *handle = ext4_journal_current_handle(); - struct inode *inode = page-mapping-host; + struct inode *inode = file-f_mapping-host; + unsigned from, to; int ret = 0, ret2; + from = pos (PAGE_CACHE_SIZE - 1); + to = from + len; + ret = walk_page_buffers(handle, page_buffers(page), from, to, NULL, ext4_journal_dirty_data); if (ret == 0) { /* -* generic_commit_write() will run mark_inode_dirty() if i_size +* generic_write_end() will run mark_inode_dirty() if i_size * changes. So let's piggyback the i_disksize mark_inode_dirty * into that. */ loff_t new_i_size; - new_i_size = ((loff_t)page-index PAGE_CACHE_SHIFT) + to; +
[patch 04/41] mm: clean up buffered write code
From: Andrew Morton [EMAIL PROTECTED] Rename some variables and fix some types. Cc: Linux Memory Management [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Andrew Morton [EMAIL PROTECTED] Signed-off-by: Nick Piggin [EMAIL PROTECTED] mm/filemap.c | 35 ++- 1 file changed, 18 insertions(+), 17 deletions(-) Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1912,16 +1912,15 @@ generic_file_buffered_write(struct kiocb size_t count, ssize_t written) { struct file *file = iocb-ki_filp; - struct address_space * mapping = file-f_mapping; + struct address_space *mapping = file-f_mapping; const struct address_space_operations *a_ops = mapping-a_ops; struct inode*inode = mapping-host; longstatus = 0; struct page *page; struct page *cached_page = NULL; - size_t bytes; struct pagevec lru_pvec; const struct iovec *cur_iov = iov; /* current iovec */ - size_t iov_base = 0; /* offset in the current iovec */ + size_t iov_offset = 0;/* offset in the current iovec */ char __user *buf; pagevec_init(lru_pvec, 0); @@ -1932,31 +1931,33 @@ generic_file_buffered_write(struct kiocb if (likely(nr_segs == 1)) buf = iov-iov_base + written; else { - filemap_set_next_iovec(cur_iov, iov_base, written); - buf = cur_iov-iov_base + iov_base; + filemap_set_next_iovec(cur_iov, iov_offset, written); + buf = cur_iov-iov_base + iov_offset; } do { - unsigned long index; - unsigned long offset; - unsigned long maxlen; - size_t copied; + pgoff_t index; /* Pagecache index for current page */ + unsigned long offset; /* Offset into pagecache page */ + unsigned long maxlen; /* Bytes remaining in current iovec */ + size_t bytes; /* Bytes to write to page */ + size_t copied; /* Bytes copied from user */ - offset = (pos (PAGE_CACHE_SIZE -1)); /* Within page */ + offset = (pos (PAGE_CACHE_SIZE - 1)); index = pos PAGE_CACHE_SHIFT; bytes = PAGE_CACHE_SIZE - offset; if (bytes count) bytes = count; + maxlen = cur_iov-iov_len - iov_offset; 
+ if (maxlen bytes) + maxlen = bytes; + /* * Bring in the user page that we will copy from _first_. * Otherwise there's a nasty deadlock on copying from the * same page as we're writing to, without it being marked * up-to-date. */ - maxlen = cur_iov-iov_len - iov_base; - if (maxlen bytes) - maxlen = bytes; fault_in_pages_readable(buf, maxlen); page = __grab_cache_page(mapping,index,cached_page,lru_pvec); @@ -1987,7 +1988,7 @@ generic_file_buffered_write(struct kiocb buf, bytes); else copied = filemap_copy_from_user_iovec(page, offset, - cur_iov, iov_base, bytes); + cur_iov, iov_offset, bytes); flush_dcache_page(page); status = a_ops-commit_write(file, page, offset, offset+bytes); if (status == AOP_TRUNCATED_PAGE) { @@ -2005,12 +2006,12 @@ generic_file_buffered_write(struct kiocb buf += status; if (unlikely(nr_segs 1)) { filemap_set_next_iovec(cur_iov, - iov_base, status); + iov_offset, status); if (count) buf = cur_iov-iov_base + - iov_base; + iov_offset; } else { - iov_base += status; + iov_offset += status; } } } -- - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 29/41] reiserfs convert to new aops.
From: Vladimir Saveliev [EMAIL PROTECTED] Convert reiserfs to new aops Signed-off-by: Vladimir Saveliev [EMAIL PROTECTED] Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- --- fs/reiserfs/inode.c | 177 +--- fs/reiserfs/ioctl.c | 10 +- fs/reiserfs/xattr.c | 16 +++- 3 files changed, 184 insertions(+), 19 deletions(-) Index: linux-2.6/fs/reiserfs/inode.c === --- linux-2.6.orig/fs/reiserfs/inode.c +++ linux-2.6/fs/reiserfs/inode.c @@ -17,11 +17,12 @@ #include linux/mpage.h #include linux/writeback.h #include linux/quotaops.h +#include linux/swap.h -static int reiserfs_commit_write(struct file *f, struct page *page, -unsigned from, unsigned to); -static int reiserfs_prepare_write(struct file *f, struct page *page, - unsigned from, unsigned to); +int reiserfs_commit_write(struct file *f, struct page *page, + unsigned from, unsigned to); +int reiserfs_prepare_write(struct file *f, struct page *page, + unsigned from, unsigned to); void reiserfs_delete_inode(struct inode *inode) { @@ -2550,8 +2551,71 @@ static int reiserfs_writepage(struct pag return reiserfs_write_full_page(page, wbc); } -static int reiserfs_prepare_write(struct file *f, struct page *page, - unsigned from, unsigned to) +static int reiserfs_write_begin(struct file *file, + struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + struct inode *inode; + struct page *page; + pgoff_t index; + int ret; + int old_ref = 0; + + index = pos PAGE_CACHE_SHIFT; + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + *pagep = page; + + inode = mapping-host; + reiserfs_wait_on_write_block(inode-i_sb); + fix_tail_page_for_writing(page); + if (reiserfs_transaction_running(inode-i_sb)) { + struct reiserfs_transaction_handle *th; + th = (struct reiserfs_transaction_handle *)current- + journal_info; + BUG_ON(!th-t_refcount); + BUG_ON(!th-t_trans_id); + old_ref = th-t_refcount; + th-t_refcount++; + } + ret = block_write_begin(file, mapping, pos, 
len, flags, pagep, fsdata, + reiserfs_get_block); + if (ret reiserfs_transaction_running(inode-i_sb)) { + struct reiserfs_transaction_handle *th = current-journal_info; + /* this gets a little ugly. If reiserfs_get_block returned an +* error and left a transacstion running, we've got to close it, +* and we've got to free handle if it was a persistent transaction. +* +* But, if we had nested into an existing transaction, we need +* to just drop the ref count on the handle. +* +* If old_ref == 0, the transaction is from reiserfs_get_block, +* and it was a persistent trans. Otherwise, it was nested above. +*/ + if (th-t_refcount old_ref) { + if (old_ref) + th-t_refcount--; + else { + int err; + reiserfs_write_lock(inode-i_sb); + err = reiserfs_end_persistent_transaction(th); + reiserfs_write_unlock(inode-i_sb); + if (err) + ret = err; + } + } + } + if (ret) { + unlock_page(page); + page_cache_release(page); + } + return ret; +} + +int reiserfs_prepare_write(struct file *f, struct page *page, + unsigned from, unsigned to) { struct inode *inode = page-mapping-host; int ret; @@ -2604,8 +2668,101 @@ static sector_t reiserfs_aop_bmap(struct return generic_block_bmap(as, block, reiserfs_bmap); } -static int reiserfs_commit_write(struct file *f, struct page *page, -unsigned from, unsigned to) +static int reiserfs_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) +{ + struct inode *inode = page-mapping-host; + int ret = 0; + int update_sd = 0; + struct reiserfs_transaction_handle *th; + unsigned start; + + + reiserfs_wait_on_write_block(inode-i_sb); +
[patch 33/41] smb convert to new aops.
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/smbfs/file.c | 34 +++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 9 deletions(-)

Index: linux-2.6/fs/smbfs/file.c
===================================================================
--- linux-2.6.orig/fs/smbfs/file.c
+++ linux-2.6/fs/smbfs/file.c
@@ -291,29 +291,45 @@ out:
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
  */
-static int smb_prepare_write(struct file *file, struct page *page,
-			     unsigned offset, unsigned to)
-{
+static int smb_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	*pagep = __grab_cache_page(mapping, index);
+	if (!*pagep)
+		return -ENOMEM;
 	return 0;
 }

-static int smb_commit_write(struct file *file, struct page *page,
-			    unsigned offset, unsigned to)
+static int smb_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
 {
 	int status;
+	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);

-	status = -EFAULT;
 	lock_kernel();
-	status = smb_updatepage(file, page, offset, to-offset);
+	status = smb_updatepage(file, page, offset, copied);
 	unlock_kernel();
+
+	if (!status) {
+		if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
+			SetPageUptodate(page);
+		status = copied;
+	}
+
+	unlock_page(page);
+	page_cache_release(page);
+
 	return status;
 }

 const struct address_space_operations smb_file_aops = {
 	.readpage = smb_readpage,
 	.writepage = smb_writepage,
-	.prepare_write = smb_prepare_write,
-	.commit_write = smb_commit_write
+	.write_begin = smb_write_begin,
+	.write_end = smb_write_end,
 };

 /*
--
[patch 10/41] mm: buffered write iterator
Add an iterator data structure to operate over an iovec. Add usercopy operators needed by generic_file_buffered_write, and convert that function over. Cc: Linux Memory Management [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] include/linux/fs.h | 33 mm/filemap.c | 144 +++-- mm/filemap.h | 103 - 3 files changed, 150 insertions(+), 130 deletions(-) Index: linux-2.6/include/linux/fs.h === --- linux-2.6.orig/include/linux/fs.h +++ linux-2.6/include/linux/fs.h @@ -416,6 +416,39 @@ struct page; struct address_space; struct writeback_control; +struct iov_iter { + const struct iovec *iov; + unsigned long nr_segs; + size_t iov_offset; + size_t count; +}; + +size_t iov_iter_copy_from_user_atomic(struct page *page, + struct iov_iter *i, unsigned long offset, size_t bytes); +size_t iov_iter_copy_from_user(struct page *page, + struct iov_iter *i, unsigned long offset, size_t bytes); +void iov_iter_advance(struct iov_iter *i, size_t bytes); +int iov_iter_fault_in_readable(struct iov_iter *i); +size_t iov_iter_single_seg_count(struct iov_iter *i); + +static inline void iov_iter_init(struct iov_iter *i, + const struct iovec *iov, unsigned long nr_segs, + size_t count, size_t written) +{ + i-iov = iov; + i-nr_segs = nr_segs; + i-iov_offset = 0; + i-count = count + written; + + iov_iter_advance(i, written); +} + +static inline size_t iov_iter_count(struct iov_iter *i) +{ + return i-count; +} + + struct address_space_operations { int (*writepage)(struct page *page, struct writeback_control *wbc); int (*readpage)(struct file *, struct page *); Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -30,7 +30,7 @@ #include linux/security.h #include linux/syscalls.h #include linux/cpuset.h -#include filemap.h +#include linux/hardirq.h /* for BUG_ON(!in_atomic()) only */ #include internal.h /* @@ -1707,8 +1707,7 @@ int remove_suid(struct dentry *dentry) } 
EXPORT_SYMBOL(remove_suid); -size_t -__filemap_copy_from_user_iovec_inatomic(char *vaddr, +static size_t __iovec_copy_from_user_inatomic(char *vaddr, const struct iovec *iov, size_t base, size_t bytes) { size_t copied = 0, left = 0; @@ -1731,6 +1730,110 @@ __filemap_copy_from_user_iovec_inatomic( } /* + * Copy as much as we can into the page and return the number of bytes which + * were sucessfully copied. If a fault is encountered then return the number of + * bytes which were copied. + */ +size_t iov_iter_copy_from_user_atomic(struct page *page, + struct iov_iter *i, unsigned long offset, size_t bytes) +{ + char *kaddr; + size_t copied; + + BUG_ON(!in_atomic()); + kaddr = kmap_atomic(page, KM_USER0); + if (likely(i-nr_segs == 1)) { + int left; + char __user *buf = i-iov-iov_base + i-iov_offset; + left = __copy_from_user_inatomic_nocache(kaddr + offset, + buf, bytes); + copied = bytes - left; + } else { + copied = __iovec_copy_from_user_inatomic(kaddr + offset, + i-iov, i-iov_offset, bytes); + } + kunmap_atomic(kaddr, KM_USER0); + + return copied; +} + +/* + * This has the same sideeffects and return value as + * iov_iter_copy_from_user_atomic(). + * The difference is that it attempts to resolve faults. + * Page must not be locked. 
+ */ +size_t iov_iter_copy_from_user(struct page *page, + struct iov_iter *i, unsigned long offset, size_t bytes) +{ + char *kaddr; + size_t copied; + + kaddr = kmap(page); + if (likely(i-nr_segs == 1)) { + int left; + char __user *buf = i-iov-iov_base + i-iov_offset; + left = __copy_from_user_nocache(kaddr + offset, buf, bytes); + copied = bytes - left; + } else { + copied = __iovec_copy_from_user_inatomic(kaddr + offset, + i-iov, i-iov_offset, bytes); + } + kunmap(page); + return copied; +} + +static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes) +{ + if (likely(i-nr_segs == 1)) { + i-iov_offset += bytes; + } else { + const struct iovec *iov = i-iov; + size_t base = i-iov_offset; + + while (bytes) { + int copy = min(bytes, iov-iov_len - base); + + bytes -= copy; +
[patch 31/41] With reiserfs no longer using the weird generic_cont_expand, remove it completely.
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---

 fs/buffer.c                 | 20 --------------------
 include/linux/buffer_head.h |  1 -
 2 files changed, 21 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2189,25 +2189,6 @@ out:
 	return err;
 }

-int generic_cont_expand(struct inode *inode, loff_t size)
-{
-	unsigned int offset;
-
-	offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */
-
-	/* ugh. in prepare/commit_write, if from==to==start of block, we
-	 * skip the prepare. make sure we never send an offset for the start
-	 * of a block.
-	 * XXX: actually, this should be handled in those filesystems by
-	 * checking for the AOP_FLAG_CONT_EXPAND flag.
-	 */
-	if ((offset & (inode->i_sb->s_blocksize - 1)) == 0) {
-		/* caller must handle this extra byte. */
-		size++;
-	}
-	return generic_cont_expand_simple(inode, size);
-}
-
 int cont_expand_zero(struct file *file, struct address_space *mapping,
 			loff_t pos, loff_t *bytes)
 {
@@ -3135,7 +3116,6 @@ EXPORT_SYMBOL(file_fsync);
 EXPORT_SYMBOL(fsync_bdev);
 EXPORT_SYMBOL(generic_block_bmap);
 EXPORT_SYMBOL(generic_commit_write);
-EXPORT_SYMBOL(generic_cont_expand);
 EXPORT_SYMBOL(generic_cont_expand_simple);
 EXPORT_SYMBOL(init_buffer);
 EXPORT_SYMBOL(invalidate_bdev);

Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -217,7 +217,6 @@ int block_prepare_write(struct page*, un
 int cont_write_begin(struct file *, struct address_space *, loff_t,
 			unsigned, unsigned, struct page **, void **,
 			get_block_t *, loff_t *);
-int generic_cont_expand(struct inode *inode, loff_t size);
 int generic_cont_expand_simple(struct inode *inode, loff_t size);
 int block_commit_write(struct page *page, unsigned from, unsigned to);
 void block_sync_page(struct page *);
--
[patch 09/41] mm: fix pagecache write deadlocks
Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.     lock_page
3.     prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.        mmap_sem(r)
7.           handle_mm_fault
8.              lock_page (filemap_nopage)
9.     commit_write
10.    unlock_page

a. sys_munmap / sys_mlock / others
b.    mmap_sem(w)
c.       make_pages_present
d.          get_user_pages
e.             handle_mm_fault
f.                lock_page (filemap_nopage)

2,8      - recursive deadlock if the page is the same
2,8;2,8  - ABBA deadlock if the page is different
2,6;b,f  - ABBA deadlock if the page is the same

The solution is as follows:

1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the
    uncopied tail of the destination. The destination is already uptodate, so
    we can commit_write the full length even if there was a partial copy: it
    does not matter that the tail was not modified, because if it is dirtied
    and written back to disk it will not cause any problems (uptodate *means*
    that the destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping and the lack of a readable
    mapping.

2.  If we find the destination page is non-uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

(Also, rename maxlen to seglen, because it was confusing.)

This increases the CPU/memory copy cost by almost 50% on the affected
workloads. That will be solved by introducing a new set of pagecache write
aops in a subsequent patch.
Cc: Linux Memory Management [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] include/linux/pagemap.h | 11 +++- mm/filemap.c| 122 2 files changed, 112 insertions(+), 21 deletions(-) Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1900,11 +1900,12 @@ generic_file_buffered_write(struct kiocb filemap_set_next_iovec(cur_iov, nr_segs, iov_offset, written); do { + struct page *src_page; struct page *page; pgoff_t index; /* Pagecache index for current page */ unsigned long offset; /* Offset into pagecache page */ - unsigned long maxlen; /* Bytes remaining in current iovec */ - size_t bytes; /* Bytes to write to page */ + unsigned long seglen; /* Bytes remaining in current iovec */ + unsigned long bytes;/* Bytes to write to page */ size_t copied; /* Bytes copied from user */ buf = cur_iov-iov_base + iov_offset; @@ -1914,20 +1915,30 @@ generic_file_buffered_write(struct kiocb if (bytes count) bytes = count; - maxlen = cur_iov-iov_len - iov_offset; - if (maxlen bytes) - maxlen = bytes; + /* +* a non-NULL src_page indicates that we're doing the +* copy via get_user_pages and kmap. +*/ + src_page = NULL; + + seglen = cur_iov-iov_len - iov_offset; + if (seglen bytes) + seglen = bytes; -#ifndef CONFIG_DEBUG_VM /* * Bring in the user page that we will copy from _first_. * Otherwise there's a nasty deadlock on copying from the * same page as we're writing to, without it being marked * up-to-date. +* +* Not only is this an optimisation, but it is also required +* to check that the address is actually valid, when atomic +* usercopies are used, below. */ - fault_in_pages_readable(buf, maxlen); -#endif - + if (unlikely(fault_in_pages_readable(buf, seglen))) { + status = -EFAULT; + break; + } page = __grab_cache_page(mapping, index); if (!page) { @@ -1935,32 +1946,104 @@ generic_file_buffered_write(struct kiocb break; } + /* +* non-uptodate pages
[patch 07/41] mm: buffered write cleanup
Quite a bit of code is used in maintaining these cached pages that are probably pretty unlikely to get used. It would require a narrow race where the page is inserted concurrently while this process is allocating a page in order to create the spare page. Then a multi-page write into an uncached part of the file, to make use of it. Next, the buffered write path (and others) uses its own LRU pagevec when it should be just using the per-CPU LRU pagevec (which will cut down on both data and code size cacheline footprint). Also, these private LRU pagevecs are emptied after just a very short time, in contrast with the per-CPU pagevecs that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required to add the pages to pagecache for a bulk write (in 4K chunks). [this gets rid of some cond_resched() calls in readahead.c and mpage.c due to clashes in -mm. What put them there, and why? ] Cc: Linux Memory Management [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] fs/mpage.c | 10 --- mm/filemap.c | 145 ++--- mm/readahead.c | 24 +++-- 3 files changed, 66 insertions(+), 113 deletions(-) Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -666,27 +666,22 @@ EXPORT_SYMBOL(find_lock_page); struct page *find_or_create_page(struct address_space *mapping, unsigned long index, gfp_t gfp_mask) { - struct page *page, *cached_page = NULL; + struct page *page; int err; repeat: page = find_lock_page(mapping, index); if (!page) { - if (!cached_page) { - cached_page = - __page_cache_alloc(gfp_mask); - if (!cached_page) - return NULL; - } - err = add_to_page_cache_lru(cached_page, mapping, - index, gfp_mask); - if (!err) { - page = cached_page; - cached_page = NULL; - } else if (err == -EEXIST) - goto repeat; + page = __page_cache_alloc(gfp_mask); + if (!page) + return NULL; + err = add_to_page_cache_lru(page, mapping, index, gfp_mask); + if (unlikely(err)) { + 
page_cache_release(page); + page = NULL; + if (err == -EEXIST) + goto repeat; + } } - if (cached_page) - page_cache_release(cached_page); return page; } EXPORT_SYMBOL(find_or_create_page); @@ -883,11 +878,9 @@ void do_generic_mapping_read(struct addr unsigned long prev_index; unsigned int prev_offset; loff_t isize; - struct page *cached_page; int error; struct file_ra_state ra = *_ra; - cached_page = NULL; index = *ppos PAGE_CACHE_SHIFT; next_index = index; prev_index = ra.prev_index; @@ -1059,23 +1052,20 @@ no_cached_page: * Ok, it wasn't cached, so we need to create a new * page.. */ - if (!cached_page) { - cached_page = page_cache_alloc_cold(mapping); - if (!cached_page) { - desc-error = -ENOMEM; - goto out; - } + page = page_cache_alloc_cold(mapping); + if (!page) { + desc-error = -ENOMEM; + goto out; } - error = add_to_page_cache_lru(cached_page, mapping, + error = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL); if (error) { + page_cache_release(page); if (error == -EEXIST) goto find_page; desc-error = error; goto out; } - page = cached_page; - cached_page = NULL; goto readpage; } @@ -1084,8 +1074,6 @@ out: _ra-prev_index = prev_index; *ppos = ((loff_t) index PAGE_CACHE_SHIFT) + offset; - if (cached_page) - page_cache_release(cached_page); if (filp) file_accessed(filp); } @@ -1573,35 +1561,28 @@ static struct page *__read_cache_page(st int (*filler)(void *,struct page*), void *data) { - struct page *page, *cached_page = NULL; + struct page *page; int err; repeat:
[patch 28/41] reiserfs use generic write.
From: Vladimir Saveliev [EMAIL PROTECTED] Make reiserfs to write via generic routines. Original reiserfs write optimized for big writes is deadlock rone Signed-off-by: Vladimir Saveliev [EMAIL PROTECTED] Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- --- fs/reiserfs/file.c | 1240 - 1 file changed, 1 insertion(+), 1239 deletions(-) Index: linux-2.6/fs/reiserfs/file.c === --- linux-2.6.orig/fs/reiserfs/file.c +++ linux-2.6/fs/reiserfs/file.c @@ -153,608 +153,6 @@ static int reiserfs_sync_file(struct fil return (n_err 0) ? -EIO : 0; } -/* I really do not want to play with memory shortage right now, so - to simplify the code, we are not going to write more than this much pages at - a time. This still should considerably improve performance compared to 4k - at a time case. This is 32 pages of 4k size. */ -#define REISERFS_WRITE_PAGES_AT_A_TIME (128 * 1024) / PAGE_CACHE_SIZE - -/* Allocates blocks for a file to fulfil write request. - Maps all unmapped but prepared pages from the list. - Updates metadata with newly allocated blocknumbers as needed */ -static int reiserfs_allocate_blocks_for_region(struct reiserfs_transaction_handle *th, struct inode *inode,/* Inode we work with */ - loff_t pos, /* Writing position */ - int num_pages, /* number of pages write going - to touch */ - int write_bytes, /* amount of bytes to write */ - struct page **prepared_pages, /* array of - prepared pages - */ - int blocks_to_allocate /* Amount of blocks we - need to allocate to - fit the data into file -*/ -) -{ - struct cpu_key key; // cpu key of item that we are going to deal with - struct item_head *ih; // pointer to item head that we are going to deal with - struct buffer_head *bh; // Buffer head that contains items that we are going to deal with - __le32 *item; // pointer to item we are going to deal with - INITIALIZE_PATH(path); // path to item, that we are going to deal with. - b_blocknr_t *allocated_blocks; // Pointer to a place where allocated blocknumbers would be stored. 
- reiserfs_blocknr_hint_t hint; // hint structure for block allocator. - size_t res; // return value of various functions that we call. - int curr_block; // current block used to keep track of unmapped blocks. - int i; // loop counter - int itempos;// position in item - unsigned int from = (pos (PAGE_CACHE_SIZE - 1)); // writing position in - // first page - unsigned int to = ((pos + write_bytes - 1) (PAGE_CACHE_SIZE - 1)) + 1;/* last modified byte offset in last page */ - __u64 hole_size;// amount of blocks for a file hole, if it needed to be created. - int modifying_this_item = 0;// Flag for items traversal code to keep track - // of the fact that we already prepared - // current block for journal - int will_prealloc = 0; - RFALSE(!blocks_to_allocate, - green-9004: tried to allocate zero blocks?); - - /* only preallocate if this is a small write */ - if (REISERFS_I(inode)-i_prealloc_count || - (!(write_bytes (inode-i_sb-s_blocksize - 1)) -blocks_to_allocate -REISERFS_SB(inode-i_sb)-s_alloc_options.preallocsize)) - will_prealloc = - REISERFS_SB(inode-i_sb)-s_alloc_options.preallocsize; - - allocated_blocks = kmalloc((blocks_to_allocate + will_prealloc) * - sizeof(b_blocknr_t), GFP_NOFS); - if (!allocated_blocks) - return -ENOMEM; - - /* First we compose a key to point at the writing position, we want to do - that outside of any locking region. */ - make_cpu_key(key, inode, pos + 1, TYPE_ANY, 3 /*key length */ ); - - /* If we came here, it means we absolutely need to open a transaction, - since we need to allocate some blocks */ - reiserfs_write_lock(inode-i_sb); // Journaling stuff and we need that. - res = journal_begin(th, inode-i_sb, JOURNAL_PER_BALANCE_CNT * 3 + 1 + 2 *
[patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
These are intended to replace prepare_write and commit_write with more flexible alternatives that are also able to avoid the buffered write deadlock problems efficiently (which prepare_write is unable to do). Cc: Linux Memory Management [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] API design contributions, code review and fixes. Signed-off-by: Mark Fasheh [EMAIL PROTECTED] Documentation/filesystems/Locking |9 - Documentation/filesystems/vfs.txt | 45 +++ drivers/block/loop.c | 75 --- fs/buffer.c | 198 ++- fs/libfs.c| 44 +++ fs/namei.c| 47 +-- fs/revoked_inode.c| 14 ++ fs/splice.c | 70 +-- include/linux/buffer_head.h | 10 + include/linux/fs.h| 30 include/linux/pagemap.h |2 mm/filemap.c | 238 +- 12 files changed, 576 insertions(+), 206 deletions(-) Index: linux-2.6/include/linux/fs.h === --- linux-2.6.orig/include/linux/fs.h +++ linux-2.6/include/linux/fs.h @@ -409,6 +409,8 @@ enum positive_aop_returns { AOP_TRUNCATED_PAGE = 0x80001, }; +#define AOP_FLAG_UNINTERRUPTIBLE 0x0001 /* will not do a short write */ + /* * oh the beauties of C type declarations. 
*/ @@ -428,7 +430,7 @@ size_t iov_iter_copy_from_user_atomic(st size_t iov_iter_copy_from_user(struct page *page, struct iov_iter *i, unsigned long offset, size_t bytes); void iov_iter_advance(struct iov_iter *i, size_t bytes); -int iov_iter_fault_in_readable(struct iov_iter *i); +int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes); size_t iov_iter_single_seg_count(struct iov_iter *i); static inline void iov_iter_init(struct iov_iter *i, @@ -469,6 +471,14 @@ struct address_space_operations { */ int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); int (*commit_write)(struct file *, struct page *, unsigned, unsigned); + + int (*write_begin)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata); + int (*write_end)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata); + /* Unfortunately this kludge is needed for FIBMAP. Don't use it */ sector_t (*bmap)(struct address_space *, sector_t); void (*invalidatepage) (struct page *, unsigned long); @@ -483,6 +493,18 @@ struct address_space_operations { int (*launder_page) (struct page *); }; +/* + * pagecache_write_begin/pagecache_write_end must be used by general code + * to write into the pagecache. 
+ */ +int pagecache_write_begin(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata); + +int pagecache_write_end(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata); + struct backing_dev_info; struct address_space { struct inode*host; /* owner: inode, block_device */ @@ -1894,6 +1916,12 @@ extern int simple_prepare_write(struct f unsigned offset, unsigned to); extern int simple_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to); +extern int simple_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata); +extern int simple_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata); extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct nameidata *); extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *); Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1814,11 +1814,10 @@ void iov_iter_advance(struct iov_iter *i i-count -= bytes; } -int iov_iter_fault_in_readable(struct iov_iter *i) +int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes) { - size_t seglen = min(i-iov-iov_len - i-iov_offset, i-count); char __user *buf = i-iov-iov_base +
[patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6
From: Andrew Morton [EMAIL PROTECTED]

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c | 9 +--------
 mm/filemap.h | 4 ++--
 2 files changed, 3 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1969,12 +1969,6 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}

-		if (unlikely(bytes == 0)) {
-			status = 0;
-			copied = 0;
-			goto zero_length_segment;
-		}
-
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status)) {
 			loff_t isize = i_size_read(inode);
@@ -2004,8 +1998,7 @@ generic_file_buffered_write(struct kiocb
 			page_cache_release(page);
 			continue;
 		}
-zero_length_segment:
-		if (likely(copied >= 0)) {
+		if (likely(copied > 0)) {
 			if (!status)
 				status = copied;

Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
 	const struct iovec *iov = *iovp;
 	size_t base = *basep;

-	do {
+	while (bytes) {
 		int copy = min(bytes, iov->iov_len - base);

 		bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
 			iov++;
 			base = 0;
 		}
-	} while (bytes);
+	}

 	*iovp = iov;
 	*basep = base;
 }
--
[patch 11/41] fs: fix data-loss on error
New buffers against uptodate pages are simply marked uptodate, while the
buffer_new bit remains set. This causes error-case code to zero out parts of
those buffers because it thinks they contain stale data: wrong, they are
actually uptodate, so this is a data-loss situation.

Fix this by actually clearing buffer_new and marking the buffer dirty. It
makes sense to always clear buffer_new before setting a buffer uptodate.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/buffer.c | 2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1816,7 +1816,9 @@ static int __block_prepare_write(struct
 			unmap_underlying_metadata(bh->b_bdev,
 						bh->b_blocknr);
 			if (PageUptodate(page)) {
+				clear_buffer_new(bh);
 				set_buffer_uptodate(bh);
+				mark_buffer_dirty(bh);
 				continue;
 			}
 			if (block_end > to || block_start < from) {
--
[patch 27/41] qnx4 convert to new aops.
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/qnx4/inode.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/qnx4/inode.c
===================================================================
--- linux-2.6.orig/fs/qnx4/inode.c
+++ linux-2.6/fs/qnx4/inode.c
@@ -433,16 +433,21 @@ static int qnx4_writepage(struct page *p
 {
 	return block_write_full_page(page, qnx4_get_block, wbc);
 }
+
 static int qnx4_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page, qnx4_get_block);
 }
-static int qnx4_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
-{
-	struct qnx4_inode_info *qnx4_inode = qnx4_i(page->mapping->host);
-	return cont_prepare_write(page, from, to, qnx4_get_block,
-				  &qnx4_inode->mmu_private);
+
+static int qnx4_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	struct qnx4_inode_info *qnx4_inode = qnx4_i(mapping->host);
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				qnx4_get_block,
+				&qnx4_inode->mmu_private);
 }
 static sector_t qnx4_bmap(struct address_space *mapping, sector_t block)
 {
@@ -452,8 +457,8 @@ static const struct address_space_operat
 	.readpage	= qnx4_readpage,
 	.writepage	= qnx4_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= qnx4_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= qnx4_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= qnx4_bmap
 };
--
[patch 06/41] mm: trim more holes
If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
we may have failed the write operation despite prepare_write having
instantiated blocks past i_size. Fix this, and consolidate the trimming into
one place.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c | 80 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 40 insertions(+), 40 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1969,22 +1969,9 @@ generic_file_buffered_write(struct kiocb
 		}

 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
-		if (unlikely(status)) {
-			loff_t isize = i_size_read(inode);
+		if (unlikely(status))
+			goto fs_write_aop_error;

-			if (status != AOP_TRUNCATED_PAGE)
-				unlock_page(page);
-			page_cache_release(page);
-			if (status == AOP_TRUNCATED_PAGE)
-				continue;
-			/*
-			 * prepare_write() may have instantiated a few blocks
-			 * outside i_size.  Trim these off again.
-			 */
-			if (pos + bytes > isize)
-				vmtruncate(inode, isize);
-			break;
-		}
 		if (likely(nr_segs == 1))
 			copied = filemap_copy_from_user(page, offset,
 							buf, bytes);
@@ -1993,40 +1980,53 @@ generic_file_buffered_write(struct kiocb
 					cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
-		if (status == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
-			continue;
+		if (unlikely(status < 0 || status == AOP_TRUNCATED_PAGE))
+			goto fs_write_aop_error;
+		if (unlikely(copied != bytes)) {
+			status = -EFAULT;
+			goto fs_write_aop_error;
 		}
-		if (likely(copied > 0)) {
-			if (!status)
-				status = copied;
+		if (unlikely(status > 0)) /* filesystem did partial write */
+			copied = status;

-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-				if (unlikely(nr_segs > 1)) {
-					filemap_set_next_iovec(&cur_iov,
-							&iov_offset, status);
-					if (count)
-						buf = cur_iov->iov_base +
-							iov_offset;
-				} else {
-					iov_offset += status;
-				}
+		if (likely(copied > 0)) {
+			written += copied;
+			count -= copied;
+			pos += copied;
+			buf += copied;
+			if (unlikely(nr_segs > 1)) {
+				filemap_set_next_iovec(&cur_iov,
+						&iov_offset, copied);
+				if (count)
+					buf = cur_iov->iov_base + iov_offset;
+			} else {
+				iov_offset += copied;
 			}
 		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);
-		if (status < 0)
-			break;
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
+		continue;
+
+fs_write_aop_error:
+		if (status != AOP_TRUNCATED_PAGE)
+			unlock_page(page);
+		page_cache_release(page);
+
+		/*
+		 * prepare_write() may have instantiated a few blocks
+		 * outside i_size.  Trim these off again. Don't need
+		 * i_size_read because we hold i_mutex.
+		 */
+
[patch 05/41] mm: debug write deadlocks

Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
Makes the race much easier to hit. This is useful for demonstration and
testing purposes, but is removed in a subsequent patch.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c | 2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1952,6 +1952,7 @@ generic_file_buffered_write(struct kiocb
 		if (maxlen > bytes)
 			maxlen = bytes;

+#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -1959,6 +1960,7 @@ generic_file_buffered_write(struct kiocb
 		 * up-to-date.
 		 */
 		fault_in_pages_readable(buf, maxlen);
+#endif

 		page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
 		if (!page) {
[patch 24/41] hfsplus convert to new aops.
Cc: [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] fs/hfsplus/extents.c | 21 + fs/hfsplus/inode.c | 20 2 files changed, 21 insertions(+), 20 deletions(-) Index: linux-2.6/fs/hfsplus/inode.c === --- linux-2.6.orig/fs/hfsplus/inode.c +++ linux-2.6/fs/hfsplus/inode.c @@ -27,10 +27,14 @@ static int hfsplus_writepage(struct page return block_write_full_page(page, hfsplus_get_block, wbc); } -static int hfsplus_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) -{ - return cont_prepare_write(page, from, to, hfsplus_get_block, - HFSPLUS_I(page-mapping-host).phys_size); +static int hfsplus_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + *pagep = NULL; + return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata, + hfsplus_get_block, + HFSPLUS_I(mapping-host).phys_size); } static sector_t hfsplus_bmap(struct address_space *mapping, sector_t block) @@ -114,8 +118,8 @@ const struct address_space_operations hf .readpage = hfsplus_readpage, .writepage = hfsplus_writepage, .sync_page = block_sync_page, - .prepare_write = hfsplus_prepare_write, - .commit_write = generic_commit_write, + .write_begin= hfsplus_write_begin, + .write_end = generic_write_end, .bmap = hfsplus_bmap, .releasepage= hfsplus_releasepage, }; @@ -124,8 +128,8 @@ const struct address_space_operations hf .readpage = hfsplus_readpage, .writepage = hfsplus_writepage, .sync_page = block_sync_page, - .prepare_write = hfsplus_prepare_write, - .commit_write = generic_commit_write, + .write_begin= hfsplus_write_begin, + .write_end = generic_write_end, .bmap = hfsplus_bmap, .direct_IO = hfsplus_direct_IO, .writepages = hfsplus_writepages, Index: linux-2.6/fs/hfsplus/extents.c === --- linux-2.6.orig/fs/hfsplus/extents.c +++ linux-2.6/fs/hfsplus/extents.c @@ -443,21 +443,18 @@ void hfsplus_file_truncate(struct inode 
 	if (inode->i_size > HFSPLUS_I(inode).phys_size) {
 		struct address_space *mapping = inode->i_mapping;
 		struct page *page;
-		u32 size = inode->i_size - 1;
+		void *fsdata;
+		u32 size = inode->i_size;
 		int res;

-		page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-		if (!page)
-			return;
-		size &= PAGE_CACHE_SIZE - 1;
-		size++;
-		res = mapping->a_ops->prepare_write(NULL, page, size, size);
-		if (!res)
-			res = mapping->a_ops->commit_write(NULL, page, size, size);
+		res = pagecache_write_begin(NULL, mapping, size, 0,
+						AOP_FLAG_UNINTERRUPTIBLE,
+						&page, &fsdata);
 		if (res)
-			inode->i_size = HFSPLUS_I(inode).phys_size;
-		unlock_page(page);
-		page_cache_release(page);
+			return;
+		res = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
+		if (res < 0)
+			return;
 		mark_inode_dirty(inode);
 		return;
 	} else if (inode->i_size == HFSPLUS_I(inode).phys_size)
--
[patch 03/41] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83
From: Andrew Morton [EMAIL PROTECTED]

This patch fixed the following bug: when prefaulting in the pages in
generic_file_buffered_write(), we only faulted in the pages for the first
segment of the iovec. If the second or a successive segment described an
mmapping of the page into which we're write()ing, and that page is not
up-to-date, the fault handler tries to lock the already-locked page (to bring
it up to date) and deadlocks.

An exploit for this bug is in writev-deadlock-demo.c, in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz. (These
demos assume blocksize < PAGE_CACHE_SIZE).

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment. So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once. The other problem with the fix is that it didn't fix all the locking
problems.

insert numbers obtained via ext3-tools's writev-speed.c here

And apparently this change killed NFS overwrite performance because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1939,21 +1939,14 @@ generic_file_buffered_write(struct kiocb
 	do {
 		unsigned long index;
 		unsigned long offset;
+		unsigned long maxlen;
 		size_t copied;

 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
-
-		/* Limit the size of the copy to the caller's write size */
-		bytes = min(bytes, count);
-
-		/*
-		 * Limit the size of the copy to that of the current segment,
-		 * because fault_in_pages_readable() doesn't know how to walk
-		 * segments.
-		 */
-		bytes = min(bytes, cur_iov->iov_len - iov_base);
+		if (bytes > count)
+			bytes = count;

 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -1961,7 +1954,10 @@ generic_file_buffered_write(struct kiocb
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		fault_in_pages_readable(buf, bytes);
+		maxlen = cur_iov->iov_len - iov_base;
+		if (maxlen > bytes)
+			maxlen = bytes;
+		fault_in_pages_readable(buf, maxlen);

 		page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
 		if (!page) {
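The worst case called out in the changelog (one prepare_write/commit_write per iovec segment instead of per page) is easy to quantify with a back-of-envelope sketch. The sizes below are made up for illustration; only the ratio matters.

```python
# Cost comparison behind the revert: per-page vs per-segment
# prepare_write()/commit_write() calls for a writev() of many tiny segments.

PAGE_SIZE = 4096

def calls_per_page(total_bytes):
    """Pre-patch behaviour: one prepare/commit per page touched."""
    return -(-total_bytes // PAGE_SIZE)   # ceiling division

def calls_per_segment(segments):
    """Patched behaviour: at least one prepare/commit per iovec segment."""
    return len(segments)

segments = [64] * 1024        # writev() with 1024 segments of 64 bytes
total = sum(segments)         # 65536 bytes = 16 pages

# 16 calls before the patch vs 1024 after: the 64x blow-up that, per the
# changelog, also hurt NFS (a server round trip per prepare/commit pair).
```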
[patch 23/41] hfs convert to new aops.
Cc: [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] fs/hfs/extent.c | 19 --- fs/hfs/inode.c | 20 2 files changed, 20 insertions(+), 19 deletions(-) Index: linux-2.6/fs/hfs/inode.c === --- linux-2.6.orig/fs/hfs/inode.c +++ linux-2.6/fs/hfs/inode.c @@ -35,10 +35,14 @@ static int hfs_readpage(struct file *fil return block_read_full_page(page, hfs_get_block); } -static int hfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) -{ - return cont_prepare_write(page, from, to, hfs_get_block, - HFS_I(page-mapping-host)-phys_size); +static int hfs_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + *pagep = NULL; + return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata, + hfs_get_block, + HFS_I(mapping-host)-phys_size); } static sector_t hfs_bmap(struct address_space *mapping, sector_t block) @@ -119,8 +123,8 @@ const struct address_space_operations hf .readpage = hfs_readpage, .writepage = hfs_writepage, .sync_page = block_sync_page, - .prepare_write = hfs_prepare_write, - .commit_write = generic_commit_write, + .write_begin= hfs_write_begin, + .write_end = generic_write_end, .bmap = hfs_bmap, .releasepage= hfs_releasepage, }; @@ -129,8 +133,8 @@ const struct address_space_operations hf .readpage = hfs_readpage, .writepage = hfs_writepage, .sync_page = block_sync_page, - .prepare_write = hfs_prepare_write, - .commit_write = generic_commit_write, + .write_begin= hfs_write_begin, + .write_end = generic_write_end, .bmap = hfs_bmap, .direct_IO = hfs_direct_IO, .writepages = hfs_writepages, Index: linux-2.6/fs/hfs/extent.c === --- linux-2.6.orig/fs/hfs/extent.c +++ linux-2.6/fs/hfs/extent.c @@ -464,23 +464,20 @@ void hfs_file_truncate(struct inode *ino (long long)HFS_I(inode)-phys_size, inode-i_size); if (inode-i_size HFS_I(inode)-phys_size) { struct address_space *mapping = 
inode->i_mapping;
+		void *fsdata;
 		struct page *page;
 		int res;

+		/* XXX: Can use generic_cont_expand? */
 		size = inode->i_size - 1;
-		page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-		if (!page)
-			return;
-		size &= PAGE_CACHE_SIZE - 1;
-		size++;
-		res = mapping->a_ops->prepare_write(NULL, page, size, size);
-		if (!res)
-			res = mapping->a_ops->commit_write(NULL, page, size, size);
+		res = pagecache_write_begin(NULL, mapping, size+1, 0,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+		if (!res) {
+			res = pagecache_write_end(NULL, mapping, size+1, 0, 0,
+					page, fsdata);
+		}
 		if (res)
 			inode->i_size = HFS_I(inode)->phys_size;
-		unlock_page(page);
-		page_cache_release(page);
-
 		mark_inode_dirty(inode);
 		return;
 	} else if (inode->i_size == HFS_I(inode)->phys_size)
 		return;
--
[patch 35/41] hostfs convert to new aops.
This also gets rid of a lot of useless read_file stuff. And also optimises the full page write case by marking a !uptodate page uptodate. Cc: Jeff Dike [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] fs/hostfs/hostfs_kern.c | 70 +++- 1 file changed, 28 insertions(+), 42 deletions(-) Index: linux-2.6/fs/hostfs/hostfs_kern.c === --- linux-2.6.orig/fs/hostfs/hostfs_kern.c +++ linux-2.6/fs/hostfs/hostfs_kern.c @@ -466,56 +466,42 @@ int hostfs_readpage(struct file *file, s return err; } -int hostfs_prepare_write(struct file *file, struct page *page, -unsigned int from, unsigned int to) +int hostfs_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { - char *buffer; - long long start, tmp; - int err; + pgoff_t index = pos PAGE_CACHE_SHIFT; - start = (long long) page-index PAGE_CACHE_SHIFT; - buffer = kmap(page); - if(from != 0){ - tmp = start; - err = read_file(FILE_HOSTFS_I(file)-fd, tmp, buffer, - from); - if(err 0) goto out; - } - if(to != PAGE_CACHE_SIZE){ - start += to; - err = read_file(FILE_HOSTFS_I(file)-fd, start, buffer + to, - PAGE_CACHE_SIZE - to); - if(err 0) goto out; - } - err = 0; - out: - kunmap(page); - return err; + *pagep = __grab_cache_page(mapping, index); + if (!*pagep) + return -ENOMEM; + return 0; } -int hostfs_commit_write(struct file *file, struct page *page, unsigned from, -unsigned to) +int hostfs_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) { - struct address_space *mapping = page-mapping; struct inode *inode = mapping-host; - char *buffer; - long long start; - int err = 0; + void *buffer; + unsigned from = pos (PAGE_CACHE_SIZE - 1); + int err; - start = (((long long) page-index) PAGE_CACHE_SHIFT) + from; buffer = kmap(page); - err = write_file(FILE_HOSTFS_I(file)-fd, start, buffer + from, 
-				to - from);
-	if(err > 0) err = 0;
-
-	/* Actually, if !err, write_file has added to-from to start, so, despite
-	 * the appearance, we are comparing i_size against the _last_ written
-	 * location, as we should. */
+	err = write_file(FILE_HOSTFS_I(file)->fd, &pos, buffer + from, copied);
+	kunmap(page);

-	if(!err && (start > inode->i_size))
-		inode->i_size = start;
+	if (!PageUptodate(page) && err == PAGE_CACHE_SIZE)
+		SetPageUptodate(page);
+	unlock_page(page);
+	page_cache_release(page);
+
+	/* If err > 0, write_file has added err to pos, so we are comparing
+	 * i_size against the last byte written.
+	 */
+	if (err > 0 && (pos > inode->i_size))
+		inode->i_size = pos;

-	kunmap(page);
 	return err;
 }

@@ -523,8 +509,8 @@ static const struct address_space_operat
 	.writepage	= hostfs_writepage,
 	.readpage	= hostfs_readpage,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
-	.prepare_write	= hostfs_prepare_write,
-	.commit_write	= hostfs_commit_write
+	.write_begin	= hostfs_write_begin,
+	.write_end	= hostfs_write_end,
 };

 static int init_inode(struct inode *inode, struct dentry *dentry)
--
[patch 22/41] adfs convert to new aops.

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/adfs/inode.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6/fs/adfs/inode.c
===================================================================
--- linux-2.6.orig/fs/adfs/inode.c
+++ linux-2.6/fs/adfs/inode.c
@@ -61,10 +61,14 @@ static int adfs_readpage(struct file *fi
 	return block_read_full_page(page, adfs_get_block);
 }

-static int adfs_prepare_write(struct file *file, struct page *page, unsigned int from, unsigned int to)
+static int adfs_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
 {
-	return cont_prepare_write(page, from, to, adfs_get_block,
-		&ADFS_I(page->mapping->host)->mmu_private);
+	*pagep = NULL;
+	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				adfs_get_block,
+				&ADFS_I(mapping->host)->mmu_private);
 }

 static sector_t _adfs_bmap(struct address_space *mapping, sector_t block)
@@ -76,8 +80,8 @@ static const struct address_space_operat
 	.readpage	= adfs_readpage,
 	.writepage	= adfs_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= adfs_prepare_write,
-	.commit_write	= generic_commit_write,
+	.write_begin	= adfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= _adfs_bmap
 };
[patch 39/41] sysv convert to new aops.
Cc: [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] fs/sysv/dir.c | 51 +++ fs/sysv/itree.c | 23 +++ 2 files changed, 50 insertions(+), 24 deletions(-) Index: linux-2.6/fs/sysv/itree.c === --- linux-2.6.orig/fs/sysv/itree.c +++ linux-2.6/fs/sysv/itree.c @@ -453,23 +453,38 @@ static int sysv_writepage(struct page *p { return block_write_full_page(page,get_block,wbc); } + static int sysv_readpage(struct file *file, struct page *page) { return block_read_full_page(page,get_block); } -static int sysv_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) + +int __sysv_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { - return block_prepare_write(page,from,to,get_block); + return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata, + get_block); } + +static int sysv_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) +{ + *pagep = NULL; + return __sysv_write_begin(file, mapping, pos, len, flags, pagep, fsdata); +} + static sector_t sysv_bmap(struct address_space *mapping, sector_t block) { return generic_block_bmap(mapping,block,get_block); } + const struct address_space_operations sysv_aops = { .readpage = sysv_readpage, .writepage = sysv_writepage, .sync_page = block_sync_page, - .prepare_write = sysv_prepare_write, - .commit_write = generic_commit_write, + .write_begin = sysv_write_begin, + .write_end = generic_write_end, .bmap = sysv_bmap }; Index: linux-2.6/fs/sysv/dir.c === --- linux-2.6.orig/fs/sysv/dir.c +++ linux-2.6/fs/sysv/dir.c @@ -16,6 +16,7 @@ #include linux/pagemap.h #include linux/highmem.h #include linux/smp_lock.h +#include linux/swap.h #include sysv.h static int sysv_readdir(struct file *, void *, filldir_t); @@ -37,16 +38,22 @@ static inline unsigned long 
dir_pages(st return (inode-i_size+PAGE_CACHE_SIZE-1)PAGE_CACHE_SHIFT; } -static int dir_commit_chunk(struct page *page, unsigned from, unsigned to) +static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len) { - struct inode *dir = (struct inode *)page-mapping-host; + struct address_space *mapping = page-mapping; + struct inode *dir = mapping-host; int err = 0; - page-mapping-a_ops-commit_write(NULL, page, from, to); + block_write_end(NULL, mapping, pos, len, len, page, NULL); + if (pos+len dir-i_size) { + i_size_write(dir, pos+len); + mark_inode_dirty(dir); + } if (IS_DIRSYNC(dir)) err = write_one_page(page, 1); else unlock_page(page); + mark_page_accessed(page); return err; } @@ -186,7 +193,7 @@ int sysv_add_link(struct dentry *dentry, unsigned long npages = dir_pages(dir); unsigned long n; char *kaddr; - unsigned from, to; + loff_t pos; int err; /* We take care of directory expansion in the same loop */ @@ -212,16 +219,17 @@ int sysv_add_link(struct dentry *dentry, return -EINVAL; got_it: - from = (char*)de - (char*)page_address(page); - to = from + SYSV_DIRSIZE; + pos = (page-index PAGE_CACHE_SHIFT) + + (char*)de - (char*)page_address(page); lock_page(page); - err = page-mapping-a_ops-prepare_write(NULL, page, from, to); + err = __sysv_write_begin(NULL, page-mapping, pos, SYSV_DIRSIZE, + AOP_FLAG_UNINTERRUPTIBLE, page, NULL); if (err) goto out_unlock; memcpy (de-name, name, namelen); memset (de-name + namelen, 0, SYSV_DIRSIZE - namelen - 2); de-inode = cpu_to_fs16(SYSV_SB(inode-i_sb), inode-i_ino); - err = dir_commit_chunk(page, from, to); + err = dir_commit_chunk(page, pos, SYSV_DIRSIZE); dir-i_mtime = dir-i_ctime = CURRENT_TIME_SEC; mark_inode_dirty(dir); out_page: @@ -238,15 +246,15 @@ int sysv_delete_entry(struct sysv_dir_en struct address_space *mapping = page-mapping; struct inode *inode = (struct inode*)mapping-host; char *kaddr = (char*)page_address(page); - unsigned from = (char*)de - kaddr; - unsigned to = from + SYSV_DIRSIZE; + loff_t 
pos = (page-index PAGE_CACHE_SHIFT) + (char *)de - kaddr; int err; lock_page(page); - err =
[patch 36/41] jffs2 convert to new aops.
Cc: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Cc: Linux Filesystems linux-fsdevel@vger.kernel.org Signed-off-by: Nick Piggin [EMAIL PROTECTED] fs/jffs2/file.c | 105 +++- 1 file changed, 66 insertions(+), 39 deletions(-) Index: linux-2.6/fs/jffs2/file.c === --- linux-2.6.orig/fs/jffs2/file.c +++ linux-2.6/fs/jffs2/file.c @@ -19,10 +19,12 @@ #include linux/jffs2.h #include nodelist.h -static int jffs2_commit_write (struct file *filp, struct page *pg, - unsigned start, unsigned end); -static int jffs2_prepare_write (struct file *filp, struct page *pg, - unsigned start, unsigned end); +static int jffs2_write_end(struct file *filp, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *pg, void *fsdata); +static int jffs2_write_begin(struct file *filp, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata); static int jffs2_readpage (struct file *filp, struct page *pg); int jffs2_fsync(struct file *filp, struct dentry *dentry, int datasync) @@ -65,8 +67,8 @@ const struct inode_operations jffs2_file const struct address_space_operations jffs2_file_address_operations = { .readpage = jffs2_readpage, - .prepare_write =jffs2_prepare_write, - .commit_write = jffs2_commit_write + .write_begin = jffs2_write_begin, + .write_end =jffs2_write_end, }; static int jffs2_do_readpage_nolock (struct inode *inode, struct page *pg) @@ -119,15 +121,23 @@ static int jffs2_readpage (struct file * return ret; } -static int jffs2_prepare_write (struct file *filp, struct page *pg, - unsigned start, unsigned end) +static int jffs2_write_begin(struct file *filp, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { - struct inode *inode = pg-mapping-host; + struct page *pg; + struct inode *inode = mapping-host; struct jffs2_inode_info *f = JFFS2_INODE_INFO(inode); - uint32_t pageofs = pg-index PAGE_CACHE_SHIFT; + pgoff_t index = pos 
PAGE_CACHE_SHIFT; + uint32_t pageofs = pos (PAGE_CACHE_SIZE - 1); int ret = 0; - D1(printk(KERN_DEBUG jffs2_prepare_write()\n)); + pg = __grab_cache_page(mapping, index); + if (!pg) + return -ENOMEM; + *pagep = pg; + + D1(printk(KERN_DEBUG jffs2_write_begin()\n)); if (pageofs inode-i_size) { /* Make new hole frag from old EOF to new page */ @@ -142,7 +152,7 @@ static int jffs2_prepare_write (struct f ret = jffs2_reserve_space(c, sizeof(ri), alloc_len, ALLOC_NORMAL, JFFS2_SUMMARY_INODE_SIZE); if (ret) - return ret; + goto out_page; down(f-sem); memset(ri, 0, sizeof(ri)); @@ -172,7 +182,7 @@ static int jffs2_prepare_write (struct f ret = PTR_ERR(fn); jffs2_complete_reservation(c); up(f-sem); - return ret; + goto out_page; } ret = jffs2_add_full_dnode_to_inode(c, f, fn); if (f-metadata) { @@ -181,65 +191,79 @@ static int jffs2_prepare_write (struct f f-metadata = NULL; } if (ret) { - D1(printk(KERN_DEBUG Eep. add_full_dnode_to_inode() failed in prepare_write, returned %d\n, ret)); + D1(printk(KERN_DEBUG Eep. add_full_dnode_to_inode() failed in write_begin, returned %d\n, ret)); jffs2_mark_node_obsolete(c, fn-raw); jffs2_free_full_dnode(fn); jffs2_complete_reservation(c); up(f-sem); - return ret; + goto out_page; } jffs2_complete_reservation(c); inode-i_size = pageofs; up(f-sem); } - /* Read in the page if it wasn't already present, unless it's a whole page */ - if (!PageUptodate(pg) (start || end PAGE_CACHE_SIZE)) { + /* +* Read in the page if it wasn't already present. Cannot optimize away +* the whole page write case until jffs2_write_end can handle the +* case of a short-copy. +*/ + if (!PageUptodate(pg)) { down(f-sem); ret = jffs2_do_readpage_nolock(inode, pg); up(f-sem); + if (ret) +
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
2007/5/25, Neil Brown [EMAIL PROTECTED]:
> HOW DO MD or DM USE THIS
>
> 1/ striping devices. This includes md/raid0, md/linear, dm-linear, dm-stripe
>    and probably others. These devices can easily support blkdev_issue_flush
>    by simply calling blkdev_issue_flush on all component devices.

This ensures that all of the previous requests have been processed, but does
this guarantee they were successful? This might be too paranoid, but if I
understood the concept correctly, the success of a barrier request should
indicate the success of all previous requests between this barrier and the
last one.

> These devices would find it very hard to support BIO_RW_BARRIER. Doing this
> would require keeping track of all in-flight requests (which some, possibly
> all, of the above don't) and then:
> When a BIO_RW_BARRIER request arrives:
>    wait for all pending writes to complete
>    call blkdev_issue_flush on all devices
>    issue the barrier write to the target device(s) as BIO_RW_BARRIER,
>    if that is -EOPNOTSUP, re-issue, wait, flush.

I guess just keeping a count of submitted requests and errors since the last
barrier might be enough. As long as all of the underlying devices at least
support a flush, the dm device could pretend to support BIO_RW_BARRIER.

> dm-linear and dm-stripe simply pass the BIO_RW_BARRIER flag down, which
> means data may not be flushed correctly: the commit block might be written
> to one device before a preceding block is written to another device.

Hm, even worse: what if the barrier request accidentally ends up on a device
that does support barriers, while another one in the map doesn't? Would any
layer/fs above care to issue a flush call?

> I think the best approach for this class of devices is to return
> -EOPNOTSUPP. If the filesystem does the wait (which they all do already)
> and the blkdev_issue_flush (which is easy to add), they don't need to
> support BIO_RW_BARRIER.

Without any additional code these really should report -EOPNOTSUPP.
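The striping argument above (flush is easy, a real barrier is hard) can be sketched with a toy model. All class and method names are invented for illustration; the real code paths are in md/raid0, dm-stripe, and blkdev_issue_flush().

```python
# Toy model of a striped device: supporting a cache flush is just
# "flush every component and propagate any failure", with no ordering
# requirement between components.

class Component:
    def __init__(self):
        self.cache = []      # writes not yet on stable media
        self.media = []

    def write(self, blk):
        self.cache.append(blk)

    def flush(self):
        self.media.extend(self.cache)
        self.cache.clear()
        return 0             # 0 = success, like blkdev_issue_flush()

class Stripe:
    def __init__(self, n):
        self.parts = [Component() for _ in range(n)]

    def write(self, blk):
        # round-robin striping across components
        self.parts[blk % len(self.parts)].write(blk)

    def issue_flush(self):
        # the easy case from the quoted text: flush all components;
        # a non-zero return from any one means the previous writes
        # cannot be claimed stable
        return max(p.flush() for p in self.parts)

md = Stripe(2)
for blk in range(4):
    md.write(blk)
rc = md.issue_flush()
```

Note what the model does not do: a BIO_RW_BARRIER would additionally require draining in-flight requests and ordering the barrier write against them, which is exactly the hard part the thread discusses.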
If disaster strikes, there is no way to make assumptions about the real state
on disk.

> 2/ Mirror devices. This includes md/raid1 and dm-raid1. These devices can
>    trivially implement blkdev_issue_flush much like the striping devices,
>    and can support BIO_RW_BARRIER to some extent. md/raid1 currently tries.
>    I'm not sure about dm-raid1.

I fear this is even more broken than with linear and stripe. There is no code
to check the features of the underlying devices, and the request itself isn't
sent forward; privately built ones are (which do not have the barrier
flag)...

> 3/ Multipath devices

Requests are sent to the same device but on different paths. So at least with
them the chance of one path supporting barriers but not another one seems
small (as long as the paths do not use completely different transport
layers). But passing on a request with the barrier flag also doesn't seem to
be a good idea, since previous requests can arrive at the device later.

IMHO the best way to handle barriers for dm would be to add the sequence
described to the generic mapping layer of dm (before calling the target's
mapping function). There is already some sort of counting of in-flight
requests (suspend/resume needs that) and I guess the downgrade could also be
rather simple. If a flush call to the target (mapped device) fails, report
-EOPNOTSUPP and stay that way (until next boot).

> So: some questions to help encourage response:
> - Is the approach to barriers taken by md appropriate? Should dm do the
>   same?

Who will do that? If my assumption about barrier semantics is true, then md
also has to somehow make sure all previous requests have _successfully_
completed. In the mirror case I guess it is valid to report success if the
mirror itself is in a clean state, which means all previous requests (and the
barrier) were successful on at least one mirror half and this state can be
recovered.

Question to dm-devel: What do people there think of the possible generic
implementation in dm.c?
> - The comment above blkdev_issue_flush says "Caller must run
>   wait_for_completion() on its own". What does that mean?

Guess this means it initiates a flush but doesn't wait for completion. So the
caller must wait for the completion of the separate requests on its own,
doesn't it?

> - Are there other bits that we could handle better? BIO_RW_FAILFAST?
>   BIO_RW_SYNC? What exactly do they mean?

BIO_RW_FAILFAST: means the low-level driver shouldn't do much (or any) error
recovery. Mainly used by multipath targets to avoid long SCSI recovery. This
should just be propagated when passing requests on.

BIO_RW_SYNC: means this is a bio of a synchronous request. I don't know
whether there are more uses to it, but this at least causes queues to be
flushed immediately instead of waiting for more requests for a short time.
Should also just be passed on. Otherwise performance gets poor since
something above will
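The -EOPNOTSUPP downgrade discussed in this thread (wait, flush, plain write, flush) can be modelled in a few lines. All names are invented; the real constant lives in errno.h and the real sequencing in the filesystem/block layer.

```python
# Toy model of the barrier fallback: if a BIO_RW_BARRIER submission
# returns -EOPNOTSUPP, emulate it as wait + flush + plain write + flush.

EOPNOTSUPP = 95

class Disk:
    def __init__(self, supports_barrier):
        self.supports_barrier = supports_barrier
        self.cache, self.media = [], []

    def submit(self, blk, barrier=False):
        if barrier:
            if not self.supports_barrier:
                return -EOPNOTSUPP
            self.flush()                 # everything before the barrier ...
            self.media.append(blk)       # ... then the barrier write itself
            return 0
        self.cache.append(blk)           # normal write: may sit in cache
        return 0

    def flush(self):
        self.media.extend(self.cache)
        self.cache.clear()
        return 0

def write_commit_block(disk, blk):
    ret = disk.submit(blk, barrier=True)
    if ret == -EOPNOTSUPP:
        # (waiting for in-flight requests would happen here)
        disk.flush()
        disk.submit(blk)
        disk.flush()
    # either way: blk and everything submitted before it is now on media

for supports in (True, False):
    d = Disk(supports)
    d.submit(1)
    d.submit(2)
    write_commit_block(d, 99)
    assert d.media[-1] == 99 and set(d.media) == {1, 2, 99}
```

Both paths leave the commit block on media strictly after its predecessors, which is the property the filesystems in this thread rely on.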
Re: [patch 27/41] qnx4 convert to new aops.
On 2007-05-25 14:22:11, [EMAIL PROTECTED] wrote:
> Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Acked-by: Anders Larsen [EMAIL PROTECTED]

(although we might just as well do away with the 'write' methods completely,
since write-support is BROKEN anyway)

 Cheers
 Anders
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote:
> A barrier write will include a flush, but it may also use the FUA bit to
> ensure data is on platter. So the only situation where a fallback from a
> barrier to flush would be valid, is if the device lied and told you it
> could do FUA but it could not and that is the reason why the barrier
> write failed. If that is the case, the block layer should stop using FUA
> and fall back to flush-write-flush. And if it does that, then there's
> never a valid reason to switch from using barrier writes to
> blkdev_issue_flush() since both methods would either both work or both
> fail.

IIRC, the FUA bit only forces THAT request to hit the platter before it is completed; it does not flush any previous requests still sitting in the write-back queue. Because all IO before the barrier must be on the platter as well, setting the FUA bit on the barrier request means you don't have to follow it with a flush, but you still have to precede it with a flush.

> It's not block layer breakage, it's a device issue.

How isn't it block layer breakage? If the device does not support barriers, isn't it the job of the block layer (probably the scheduler) to fall back to flush-write-flush?
Re: [dm-devel] [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Neil Brown wrote:
> There is no guarantee that a device can support BIO_RW_BARRIER - it is
> always possible that a request will fail with EOPNOTSUPP.

Why is it not the job of the block layer to translate for broken devices and send them a flush/write/flush?

> These devices would find it very hard to support BIO_RW_BARRIER. Doing
> this would require keeping track of all in-flight requests (which some,
> possibly all, of the above don't) and then:

The device mapper keeps track of in-flight requests already. When switching tables it has to hold new requests and wait for in-flight requests to complete before switching to the new table. When it gets a barrier request it just needs to do the same thing, only not switch tables.

> I think the best approach for this class of devices is to return
> -EOPNOTSUPP. If the filesystem does the wait (which they all do already)
> and the blkdev_issue_flush (which is easy to add), they don't need to
> support BIO_RW_BARRIER.

Why? The personalities should just pass the BARRIER flag down to each underlying device, and the dm common code should wait for all in-flight io to complete before sending the barrier to the personality.

> For devices that don't support QUEUE_ORDERED_TAG (i.e. commands sent to
> the controller can be tagged as barriers), SCSI will use the
> SYNCHRONIZE_CACHE command to flush the cache after the barrier request
> (a bit like the filesystem calling blkdev_issue_flush, but at

Don't you have to flush the cache BEFORE the barrier to ensure that previous IO is committed first, THEN the barrier write?
[patch 1/2] i_version update - vfs part
Concerning the first part of the set, the i_version field of the inode structure has been reused. The field has been redefined, as the counter has to be 64 bits. The patch modifies the i_version field of the inode at the VFS layer. The i_version field becomes a 64-bit counter that is set on inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). The aim is to fulfill an NFSv4 requirement from RFC 3530:

   5.5. Mandatory Attributes - Definitions

   Name     #   DataType   Access   Description
   ___________________________________________________________________
   change   3   uint64     READ     A value created by the server that
                                    the client can use to determine if
                                    file data, directory contents or
                                    attributes of the object have been
                                    modified. The server may return the
                                    object's time_metadata attribute for
                                    this attribute's value but only if
                                    the filesystem object can not be
                                    updated more frequently than the
                                    resolution of time_metadata.

Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED]

Index: linux-2.6.22-rc2-ext4-1/fs/binfmt_misc.c
===================================================================
--- linux-2.6.22-rc2-ext4-1.orig/fs/binfmt_misc.c	2007-05-25 18:01:51.000000000 +0200
+++ linux-2.6.22-rc2-ext4-1/fs/binfmt_misc.c	2007-05-25 18:01:56.000000000 +0200
@@ -508,6 +508,7 @@
 		inode->i_blocks = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime =
 			current_fs_time(inode->i_sb);
+		inode->i_version = 1;
 	}
 	return inode;
 }
Index: linux-2.6.22-rc2-ext4-1/fs/libfs.c
===================================================================
--- linux-2.6.22-rc2-ext4-1.orig/fs/libfs.c	2007-05-25 18:01:51.000000000 +0200
+++ linux-2.6.22-rc2-ext4-1/fs/libfs.c	2007-05-25 18:01:56.000000000 +0200
@@ -232,6 +232,7 @@
 	root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
 	root->i_uid = root->i_gid = 0;
 	root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
+	root->i_version = 1;
 	dentry = d_alloc(NULL, &d_name);
 	if (!dentry) {
 		iput(root);
@@ -255,6 +256,8 @@
 	struct inode *inode = old_dentry->d_inode;
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+	inode->i_version++;
+	dir->i_version++;
 	inc_nlink(inode);
 	atomic_inc(&inode->i_count);
 	dget(dentry);
@@ -287,6 +290,8 @@
 	struct inode *inode = dentry->d_inode;
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+	inode->i_version++;
+	dir->i_version++;
 	drop_nlink(inode);
 	dput(dentry);
 	return 0;
@@ -323,6 +328,8 @@
 	old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime =
 		new_dir->i_mtime = inode->i_ctime = CURRENT_TIME;
+	old_dir->i_version++;
+	new_dir->i_version++;
 
 	return 0;
 }
@@ -399,6 +406,7 @@
 	inode->i_uid = inode->i_gid = 0;
 	inode->i_blocks = 0;
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+	inode->i_version = 1;
 	inode->i_op = &simple_dir_inode_operations;
 	inode->i_fop = &simple_dir_operations;
 	inode->i_nlink = 2;
@@ -427,6 +435,7 @@
 		inode->i_uid = inode->i_gid = 0;
 		inode->i_blocks = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+		inode->i_version = 1;
 		inode->i_fop = files->ops;
 		inode->i_ino = i;
 		d_add(dentry, inode);
Index: linux-2.6.22-rc2-ext4-1/fs/pipe.c
===================================================================
--- linux-2.6.22-rc2-ext4-1.orig/fs/pipe.c	2007-05-25 18:01:51.000000000 +0200
+++ linux-2.6.22-rc2-ext4-1/fs/pipe.c	2007-05-25 18:01:56.000000000 +0200
@@ -882,6 +882,7 @@
 	inode->i_uid = current->fsuid;
 	inode->i_gid = current->fsgid;
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+	inode->i_version = 1;
 
 	return inode;
Index: linux-2.6.22-rc2-ext4-1/include/linux/fs.h
===================================================================
--- linux-2.6.22-rc2-ext4-1.orig/include/linux/fs.h	2007-05-25 18:01:51.000000000 +0200
+++ linux-2.6.22-rc2-ext4-1/include/linux/fs.h	2007-05-25 18:01:56.000000000 +0200
@@ -549,7 +549,7 @@
 	uid_t			i_uid;
 	gid_t			i_gid;
 	dev_t			i_rdev;
-	unsigned long		i_version;
+	uint64_t		i_version;
 	loff_t			i_size;
 #ifdef __NEED_I_SIZE_ORDERED
 	seqcount_t		i_size_seqcount;
[patch 2/2] i_version update - ext4 part
The patch is on top of the ext4 tree: http://repo.or.cz/w/ext4-patch-queue.git

In this part, the i_version counter is stored in two 32-bit fields of the ext4_inode structure: osd1.linux1.l_i_version and i_version_hi. I included the ext4_expand_inode_extra_isize patch, which does part of the job, checking whether there is enough room for extra fields in the inode (i_version_hi). The other patch increments the counter on inode modifications and sets it on inode creation. This patch is on top of the nanosecond timestamp and i_version_hi patches.

This patch adds 64-bit inode version support to ext4. The lower 32 bits are stored in the osd1.linux1.l_i_version field while the high 32 bits are stored in the i_version_hi field newly created in the ext4_inode.

We need to make sure that existing filesystems can also make use of the new fields that have been added to the inode. We use s_want_extra_isize and s_min_extra_isize to decide by how much we should expand the inode. If the EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set then we expand by max(s_want_extra_isize, s_min_extra_isize, sizeof(ext4_inode) - EXT4_GOOD_OLD_INODE_SIZE) bytes. It is actually still an open question whether users should be able to set s_*_extra_isize smaller than the known fields or not.

This patch also adds the functionality to expand inodes to include the newly added fields. We start by trying to expand by s_want_extra_isize bytes and if that fails we try to expand by s_min_extra_isize bytes. This is done by changing i_extra_isize if enough space is available in the inode and no EAs are present. If EAs are present and there is enough space in the inode, then the EAs in the inode are shifted to make space. If enough space is not available in the inode due to the EAs, then one or more EAs are shifted to the external EA block. In the worst case, when even the external EA block does not have enough space, we inform the user that some EAs would need to be deleted or s_min_extra_isize would have to be reduced.
This would be online expansion of inodes. I am also working on adding an expand_inodes option to e2fsck which will expand all the used inodes.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]

Index: linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c
===================================================================
--- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/inode.c	2007-05-25 17:12:37.000000000 +0200
+++ linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c	2007-05-25 17:12:41.000000000 +0200
@@ -2709,6 +2709,13 @@
 	EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode);
 	EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
 
+	ei->i_fs_version = le32_to_cpu(raw_inode->i_disk_version);
+	if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
+		if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
+			ei->i_fs_version |=
+				(__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
+	}
+
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
 		inode->i_fop = &ext4_file_operations;
@@ -2852,8 +2859,14 @@
 	} else for (block = 0; block < EXT4_N_BLOCKS; block++)
 		raw_inode->i_block[block] = ei->i_data[block];
 
-	if (ei->i_extra_isize)
+	raw_inode->i_disk_version = cpu_to_le32(ei->i_fs_version);
+	if (ei->i_extra_isize) {
+		if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) {
+			raw_inode->i_version_hi =
+				cpu_to_le32(ei->i_fs_version >> 32);
+		}
 		raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
+	}
 
 	BUFFER_TRACE(bh, "call ext4_journal_dirty_metadata");
 	rc = ext4_journal_dirty_metadata(handle, bh);
@@ -3127,10 +3140,32 @@
 int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode)
 {
 	struct ext4_iloc iloc;
-	int err;
+	int err, ret;
+	static int expand_message;
 
 	might_sleep();
 	err = ext4_reserve_inode_write(handle, inode, &iloc);
+	if (EXT4_I(inode)->i_extra_isize < EXT4_SB(inode->i_sb)->s_want_extra_isize &&
+	    !(EXT4_I(inode)->i_state & EXT4_STATE_NO_EXPAND)) {
+		/* We need extra buffer credits since we may write into EA block
+		 * with this same handle */
+		if ((jbd2_journal_extend(handle,
+				EXT4_DATA_TRANS_BLOCKS(inode->i_sb))) == 0) {
+			ret = ext4_expand_extra_isize(inode,
+				EXT4_SB(inode->i_sb)->s_want_extra_isize,
+				&iloc, handle);
+			if (ret) {
+				EXT4_I(inode)->i_state |= EXT4_STATE_NO_EXPAND;
+				if (!expand_message) {
+					ext4_warning(inode->i_sb, __FUNCTION__,
+						"Unable to expand inode %lu. Delete some "
+						"EAs or run e2fsck.", inode->i_ino);
+					expand_message = 1;
+				}
+			}
+		}
+	}
 	if (!err)
 		err = ext4_mark_iloc_dirty(handle, inode, &iloc);
 	return err;
Index: linux-2.6.22-rc2-ext4-1/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.22-rc2-ext4-1.orig/include/linux/ext4_fs.h	2007-05-25 17:12:37.000000000 +0200
+++ linux-2.6.22-rc2-ext4-1/include/linux/ext4_fs.h	2007-05-25 17:12:41.000000000 +0200
@@ -202,6 +202,7 @@
 #define EXT4_STATE_JDATA
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
--- Jeremy Maitin-Shepard [EMAIL PROTECTED] wrote:

> Casey Schaufler [EMAIL PROTECTED] writes:
>> On Fedora zcat, gzip and gunzip are all links to the same file. I can
>> imagine (although it is a bit of a stretch) allowing a set of users
>> access to gunzip but not gzip (or the other way around). There are
>> probably more sophisticated programs that have different behavior
>> based on the name they're invoked by that would provide a more
>> compelling argument, assuming of course that you buy into the
>> behavior-based-on-name scheme. What I think I'm suggesting is that
>> AppArmor might be useful in addressing the fact that a file with
>> multiple hard links is necessarily constrained to have the same access
>> control on each of those names. That assumes one believes that such
>> behavior is flawed, and I'm not going to try to argue that. The
>> question was about an example, and there is one.
>
> This doesn't work. The behavior depends on argv[0], which is not
> necessarily the same as the name of the file.

Sorry, but I don't understand your objection. If AppArmor is configured to allow everyone access to /bin/gzip but only some people access to /bin/gunzip and (important detail) the single binary uses argv[0] as documented and (another important detail) there aren't other links named gunzip to the binary (ok, that's lots of if's) you should be fine.

I suppose you could make a shell that lies to exec, but the AppArmor code could certainly check for that in exec by enforcing the argv[0] convention. It would be perfectly reasonable for a system that is so dependent on pathnames to require that.

Casey Schaufler
[EMAIL PROTECTED]
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
Casey Schaufler [EMAIL PROTECTED] writes:
> --- Jeremy Maitin-Shepard [EMAIL PROTECTED] wrote:
>> Casey Schaufler [EMAIL PROTECTED] writes:
>>> On Fedora zcat, gzip and gunzip are all links to the same file. I can
>>> imagine (although it is a bit of a stretch) allowing a set of users
>>> access to gunzip but not gzip (or the other way around). There are
>>> probably more sophisticated programs that have different behavior
>>> based on the name they're invoked by that would provide a more
>>> compelling argument, assuming of course that you buy into the
>>> behavior-based-on-name scheme. What I think I'm suggesting is that
>>> AppArmor might be useful in addressing the fact that a file with
>>> multiple hard links is necessarily constrained to have the same
>>> access control on each of those names. That assumes one believes that
>>> such behavior is flawed, and I'm not going to try to argue that. The
>>> question was about an example, and there is one.
>>
>> This doesn't work. The behavior depends on argv[0], which is not
>> necessarily the same as the name of the file.
>
> Sorry, but I don't understand your objection. If AppArmor is configured
> to allow everyone access to /bin/gzip but only some people access to
> /bin/gunzip and (important detail) the single binary uses argv[0] as
> documented and (another important detail) there aren't other links
> named gunzip to the binary (ok, that's lots of if's) you should be
> fine. I suppose you could make a shell that lies to exec, but the
> AppArmor code could certainly check for that in exec by enforcing the
> argv[0] convention. It would be perfectly reasonable for a system that
> is so dependent on pathnames to require that.

Well, my point was exactly that AppArmor doesn't (as far as I know) do anything to enforce the argv[0] convention, nor would it in general prevent a confined program from making a symlink or hard link.

Even disregarding that, it seems very fragile in general to make an suid program (there would be no point in confining the execution of a non-suid program) perform essentially access control based on argv[0].

--
Jeremy Maitin-Shepard
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
--- Jeremy Maitin-Shepard [EMAIL PROTECTED] wrote:
> ... Well, my point was exactly that AppArmor doesn't (as far as I know)
> do anything to enforce the argv[0] convention,

Sounds like an opportunity for improvement then.

> nor would it in general prevent a confined program from making a
> symlink or hard link.
>
> Even disregarding that, it seems very fragile in general to make an
> suid program (there would be no point in confining the execution of a
> non-suid program) perform essentially access control based on argv[0].

I think that you're being generous calling it fragile, but that's my view, and I've seen much worse. I agree that it would be a Bad Idea, but the fact that I think it's a bad idea is not going to prevent very many people from trying it, and for those that do try it name-based access control might seem like just the ticket to complete their nefarious schemes. Remember that security is a subjective thing, and using argv[0] and AppArmor together might make some people feel better.

Casey Schaufler
[EMAIL PROTECTED]
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
On Friday 25 May 2007 19:43, Casey Schaufler wrote:
> [...] but the AppArmor code could certainly check for that in exec by
> enforcing the argv[0] convention. It would be perfectly reasonable for
> a system that is so dependent on pathnames to require that.

Hmm ... that's a strange idea. AppArmor cannot assume anything about argv[0], and it would be a really bad idea to change the well-established semantics of argv[0]. There is no actual need for looking at argv[0], though: AppArmor decides based on the actual pathname of the executable...

Thanks,
Andreas
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
--- Andreas Gruenbacher [EMAIL PROTECTED] wrote:
> On Friday 25 May 2007 19:43, Casey Schaufler wrote:
>> [...] but the AppArmor code could certainly check for that in exec by
>> enforcing the argv[0] convention. It would be perfectly reasonable for
>> a system that is so dependent on pathnames to require that.
>
> Hmm ... that's a strange idea.

Yeah, I get that a lot.

> AppArmor cannot assume anything about argv[0], and it would be a really
> bad idea to change the well-established semantics of argv[0]. There is
> no actual need for looking at argv[0], though: AppArmor decides based
> on the actual pathname of the executable...

Right. My point was that if you wanted to use the gzip/gunzip example of a file with two names being treated differently based on the name accessed as an argument for AppArmor you could. If you don't want to, that's ok too. Jeremy raised a reasonable objection, and AppArmor could address it if y'all chose to do so. I seriously doubt that enforcing the argv[0] convention would break much, and I also expect that if it did there's a Consultant's Retirement to be made fixing the security hole it points out.

Casey Schaufler
[EMAIL PROTECTED]
Re: [patch 00/41] Buffered write deadlock fix and new aops for 2.6.22-rc2-mm1
On Fri, May 25, 2007 at 10:21:44PM +1000, [EMAIL PROTECTED] wrote:
> Still unfortunately missing the OCFS2 and GFS2 conversions, which
> allowed us to remove a lot of code -- I won't ask the maintainers to
> redo them either until the patchset gets somewhere.

Nonetheless, I'll give this a go and try to give you an ocfs2 patch sometime next week. It should be much easier this time anyway. It turns out that the write_begin()/write_end() style interface works very well for ocfs2 internally, so I went back and merged most of that work into ocfs2.git for some of my shared writeable mmap work anyway.

--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On May 25, 2007 17:58 +1000, Neil Brown wrote:
> These devices would find it very hard to support BIO_RW_BARRIER. Doing
> this would require keeping track of all in-flight requests (which some,
> possibly all, of the above don't) and then:
> When a BIO_RW_BARRIER request arrives:
>   wait for all pending writes to complete
>   call blkdev_issue_flush on all devices
>   issue the barrier write to the target device(s) as BIO_RW_BARRIER,
>   if that is -EOPNOTSUPP, re-issue, wait, flush.

We noticed when testing the SLES10 kernel (which has barriers enabled by default) that ext3 write throughput went from about 170MB/s to about 130MB/s (on high-end RAID storage using the no-op scheduler). The reason (as far as we could tell) is that the barriers are implemented by flushing and waiting for all previously submitted IOs to finish, but all that ext3/jbd really cares about is that the journal blocks are safely on disk.

Since the journal blocks are only a small fraction of the total IO in flight, the barrier + write cache ends up being a lot worse than just doing synchronous IO with the write cache disabled, because no new IO can be submitted past the barrier, and since that IO is large and contiguous it might complete much faster than the scattered metadata updates that are also being checkpointed to disk from the previous transactions. With jbd there can be both a running and a committing transaction, and multiple checkpointing transactions, and the use of barriers breaks this important optimization.

If ext3 used an external journal this problem would be avoided, but then there isn't really a need for barriers in the first place, since the jbd code already will handle the wait for the commit block itself. We've got a pretty-much complete version of the ext3 journal checksumming patch that avoids the need to do the pre-commit barrier, since the checksum can verify at recovery time whether all of the transaction's blocks made it to disk or not (which is what the commit block is all about in the end).
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
Hello.

Casey Schaufler wrote:
> Sorry, but I don't understand your objection. If AppArmor is configured
> to allow everyone access to /bin/gzip but only some people access to
> /bin/gunzip and (important detail) the single binary uses argv[0] as
> documented and (another important detail) there aren't other links
> named gunzip to the binary (ok, that's lots of if's) you should be fine.

The argv[0] defines the default behavior of hard linked or symbolic linked programs, but the behavior can be overridden using command-line options. If you want to allow access to "/bin/gzip" but deny access to "/bin/gunzip", you also need to deny access to "/bin/gzip -d", "/bin/gzip --decompress" and "/bin/gzip --uncompress". It is impossible to do so because the options that override the default behavior depend on the program's design, and you can't know what programs and what options are there in the system. Even if you knew all programs and all options in the system, it would be a too tough job to find and reject options that override the default behavior in the kernel space.

>> Well, my point was exactly that AppArmor doesn't (as far as I know) do
>> anything to enforce the argv[0] convention,
>
> Sounds like an opportunity for improvement then.

There are (I think) three types of program invocation.

(1) Invocation of hard linked programs. /bin/gzip and /bin/gunzip and /bin/zcat are hard links. There is no problem because you can know which pathname was requested using d_namespace_path() with struct linux_binprm->file.

(2) Invocation of symbolic linked programs. /sbin/pidof is a symbolic link to /sbin/killall. There is a problem because you can't know which pathname was requested using d_namespace_path() with struct linux_binprm->file, because the symbolic links were already dereferenced inside open_exec(). To know which pathname was requested, you need to look up using struct linux_binprm->filename without LOOKUP_FOLLOW and then use d_namespace_path().
Although there is a race window in which the pathname that the symbolic link struct linux_binprm->filename points to may change, it is inevitable because you can't get the dentry and vfsmount both without the LOOKUP_FOLLOW flag and with the LOOKUP_FOLLOW flag at the same time.

(3) Invocation of dynamically created programs with random names. /usr/sbin/logrotate creates files patterned /tmp/logrotate.?? and executes these dynamically created files. To keep execution of these dynamically created files under control, you need to aggregate the pathnames of these files. AppArmor can't define a profile if the pathname of the programs is random, can it?

Usually argv[0] and struct linux_binprm->filename are the same, but if you want to do something with argv[0], you will need to handle case (2) to see whether argv[0] and struct linux_binprm->filename are the same.

Thanks.
Re: [PATCH] AFS: Implement file locking
On Thu, May 24, 2007 at 05:55:54PM +0100, David Howells wrote:
> +/*
> + * initialise the lock manager thread if it isn't already running
> + */
> +static int afs_init_lock_manager(void)
> +{
> +	if (!afs_lock_manager) {
> +		afs_lock_manager =
> +			create_singlethread_workqueue("kafs_lockd");
> +		if (!afs_lock_manager)
> +			return -ENOMEM;
> +	}
> +	return 0;

Doesn't this need some locking?

> +/*
> + * request a lock on a file on the server
> + */
> +static int afs_do_setlk(struct file *file, struct file_lock *fl)
> +{
> +	struct afs_vnode *vnode = AFS_FS_I(file->f_mapping->host);
> +	afs_lock_type_t type;
> +	struct key *key = file->private_data;
> +	int ret;
> +
> +	_enter("{%x:%u},%u", vnode->fid.vid, vnode->fid.vnode, fl->fl_type);
> +
> +	/* only whole-file locks are supported */
> +	if (fl->fl_start != 0 || fl->fl_end != OFFSET_MAX)
> +		return -EINVAL;

Do you allow upgrades and downgrades? (Just curious.)

> +	/* if we've already got a readlock on the server and no waiting
> +	 * writelocks, then we might be able to instantly grant another

Is that comment correct? (You don't really test for waiting writelocks, do you?)

> +	 * readlock */
> +	if (type == AFS_LOCK_READ &&
> +	    vnode->flags & (1 << AFS_VNODE_READLOCKED)) {
> +		_debug("instant readlock");
> +		ASSERTCMP(vnode->flags &
> +			  ((1 << AFS_VNODE_LOCKING) |
> +			   (1 << AFS_VNODE_WRITELOCKED)), ==, 0);
> +		ASSERT(!list_empty(&vnode->granted_locks));
> +		goto sharing_existing_lock;
> +	}
> +	}

--b.
Re: [PATCH] AFS: Implement file locking
On May 25, 2007, at 22:23:42, J. Bruce Fields wrote:
> On Thu, May 24, 2007 at 05:55:54PM +0100, David Howells wrote:
>> +	/* only whole-file locks are supported */
>> +	if (fl->fl_start != 0 || fl->fl_end != OFFSET_MAX)
>> +		return -EINVAL;
>
> Do you allow upgrades and downgrades? (Just curious.)

I was actually under the impression that OpenAFS had support for byte-range locking (as well as lock upgrade/downgrade); though IIRC there was some secondary protocol. That's probably why the support is so basic at the moment; David's getting the basics in first and the more complicated stuff can come later.

Cheers,
Kyle Moffett
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
On May 24, 2007, at 14:58:41, Casey Schaufler wrote:
> On Fedora zcat, gzip and gunzip are all links to the same file. I can
> imagine (although it is a bit of a stretch) allowing a set of users
> access to gunzip but not gzip (or the other way around).

That is a COMPLETE straw-man argument. I can override your check with this absolutely trivial perl code:

  exec { "/usr/bin/gunzip" } "gzip", "-9", "some/file/to.gz";

Pathname-based checks are pretty fundamentally insecure. If you want to protect a name, then you should tag the name with security attributes (IE: AppArmor). On the other hand, if you actually want to protect the _data_, then tagging the _name_ is flawed; tag the *DATA* instead.

Cheers,
Kyle Moffett
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
Casey Schaufler wrote:
> --- Andreas Gruenbacher [EMAIL PROTECTED] wrote:
>> AppArmor cannot assume anything about argv[0], and it would be a
>> really bad idea to change the well-established semantics of argv[0].
>> There is no actual need for looking at argv[0], though: AppArmor
>> decides based on the actual pathname of the executable...
>
> Right. My point was that if you wanted to use the gzip/gunzip example
> of a file with two names being treated differently based on the name
> accessed as an argument for AppArmor you could.

AppArmor detects the pathname of the file exec'd at the time the parent exec's it, and not anything inside the child involving argv[0]. As such, AA can detect whether you did exec("gzip") or exec("gunzip") and apply the policy relevant to the program. It could apply different policies to each of them, so whether it has access to /tmp/mumble/barf depends on whether you called it 'gzip' or 'gunzip'.

Caveat: it makes no sense to profile either gzip or gunzip in the AppArmor model, so I won't defend what kind of policy you would put on them. Finally, AA doesn't care what the contents of the executable are. We assume that it is a copy of metasploit or something, and confine it to access only the resources that the policy says.

> If you don't want to, that's ok too. Jeremy raised a reasonable
> objection, and AppArmor could address it if y'all chose to do so. I
> seriously doubt that enforcing the argv[0] convention would break much,
> and I also expect that if it did there's a Consultant's Retirement to
> be made fixing the security hole it points out.

AppArmor does address it, and I hope this explains how we detect which of multiple hard links to a file you used to access the file, without mucking about with argv[0].

Crispin

--
Crispin Cowan, Ph.D.
http://crispincowan.com/~crispin/
Director of Software Engineering    http://novell.com
Security: It's not linear