Re: it seems Evolution removes the Tabs

2007-05-25 Thread coly
Andrew,

Thanks for your information :-)

Coly


On Fri, 2007-05-25 at 16:33 +1000, andrew hendry wrote:
 select your whole mail and use pre-format in evolution,
 or change it to pre-format and do Insert->Text File.
 this should send it with tabs intact. the confusing bit, i think, is if
 you're testing it by sending it to yourself, then you can't see the tabs
 again.
 
 Read it in another mailer, something like sylpheed or mutt to see if
 the tabs are really there.
 
 On 5/25/07, coly [EMAIL PROTECTED] wrote:
  Hi,
 
  I tested again; it seems Evolution replaces the Tabs with blanks.
  How can I resolve this issue in Evolution?  I am trying :-)
 
  Coly
 
  On Fri, 2007-05-25 at 07:52 +0200, Jan Engelhardt wrote:
   On May 25 2007 09:30, WANG Cong wrote:
   
    Yes, I found all TABs gone when I received the mail. When I post the next
    version of the patch, I will test by sending it to myself first :-)
   
   Thanks for your information.
   
   Blame Gmail.
   
   I am using gmail too. That's not gmail's fault,
  
   Then it is one of these:
   - gmail's default settings for web input sucks or
  
   - the web browser reformats it
 (not so much - pastebin.ca suffers from something similar, but *not the
 same*; in that it translates all tabs into spaces, but at least it keeps
 the width.) or
  
   - you are using your own client, and directly SMTPing gmail servers,
 in which case unwanted reformatting by broken MTAs can be bypassed.
  
   I think your email client sucks.
   So which email client are you using, coly? I recommend mutt to you. ;)
  
   X-Mailer:  Evolution 2.6.0
  
   Hm, this looks like another of these Thunderbird cases. (Means,
   Thunderbird users also get their patches wrapped and twangled unless
   they set some option that is not on by default.)
  
  
 Jan
 
 



Re: [PATCH] AFS: Implement file locking

2007-05-25 Thread Jiri Slaby
David Howells napsal(a):
 Implement file locking for AFS.
 
 Signed-off-by: David Howells [EMAIL PROTECTED]
 ---
 
  fs/afs/Makefile|1 
  fs/afs/afs.h   |8 +
  fs/afs/afs_fs.h|3 
  fs/afs/callback.c  |3 
  fs/afs/dir.c   |1 
  fs/afs/file.c  |2 
  fs/afs/flock.c |  558 
 
  fs/afs/fsclient.c  |  155 ++
  fs/afs/internal.h  |   30 +++
  fs/afs/main.c  |1 
  fs/afs/misc.c  |1 
  fs/afs/super.c |3 
  fs/afs/vnode.c |  130 +++-
  include/linux/fs.h |4 
  14 files changed, 885 insertions(+), 15 deletions(-)
 
 diff --git a/fs/afs/Makefile b/fs/afs/Makefile
 index 73ce561..a666710 100644
 --- a/fs/afs/Makefile
 +++ b/fs/afs/Makefile
 @@ -8,6 +8,7 @@ kafs-objs := \
   cmservice.o \
   dir.o \
   file.o \
 + flock.o \
   fsclient.o \
   inode.o \
   main.o \
 diff --git a/fs/afs/afs.h b/fs/afs/afs.h
 index 2452579..c548aa3 100644
 --- a/fs/afs/afs.h
 +++ b/fs/afs/afs.h
 @@ -37,6 +37,13 @@ typedef enum {
   AFS_FTYPE_SYMLINK   = 3,
  } afs_file_type_t;
  
 +typedef enum {
 + AFS_LOCK_READ   = 0,/* read lock request */
 + AFS_LOCK_WRITE  = 1,/* write lock request */
 +} afs_lock_type_t;

Why typedef?

regards,
-- 
http://www.fi.muni.cz/~xslaby/Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8  22A0 32CC 55C3 39D4 7A7E


[RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Neil Brown

This mail is about an issue that has been of concern to me for quite a
while and I think it is (well past) time to air it more widely and try
to come to a resolution.

This issue is how write barriers (the block-device kind, not the
memory-barrier kind) should be handled by the various layers.

The following is my understanding, which could well be wrong in
various specifics.  Corrections and other comments are more than
welcome.



What are barriers?
==
Barriers (as generated by requests with BIO_RW_BARRIER) are intended
to ensure that the data in the barrier request is not visible until
all writes submitted earlier are safe on the media, and that the data
is safe on the media before any subsequently submitted requests
are visible on the device.

This is achieved by tagging requests in the elevator (or any other
request queue) so that no re-ordering is performed around a
BIO_RW_BARRIER request, and by sending appropriate commands to the
device so that any write-behind caching is defeated by the barrier
request.

Alongside BIO_RW_BARRIER is blkdev_issue_flush, which calls
q->issue_flush_fn.  This can be used to achieve similar effects.

There is no guarantee that a device can support BIO_RW_BARRIER - it is
always possible that a request will fail with EOPNOTSUPP.

Conversely, blkdev_issue_flush must be supported on any device that
uses write-behind caching (if it cannot be supported, then
write-behind caching should be turned off, at least by default).

We can think of there being three types of devices:
 
1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
  there is it is non-volatile.  Once a write completes it is 
  completely safe.  Such a device does not require barriers
  or ->issue_flush_fn, and can respond to them either by a
  no-op or with -EOPNOTSUPP (the former is preferred).

2/ FLUSHABLE.
  A FLUSHABLE device may have a volatile write-behind cache.
  This cache can be flushed with a call to blkdev_issue_flush.
  It may not support barrier requests.

3/ BARRIER.
  A BARRIER device supports both blkdev_issue_flush and
  BIO_RW_BARRIER.  Either may be used to synchronise any
  write-behind cache to non-volatile storage (media).

Handling of SAFE and FLUSHABLE devices is essentially the same and can
work on a BARRIER device.  The BARRIER device has the option of more
efficient handling.
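
To restate those three classes in code form (illustrative only -- these
names are not kernel API, just a compact summary of the descriptions above):

enum wb_device_class {
        WB_SAFE,        /* no volatile write-behind cache; barrier and
                         * ->issue_flush_fn can be no-ops */
        WB_FLUSHABLE,   /* volatile cache that blkdev_issue_flush can drain;
                         * BIO_RW_BARRIER may fail with -EOPNOTSUPP */
        WB_BARRIER,     /* supports BIO_RW_BARRIER as well as
                         * blkdev_issue_flush */
};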

How does a filesystem use this?
===

A filesystem will often have a concept of a 'commit' block which makes
an assertion about the correctness of other blocks in the filesystem.
In the most gross sense, this could be the writing of the superblock
of an ext2 filesystem, with the dirty bit clear.  This write commits
all other writes to the filesystem that precede it.

More subtle/useful is the commit block in a journal as with ext3 and
others.  This write commits some number of preceding writes in the
journal or elsewhere.

The filesystem will want to ensure that all preceding writes are safe
before writing the barrier block.  There are two ways to achieve this.

1/  Issue all 'preceding writes', wait for them to complete (bi_endio
   called), call blkdev_issue_flush, issue the commit write, wait
   for it to complete, call blkdev_issue_flush a second time.
   (This is needed for FLUSHABLE)

2/ Set the BIO_RW_BARRIER bit in the write request for the commit
block.
   (This is more efficient on BARRIER).

The second, while much easier, can fail.  So a filesystem should be
prepared to deal with that failure by falling back to the first
option.
Thus the general sequence might be:

  a/ issue all preceding writes.
  b/ issue the commit write with BIO_RW_BARRIER
  c/ wait for the commit to complete.
 If it was successful - done.
 If it failed other than with EOPNOTSUPP, abort
 else continue
  d/ wait for all 'preceding writes' to complete
  e/ call blkdev_issue_flush
  f/ issue commit write without BIO_RW_BARRIER
  g/ wait for commit write to complete
   if it failed, abort
  h/ call blkdev_issue
  DONE

Steps b and c can be left out if it is known that the device does not
support barriers.  The only way to discover this is to try it and see if it
fails.  A rough sketch of this sequence in code follows below.
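
(The sketch below is illustrative only: the submit_*/wait_* helpers are
made-up placeholders for whatever the filesystem actually does; only
BIO_RW_BARRIER, EOPNOTSUPP and blkdev_issue_flush are real.)

/* Hypothetical sketch of steps a-h above; helper names are invented. */
static int commit_transaction(struct block_device *bdev)
{
        int err;

        submit_preceding_writes();                      /* a */
        submit_commit_write(BIO_RW_BARRIER);            /* b */
        err = wait_for_commit();                        /* c */
        if (err != -EOPNOTSUPP)
                return err;     /* done, or abort on a real error */

        /* Fallback for a device that rejected the barrier: */
        wait_for_preceding_writes();                    /* d */
        blkdev_issue_flush(bdev, NULL);                 /* e */
        submit_commit_write(0);                         /* f */
        err = wait_for_commit();                        /* g */
        if (err)
                return err;                             /* abort */
        return blkdev_issue_flush(bdev, NULL);          /* h */
}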

I don't think any filesystem follows all these steps.

 ext3 has the right structure, but it doesn't include steps e and h.
 reiserfs is similar.  It does have a call to blkdev_issue_flush, but 
  that is only on the fsync path, so it isn't really protecting
  general journal commits.
 XFS - I'm less sure.  I think it does 'a' then 'd', then 'b' or 'f'
   depending on whether it thinks the device handles barriers,
   and finally 'g'.

 I haven't looked at other filesystems.

So for devices that support BIO_RW_BARRIER, and for devices that don't
need any flush, they work OK, but for devices that need flushing, but
don't support BIO_RW_BARRIER, none of them work.  This should be easy
to fix.



Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-05-25 Thread Toshiharu Harada

Hi,

2007/5/24, James Morris [EMAIL PROTECTED]:

I can restate my question and ask why you'd want a security policy like:

 Subject 'sysadmin' has:
read access to /etc/shadow
read/write access to /views/sysadmin/etc/shadow

where the objects referenced by the paths are identical and visible to the
subject along both paths, in keeping with your description of policy may
allow access to some locations but not to others ?


If I understand correctly, the original issue was whether to allow passing
the vfsmount to the inode_create LSM hook or not, which is independent of
AA or pathname-based MAC, I think.

It is proven that Linux can be used without that change; however, it is
also clear that the current LSM causes the ambiguities the AA people have
explained. Clearing up these ambiguities is an obvious gain for Linux and
will also benefit auditing, beyond pathname-based MAC.

So here's my opinion: unless somebody can explain a clear reason (or need)
to keep these ambiguities unresolved, we should consider merging
the proposal.

Thanks.

--
Toshiharu Harada
[EMAIL PROTECTED]


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread David Chinner
On Fri, May 25, 2007 at 05:58:25PM +1000, Neil Brown wrote:
 We can think of there being three types of devices:
  
 1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
   there is it is non-volatile.  Once a write completes it is 
   completely safe.  Such a device does not require barriers
   or ->issue_flush_fn, and can respond to them either by a
 no-op or with -EOPNOTSUPP (the former is preferred).
 
 2/ FLUSHABLE.
   A FLUSHABLE device may have a volatile write-behind cache.
   This cache can be flushed with a call to blkdev_issue_flush.
 It may not support barrier requests.

So returns -EOPNOTSUPP to any barrier request?

 3/ BARRIER.
 A BARRIER device supports both blkdev_issue_flush and
   BIO_RW_BARRIER.  Either may be used to synchronise any
 write-behind cache to non-volatile storage (media).
 
 Handling of SAFE and FLUSHABLE devices is essentially the same and can
 work on a BARRIER device.  The BARRIER device has the option of more
 efficient handling.
 
 How does a filesystem use this?
 ===

 
 The filesystem will want to ensure that all preceding writes are safe
 before writing the barrier block.  There are two ways to achieve this.

Three, actually.

 1/  Issue all 'preceding writes', wait for them to complete (bi_endio
called), call blkdev_issue_flush, issue the commit write, wait
for it to complete, call blkdev_issue_flush a second time.
(This is needed for FLUSHABLE)

*nod*

 2/ Set the BIO_RW_BARRIER bit in the write request for the commit
 block.
(This is more efficient on BARRIER).

*nod*

3/ Use a SAFE device.

 The second, while much easier, can fail.

So we do a test I/O to see if the device supports them before
enabling that mode.  But, as we've recently discovered, this is not
sufficient to detect *correctly functioning* barrier support.

 So a filesystem should be
 prepared to deal with that failure by falling back to the first
 option.

I don't buy that argument.

 Thus the general sequence might be:
 
   a/ issue all preceding writes.
   b/ issue the commit write with BIO_RW_BARRIER

At this point, the filesystem has done everything it needs to ensure
that the block layer has been informed of the I/O ordering
requirements. Why should the filesystem now have to detect block
layer breakage, and then use a different block layer API to issue
the same I/O under the same constraints?

   c/ wait for the commit to complete.
  If it was successful - done.
  If it failed other than with EOPNOTSUPP, abort
  else continue
   d/ wait for all 'preceding writes' to complete
   e/ call blkdev_issue_flush
   f/ issue commit write without BIO_RW_BARRIER
   g/ wait for commit write to complete
if it failed, abort
   h/ call blkdev_issue
_flush?

   DONE
 
 steps b and c can be left out if it is known that the device does not
 support barriers.  The only way to discover this to try and see if it
 fails.

That's a very linear, single-threaded way of looking at it... ;)

 I don't think any filesystem follows all these steps.
 
  ext3 has the right structure, but it doesn't include steps e and h.
  reiserfs is similar.  It does have a call to blkdev_issue_flush, but 
   that is only on the fsync path, so it isn't really protecting
   general journal commits.
  XFS - I'm less sure.  I think it does 'a' then 'd', then 'b' or 'f'
depending on a whether it thinks the device handles barriers,
and finally 'g'.

That's right, except for the g (or c) bit - commit writes are
async and nothing waits for them - the io completion wakes anything
waiting on its completion.

(yes, all XFS barrier I/Os are issued async which is why having to
handle an -EOPNOTSUPP error is a real pain. The fix I currently
have is to reissue the I/O from the completion handler, which is
ugly, ugly, ugly.)

 So for devices that support BIO_RW_BARRIER, and for devices that don't
 need any flush, they work OK, but for device that need flushing, but
 don't support BIO_RW_BARRIER, none of them work.  This should be easy
 to fix.

Right - XFS as it stands was designed to work on SAFE devices, and
we've modified it to work on BARRIER devices. We don't support
FLUSHABLE devices at all.

But if the filesystem supports BARRIER devices, I don't see any
reason why a filesystem needs to be modified to support FLUSHABLE
devices - the key point being that by the time the filesystem
has issued the commit write it has already waited for all its
dependent I/O, and so all the block device needs to do is
issue flushes on either side of the commit write.

 HOW DO MD or DM USE THIS
 
 
 1/ striping devices.
  This includes md/raid0 md/linear dm-linear dm-stripe and probably
  others. 
 
These devices can easily support blkdev_issue_flush by simply
calling blkdev_issue_flush on all component 

Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Jens Axboe
On Fri, May 25 2007, David Chinner wrote:
  The second, while much easier, can fail.
 
 So we do a test I/O to see if the device supports them before
 enabling that mode.  But, as we've recently discovered, this is not
 sufficient to detect *correctly functioning* barrier support.

Right, those are two different things. But paranoia aside, will this
ever be a real life problem? I've always been of the opinion to just
nicely ignore them. We can't easily detect it and tell the user his hw
is crap.

  So a filesystem should be
  prepared to deal with that failure by falling back to the first
  option.
 
 I don't buy that argument.

The problem with Neil's reasoning there is that blkdev_issue_flush() may
use the same method as the barrier to ensure data is on platter.

A barrier write will include a flush, but it may also use the FUA bit to
ensure data is on platter. So the only situation where a fallback from a
barrier to flush would be valid, is if the device lied and told you it
could do FUA but it could not and that is the reason why the barrier
write failed. If that is the case, the block layer should stop using FUA
and fall back to flush-write-flush. And if it does that, then there's
never a valid reason to switch from using barrier writes to
blkdev_issue_flush() since both methods would either both work or both
fail.

  Thus the general sequence might be:
  
a/ issue all preceding writes.
b/ issue the commit write with BIO_RW_BARRIER
 
 At this point, the filesystem has done everything it needs to ensure
 that the block layer has been informed of the I/O ordering
 requirements. Why should the filesystem now have to detect block
 layer breakage, and then use a different block layer API to issue
 the same I/O under the same constraints?

It's not block layer breakage, it's a device issue.

  2/ Mirror devices.  This includes md/raid1 and dm-raid1.
 ..
 Hopefully this is unlikely to happen.  What device would work
 correctly with barriers once, and then not the next time?
 The answer is md/raid1.  If you remove a failed device and add a
 new device that doesn't support barriers, md/raid1 will notice and
 stop supporting barriers.
 
 In case you hadn't already guessed, I don't like this behaviour at
 all.  It makes async I/O completion of barrier I/O an ugly, messy
 business, and every place you do sync I/O completion you need to put
 special error handling.

That's unfortunately very true. It's an artifact of the sometimes
problematic device capability discovery.

 If this happens to md/raid1, then why can't it simply do a
 blkdev_issue_flush, write, blkdev_issue_flush sequence to the device
 that doesn't support barriers and then the md device *never changes
 behaviour*. Next time the filesystem is mounted, it will turn off
 barriers because they won't be supported

Because if it doesn't support barriers, blkdev_issue_flush() wouldn't
work either. At least that is the case for SATA/IDE, SCSI is somewhat
different (and has somewhat other issues).

   - Should the various filesystems be fixed as suggested above?  Is 
  someone willing to do that?
 
 Alternate viewpoint - should the block layer be fixed so that the
 filesystems only need to use one barrier API that provides static
 behaviour for the life of the mount?

blkdev_issue_flush() isn't part of the barrier API, and using it as a
work-around for a device that has barrier issues is wrong for the
reasons listed above.

The DRAIN_FUA -> DRAIN_FLUSH automatic downgrade I mentioned above
should be added, in which case blkdev_issue_flush() would never be
needed (unless you want to do a data-less barrier, and we should
probably add that specific functionality with an empty bio instead of
providing an alternate way of doing that).

-- 
Jens Axboe



[EMAIL PROTECTED]: Re: [patch 00/41] Buffered write deadlock fix and new aops for 2.6.21-mm2]

2007-05-25 Thread Nick Piggin
I actually forgot to cc linux-fsdevel on this one.

Vladimir found a corner case bug in the case of faulting source
address, which is since fixed, but might be interesting to anyone
else following development...

- Forwarded message from Nick Piggin [EMAIL PROTECTED] -

Date: Wed, 16 May 2007 09:14:06 +0200
From: Nick Piggin [EMAIL PROTECTED]
To: Vladimir V. Saveliev [EMAIL PROTECTED]
Cc: Andrew Morton [EMAIL PROTECTED]
Subject: Re: [patch 00/41] Buffered write deadlock fix and new aops for 
2.6.21-mm2
In-Reply-To: [EMAIL PROTECTED]
User-Agent: Mutt/1.5.9i

cc'ed linux-fsdevel again...

On Tue, May 15, 2007 at 10:00:38PM +0400, Vladimir V. Saveliev wrote:
 Hello
 
 On Tuesday 15 May 2007 02:42, Nick Piggin wrote:
  On Mon, May 14, 2007 at 10:28:45PM +0400, Vladimir V. Saveliev wrote:
   Hello
   
   There is a problem with new write.
   
   If you expand an empty file with truncate and then write so that in one
   write the file tail is overwritten and something is appended to the file -
   the write loops forever
   writing to the page containing the file tail. Something wrong happens writing
   to the uptodate last page of a file, I guess.
   
   I can send a simple program if necessary.
  
  Is this reiserfs with your reiserfs patch? 
 
 No, this is a common problem. I get it easily on ext3 and ext2.
 
  Yes, please send a simple 
  program and I'll have a look.

Thanks, that was really helpful!

What the program does is to write a non-faulted page into pagecache,
which means we're relying on fault_in_pages_readable to bring it in
for us. Usually it does; however, your specific write pattern required
that 2 pages be brought in, and _also_ that the total number of bytes
to write went past that 2nd page.

This caused the 1st and an Nth (2) page to be faulted in, but the
atomic copy_from_user really needed the 2nd page :)
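
For reference, here is a rough user-space reconstruction of the kind of
write pattern described above (this is only a guess at what the triggering
program looks like, not Vladimir's actual test case; the file name and
sizes are made up):

/* Hypothetical reproducer -- a reconstruction, not the original program. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 3 * 4096;
        char *buf;
        ssize_t n;
        int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return 1;
        /* expand the empty file with truncate, leaving a partial last page */
        if (ftruncate(fd, 100) < 0)
                return 1;
        /*
         * Source buffer: a fresh anonymous mmap, so none of its pages are
         * faulted in yet, and the copy eventually has to cross a source
         * page that fault_in_pages_readable() never touched.
         */
        buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;
        /* one write that overwrites the file tail and appends past it */
        if (lseek(fd, 50, SEEK_SET) < 0)
                return 1;
        n = write(fd, buf, len);        /* loops forever on the buggy kernel */
        printf("wrote %zd bytes\n", n);
        close(fd);
        return 0;
}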

This fixes it for me.
---

Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h   2007-05-16 16:58:41.0 +1000
+++ linux-2.6/include/linux/fs.h2007-05-16 16:58:47.0 +1000
@@ -419,7 +419,7 @@
 size_t iov_iter_copy_from_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i);
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(struct iov_iter *i);
 
 static inline void iov_iter_init(struct iov_iter *i,
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c 2007-05-16 14:11:18.0 +1000
+++ linux-2.6/mm/filemap.c  2007-05-16 17:03:29.0 +1000
@@ -1806,11 +1806,10 @@
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
-int iov_iter_fault_in_readable(struct iov_iter *i)
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 {
-   size_t seglen = min(i->iov->iov_len - i->iov_offset, i->count);
char __user *buf = i->iov->iov_base + i->iov_offset;
-   return fault_in_pages_readable(buf, seglen);
+   return fault_in_pages_readable(buf, bytes);
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
@@ -2102,7 +2101,7 @@
 * to check that the address is actually valid, when atomic
 * usercopies are used, below.
 */
-   if (unlikely(iov_iter_fault_in_readable(i))) {
+   if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
status = -EFAULT;
break;
}
@@ -2276,7 +2275,7 @@
 * to check that the address is actually valid, when atomic
 * usercopies are used, below.
 */
-   if (unlikely(iov_iter_fault_in_readable(i))) {
+   if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
status = -EFAULT;
break;
}

- End forwarded message -


[patch 00/41] Buffered write deadlock fix and new aops for 2.6.22-rc2-mm1

2007-05-25 Thread npiggin
Hi,

This is a resync of the new aops patches to 2.6.22-rc2-mm1

Only one more conversion is broken this time, so we're doing OK. The AFFS
compile is broken due to cont_prepare_write disappearing and me not
bringing the conversion patch up to date (which I won't do again until
something happens with this patchset -- it's only affs!). Reiser4 is
broken because it lost filemap_copy_from_user (it's deadlocky as
well, yay!).

Still unfortunately missing the OCFS2 and GFS2 conversions, which
allowed us to remove a lot of code -- I won't ask the maintainers to
redo them either until the patchset gets somewhere.

Highlight of this release is the reiserfs conversion, and the removal
of the reiserfs-specific generic_cont_expand helper. Also fixed a bug
in my pagecache directory conversions.

Please merge?
-- 



[patch 01/41] mm: revert KERNEL_DS buffered write optimisation

2007-05-25 Thread npiggin

Revert the patch from Neil Brown to optimise NFSD writev handling.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Cc: Neil Brown [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   32 +---
 1 file changed, 13 insertions(+), 19 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1948,27 +1948,21 @@ generic_file_buffered_write(struct kiocb
/* Limit the size of the copy to the caller's write size */
bytes = min(bytes, count);
 
-   /* We only need to worry about prefaulting when writes are from
-* user-space.  NFSd uses vfs_writev with several non-aligned
-* segments in the vector, and limiting to one segment a time is
-* a noticeable performance for re-write
+   /*
+* Limit the size of the copy to that of the current segment,
+* because fault_in_pages_readable() doesn't know how to walk
+* segments.
 */
-   if (!segment_eq(get_fs(), KERNEL_DS)) {
-   /*
-* Limit the size of the copy to that of the current
-* segment, because fault_in_pages_readable() doesn't
-* know how to walk segments.
-*/
-   bytes = min(bytes, cur_iov->iov_len - iov_base);
+   bytes = min(bytes, cur_iov->iov_len - iov_base);
+
+   /*
+* Bring in the user page that we will copy from _first_.
+* Otherwise there's a nasty deadlock on copying from the
+* same page as we're writing to, without it being marked
+* up-to-date.
+*/
+   fault_in_pages_readable(buf, bytes);
 
-   /*
-* Bring in the user page that we will copy from
-* _first_.  Otherwise there's a nasty deadlock on
-* copying from the same page as we're writing to,
-* without it being marked up-to-date.
-*/
-   fault_in_pages_readable(buf, bytes);
-   }
page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
if (!page) {
status = -ENOMEM;

-- 



[patch 13/41] mm: restore KERNEL_DS optimisations

2007-05-25 Thread npiggin
Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
path.

This may be a pretty questionable gain in most cases, especially after the
legacy 2copy write path is removed, but it doesn't cost much.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2123,7 +2123,7 @@ static ssize_t generic_perform_write_2co
 * cannot take a pagefault with the destination page locked.
 * So pin the source page to copy it.
 */
-   if (!PageUptodate(page)) {
+   if (!PageUptodate(page) && !segment_eq(get_fs(), KERNEL_DS)) {
unlock_page(page);
 
src_page = alloc_page(GFP_KERNEL);
@@ -2248,6 +2248,13 @@ static ssize_t generic_perform_write(str
const struct address_space_operations *a_ops = mapping-a_ops;
long status = 0;
ssize_t written = 0;
+   unsigned int flags = 0;
+
+   /*
+* Copies from kernel address space cannot fail (NFSD is a big user).
+*/
+   if (segment_eq(get_fs(), KERNEL_DS))
+   flags |= AOP_FLAG_UNINTERRUPTIBLE;
 
do {
struct page *page;
@@ -2279,7 +2286,7 @@ again:
break;
}
 
-   status = a_ops->write_begin(file, mapping, pos, bytes, 0,
+   status = a_ops->write_begin(file, mapping, pos, bytes, flags,
page, fsdata);
if (unlikely(status))
break;

-- 



[patch 19/41] xfs convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/xfs/linux-2.6/xfs_aops.c |   19 ---
 fs/xfs/linux-2.6/xfs_lrw.c  |   35 ---
 2 files changed, 24 insertions(+), 30 deletions(-)

Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
@@ -1479,13 +1479,18 @@ xfs_vm_direct_IO(
 }
 
 STATIC int
-xfs_vm_prepare_write(
+xfs_vm_write_begin(
struct file *file,
-   struct page *page,
-   unsigned intfrom,
-   unsigned intto)
+   struct address_space*mapping,
+   loff_t  pos,
+   unsignedlen,
+   unsignedflags,
+   struct page **pagep,
+   void**fsdata)
 {
-   return block_prepare_write(page, from, to, xfs_get_blocks);
+   *pagep = NULL;
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   xfs_get_blocks);
 }
 
 STATIC sector_t
@@ -1539,8 +1544,8 @@ const struct address_space_operations xf
.sync_page  = block_sync_page,
.releasepage= xfs_vm_releasepage,
.invalidatepage = xfs_vm_invalidatepage,
-   .prepare_write  = xfs_vm_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= xfs_vm_write_begin,
+   .write_end  = generic_write_end,
.bmap   = xfs_vm_bmap,
.direct_IO  = xfs_vm_direct_IO,
.migratepage= buffer_migrate_page,
Index: linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
===
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_lrw.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
@@ -134,45 +134,34 @@ xfs_iozero(
loff_t  pos,/* offset in file   */
size_t  count)  /* size of data to zero */
 {
-   unsignedbytes;
struct page *page;
struct address_space*mapping;
int status;
 
mapping = ip->i_mapping;
do {
-   unsigned long index, offset;
+   unsigned offset, bytes;
+   void *fsdata;
 
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-   index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
if (bytes > count)
bytes = count;
 
-   status = -ENOMEM;
-   page = grab_cache_page(mapping, index);
-   if (!page)
-   break;
-
-   status = mapping->a_ops->prepare_write(NULL, page, offset,
-   offset + bytes);
+   status = pagecache_write_begin(NULL, mapping, pos, bytes,
+   AOP_FLAG_UNINTERRUPTIBLE,
+   page, fsdata);
if (status)
-   goto unlock;
+   break;
 
zero_user_page(page, offset, bytes, KM_USER0);
 
-   status = mapping->a_ops->commit_write(NULL, page, offset,
-   offset + bytes);
-   if (!status) {
-   pos += bytes;
-   count -= bytes;
-   }
-
-unlock:
-   unlock_page(page);
-   page_cache_release(page);
-   if (status)
-   break;
+   status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
+   page, fsdata);
+   WARN_ON(status <= 0); /* can't return less than zero! */
+   pos += bytes;
+   count -= bytes;
+   status = 0;
} while (count);
 
return (-status);

-- 



[patch 16/41] ext2 convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/ext2/dir.c   |   56 ++--
 fs/ext2/ext2.h  |3 +++
 fs/ext2/inode.c |   24 +---
 3 files changed, 54 insertions(+), 29 deletions(-)

Index: linux-2.6/fs/ext2/inode.c
===
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -726,18 +726,21 @@ ext2_readpages(struct file *file, struct
return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
 }
 
-static int
-ext2_prepare_write(struct file *file, struct page *page,
-   unsigned from, unsigned to)
+int __ext2_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page,from,to,ext2_get_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ext2_get_block);
 }
 
 static int
-ext2_nobh_prepare_write(struct file *file, struct page *page,
-   unsigned from, unsigned to)
+ext2_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return nobh_prepare_write(page,from,to,ext2_get_block);
+   *pagep = NULL;
+   return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
 }
 
 static int ext2_nobh_writepage(struct page *page,
@@ -773,8 +776,8 @@ const struct address_space_operations ex
.readpages  = ext2_readpages,
.writepage  = ext2_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = ext2_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= ext2_write_begin,
+   .write_end  = generic_write_end,
.bmap   = ext2_bmap,
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
@@ -791,8 +794,7 @@ const struct address_space_operations ex
.readpages  = ext2_readpages,
.writepage  = ext2_nobh_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = ext2_nobh_prepare_write,
-   .commit_write   = nobh_commit_write,
+   /* XXX: todo */
.bmap   = ext2_bmap,
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
Index: linux-2.6/fs/ext2/dir.c
===
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -22,7 +22,9 @@
  */
 
 #include "ext2.h"
+#include <linux/buffer_head.h>
 #include <linux/pagemap.h>
+#include <linux/swap.h>
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
 
@@ -61,16 +63,26 @@ ext2_last_byte(struct inode *inode, unsi
return last_byte;
 }
 
-static int ext2_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-   struct inode *dir = page->mapping->host;
+   struct address_space *mapping = page->mapping;
+   struct inode *dir = mapping->host;
int err = 0;
+
dir->i_version++;
-   page->mapping->a_ops->commit_write(NULL, page, from, to);
+   block_write_end(NULL, mapping, pos, len, len, page, NULL);
+
+   if (pos+len > dir->i_size) {
+   i_size_write(dir, pos+len);
+   mark_inode_dirty(dir);
+   }
+
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
unlock_page(page);
+   mark_page_accessed(page);
+
return err;
 }
 
@@ -412,16 +424,18 @@ ino_t ext2_inode_by_name(struct inode * 
 void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
struct page *page, struct inode *inode)
 {
-   unsigned from = (char *) de - (char *) page_address(page);
-   unsigned to = from + le16_to_cpu(de->rec_len);
+   loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+   (char *) de - (char *) page_address(page);
+   unsigned len = le16_to_cpu(de->rec_len);
int err;
 
lock_page(page);
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __ext2_write_begin(NULL, page->mapping, pos, len,
+   AOP_FLAG_UNINTERRUPTIBLE, page, NULL);
BUG_ON(err);
de->inode = cpu_to_le32(inode->i_ino);
-   ext2_set_de_type (de, inode);
-   err = ext2_commit_chunk(page, from, to);
+   ext2_set_de_type(de, inode);
+   err = ext2_commit_chunk(page, pos, len);
ext2_put_page(page);

[patch 30/41] reiserfs use generic_cont_expand_simple

2007-05-25 Thread npiggin
From: Vladimir Saveliev [EMAIL PROTECTED]

This patch makes reiserfs use AOP_FLAG_CONT_EXPAND
in order to get rid of the special generic_cont_expand routine.
 
Signed-off-by: Vladimir Saveliev [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

---
---
 fs/reiserfs/inode.c |   13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/fs/reiserfs/inode.c
===
--- linux-2.6.orig/fs/reiserfs/inode.c
+++ linux-2.6/fs/reiserfs/inode.c
@@ -2562,13 +2562,20 @@ static int reiserfs_write_begin(struct f
int ret;
int old_ref = 0;
 
+   inode = mapping->host;
+   *fsdata = 0;
+   if (flags & AOP_FLAG_CONT_EXPAND &&
+   (pos & (inode->i_sb->s_blocksize - 1)) == 0) {
+   pos ++;
+   *fsdata = (void *)(unsigned long)flags;
+   }
+
index = pos >> PAGE_CACHE_SHIFT;
page = __grab_cache_page(mapping, index);
if (!page)
return -ENOMEM;
*pagep = page;
 
-   inode = mapping->host;
reiserfs_wait_on_write_block(inode->i_sb);
fix_tail_page_for_writing(page);
if (reiserfs_transaction_running(inode->i_sb)) {
@@ -2678,6 +2685,8 @@ static int reiserfs_write_end(struct fil
struct reiserfs_transaction_handle *th;
unsigned start;
 
+   if ((unsigned long)fsdata & AOP_FLAG_CONT_EXPAND)
+   pos ++;
 
reiserfs_wait_on_write_block(inode->i_sb);
if (reiserfs_transaction_running(inode->i_sb))
@@ -3066,7 +3075,7 @@ int reiserfs_setattr(struct dentry *dent
}
/* fill in hole pointers in the expanding truncate case. */
if (attr->ia_size > inode->i_size) {
-   error = generic_cont_expand(inode, attr->ia_size);
+   error = generic_cont_expand_simple(inode, 
attr->ia_size);
if (REISERFS_I(inode)->i_prealloc_count > 0) {
int err;
struct reiserfs_transaction_handle th;

-- 



[patch 34/41] fuse convert to new aops.

2007-05-25 Thread npiggin
[mszeredi]
 - don't send zero length write requests
 - it is not legal for the filesystem to return with zero written bytes

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

 fs/fuse/file.c |   48 +---
 1 file changed, 33 insertions(+), 15 deletions(-)

Index: linux-2.6/fs/fuse/file.c
===
--- linux-2.6.orig/fs/fuse/file.c
+++ linux-2.6/fs/fuse/file.c
@@ -444,22 +444,25 @@ static size_t fuse_send_write(struct fus
return outarg.size;
 }
 
-static int fuse_prepare_write(struct file *file, struct page *page,
- unsigned offset, unsigned to)
-{
-   /* No op */
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+   *pagep = __grab_cache_page(mapping, index);
+   if (!*pagep)
+   return -ENOMEM;
return 0;
 }
 
-static int fuse_commit_write(struct file *file, struct page *page,
-unsigned offset, unsigned to)
+static int fuse_buffered_write(struct file *file, struct inode *inode,
+  loff_t pos, unsigned count, struct page *page)
 {
int err;
size_t nres;
-   unsigned count = to - offset;
-   struct inode *inode = page->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
-   loff_t pos = page_offset(page) + offset;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
struct fuse_req *req;
 
if (is_bad_inode(inode))
@@ -475,20 +478,35 @@ static int fuse_commit_write(struct file
nres = fuse_send_write(req, file, inode, pos, count);
err = req->out.h.error;
fuse_put_request(fc, req);
-   if (!err && nres != count)
+   if (!err && !nres)
err = -EIO;
if (!err) {
-   pos += count;
+   pos += nres;
spin_lock(&fc->lock);
if (pos > inode->i_size)
i_size_write(inode, pos);
spin_unlock(&fc->lock);
 
-   if (offset == 0 && to == PAGE_CACHE_SIZE)
+   if (count == PAGE_CACHE_SIZE)
SetPageUptodate(page);
}
fuse_invalidate_attr(inode);
-   return err;
+   return err ? err : nres;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
+{
+   struct inode *inode = mapping->host;
+   int res = 0;
+
+   if (copied)
+   res = fuse_buffered_write(file, inode, pos, copied, page);
+
+   unlock_page(page);
+   page_cache_release(page);
+   return res;
 }
 
 static void fuse_release_user_pages(struct fuse_req *req, int write)
@@ -819,8 +837,8 @@ static const struct file_operations fuse
 
 static const struct address_space_operations fuse_file_aops  = {
.readpage   = fuse_readpage,
-   .prepare_write  = fuse_prepare_write,
-   .commit_write   = fuse_commit_write,
+   .write_begin= fuse_write_begin,
+   .write_end  = fuse_write_end,
.readpages  = fuse_readpages,
.set_page_dirty = fuse_set_page_dirty,
.bmap   = fuse_bmap,

-- 



[patch 14/41] implement simple fs aops

2007-05-25 Thread npiggin
Implement new aops for some of the simpler filesystems.

Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/configfs/inode.c   |4 ++--
 fs/hugetlbfs/inode.c  |   16 ++--
 fs/ramfs/file-mmu.c   |4 ++--
 fs/ramfs/file-nommu.c |4 ++--
 fs/sysfs/inode.c  |4 ++--
 mm/shmem.c|   35 ---
 6 files changed, 46 insertions(+), 21 deletions(-)

Index: linux-2.6/mm/shmem.c
===
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1107,7 +1107,7 @@ static int shmem_getpage(struct inode *i
 * Normally, filepage is NULL on entry, and either found
 * uptodate immediately, or allocated and zeroed, or read
 * in under swappage, which is then assigned to filepage.
-* But shmem_prepare_write passes in a locked filepage,
+* But shmem_write_begin passes in a locked filepage,
 * which may be found not uptodate by other callers too,
 * and may need to be copied from the swappage read in.
 */
@@ -1452,14 +1452,35 @@ static const struct inode_operations shm
 static const struct inode_operations shmem_symlink_inline_operations;
 
 /*
- * Normally tmpfs makes no use of shmem_prepare_write, but it
+ * Normally tmpfs makes no use of shmem_write_begin, but it
  * lets a tmpfs file be used read-write below the loop driver.
  */
 static int
-shmem_prepare_write(struct file *file, struct page *page, unsigned offset, 
unsigned to)
+shmem_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   struct inode *inode = mapping->host;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   *pagep = NULL;
+   return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+}
+
+static int
+shmem_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   struct inode *inode = page->mapping->host;
-   return shmem_getpage(inode, page->index, &page, SGP_WRITE, NULL);
+   struct inode *inode = mapping->host;
+
+   set_page_dirty(page);
+   mark_page_accessed(page);
+   page_cache_release(page);
+
+   if (pos+copied > inode->i_size)
+   i_size_write(inode, pos+copied);
+
+   return copied;
 }
 
 static ssize_t
@@ -2353,8 +2374,8 @@ static const struct address_space_operat
.writepage  = shmem_writepage,
.set_page_dirty = __set_page_dirty_no_writeback,
 #ifdef CONFIG_TMPFS
-   .prepare_write  = shmem_prepare_write,
-   .commit_write   = simple_commit_write,
+   .write_begin= shmem_write_begin,
+   .write_end  = shmem_write_end,
 #endif
.migratepage= migrate_page,
 };
Index: linux-2.6/fs/configfs/inode.c
===
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -41,8 +41,8 @@ extern struct super_block * configfs_sb;
 
 static const struct address_space_operations configfs_aops = {
.readpage   = simple_readpage,
-   .prepare_write  = simple_prepare_write,
-   .commit_write   = simple_commit_write
+   .write_begin= simple_write_begin,
+   .write_end  = simple_write_end,
 };
 
 static struct backing_dev_info configfs_backing_dev_info = {
Index: linux-2.6/fs/sysfs/inode.c
===
--- linux-2.6.orig/fs/sysfs/inode.c
+++ linux-2.6/fs/sysfs/inode.c
@@ -21,8 +21,8 @@ extern struct super_block * sysfs_sb;
 
 static const struct address_space_operations sysfs_aops = {
.readpage   = simple_readpage,
-   .prepare_write  = simple_prepare_write,
-   .commit_write   = simple_commit_write
+   .write_begin= simple_write_begin,
+   .write_end  = simple_write_end,
 };
 
 static struct backing_dev_info sysfs_backing_dev_info = {
Index: linux-2.6/fs/ramfs/file-mmu.c
===
--- linux-2.6.orig/fs/ramfs/file-mmu.c
+++ linux-2.6/fs/ramfs/file-mmu.c
@@ -29,8 +29,8 @@
 
 const struct address_space_operations ramfs_aops = {
.readpage   = simple_readpage,
-   .prepare_write  = simple_prepare_write,
-   .commit_write   = simple_commit_write,
+   .write_begin= simple_write_begin,
+   .write_end  = simple_write_end,
.set_page_dirty = __set_page_dirty_no_writeback,
 };
 
Index: linux-2.6/fs/ramfs/file-nommu.c
===
--- linux-2.6.orig/fs/ramfs/file-nommu.c
+++ linux-2.6/fs/ramfs/file-nommu.c
@@ -29,8 +29,8 @@ static int ramfs_nommu_setattr(struct de
 
 const struct address_space_operations 

[patch 32/41] nfs convert to new aops.

2007-05-25 Thread npiggin
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Acked-by: Trond Myklebust [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/nfs/file.c |   49 -
 1 file changed, 36 insertions(+), 13 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -283,27 +283,50 @@ nfs_fsync(struct file *file, struct dent
 }
 
 /*
- * This does the real work of the write. The generic routine has
- * allocated the page, locked it, done all the page alignment stuff
- * calculations etc. Now we should just copy the data from user
- * space and write it back to the real medium..
+ * This does the real work of the write. We must allocate and lock the
+ * page to be sent back to the generic routine, which then copies the
+ * data from user space.
  *
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
  */
-static int nfs_prepare_write(struct file *file, struct page *page, unsigned 
offset, unsigned to)
-{
-   return nfs_flush_incompatible(file, page);
+static int nfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   int ret;
+   pgoff_t index;
+   struct page *page;
+   index = pos >> PAGE_CACHE_SHIFT;
+
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
+
+   ret = nfs_flush_incompatible(file, page);
+   if (ret) {
+   unlock_page(page);
+   page_cache_release(page);
+   }
+   return ret;
 }
 
-static int nfs_commit_write(struct file *file, struct page *page, unsigned 
offset, unsigned to)
+static int nfs_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   long status;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+   int status;
 
lock_kernel();
-   status = nfs_updatepage(file, page, offset, to-offset);
+   status = nfs_updatepage(file, page, offset, copied);
unlock_kernel();
-   return status;
+
+   unlock_page(page);
+   page_cache_release(page);
+
+   return status < 0 ? status : copied;
 }
 
 static void nfs_invalidate_page(struct page *page, unsigned long offset)
@@ -331,8 +354,8 @@ const struct address_space_operations nf
.set_page_dirty = nfs_set_page_dirty,
.writepage = nfs_writepage,
.writepages = nfs_writepages,
-   .prepare_write = nfs_prepare_write,
-   .commit_write = nfs_commit_write,
+   .write_begin = nfs_write_begin,
+   .write_end = nfs_write_end,
.invalidatepage = nfs_invalidate_page,
.releasepage = nfs_release_page,
 #ifdef CONFIG_NFS_DIRECTIO

-- 



[patch 15/41] block_dev convert to new aops.

2007-05-25 Thread npiggin
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/block_dev.c |   26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/block_dev.c
===
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -380,14 +380,26 @@ static int blkdev_readpage(struct file *
return block_read_full_page(page, blkdev_get_block);
 }
 
-static int blkdev_prepare_write(struct file *file, struct page *page, unsigned 
from, unsigned to)
-{
-   return block_prepare_write(page, from, to, blkdev_get_block);
+static int blkdev_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   blkdev_get_block);
 }
 
-static int blkdev_commit_write(struct file *file, struct page *page, unsigned 
from, unsigned to)
+static int blkdev_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   return block_commit_write(page, from, to);
+   int ret;
+   ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+   unlock_page(page);
+   page_cache_release(page);
+
+   return ret;
 }
 
 /*
@@ -1331,8 +1343,8 @@ const struct address_space_operations de
.readpage   = blkdev_readpage,
.writepage  = blkdev_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = blkdev_prepare_write,
-   .commit_write   = blkdev_commit_write,
+   .write_begin= blkdev_write_begin,
+   .write_end  = blkdev_write_end,
.writepages = generic_writepages,
.direct_IO  = blkdev_direct_IO,
 };

-- 



[patch 17/41] ext3 convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]


Various fixes and improvements

Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

 fs/ext3/inode.c |  136 
 1 file changed, 88 insertions(+), 48 deletions(-)

Index: linux-2.6/fs/ext3/inode.c
===
--- linux-2.6.orig/fs/ext3/inode.c
+++ linux-2.6/fs/ext3/inode.c
@@ -1147,51 +1147,68 @@ static int do_journal_get_write_access(h
return ext3_journal_get_write_access(handle, bh);
 }
 
-static int ext3_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ext3_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = mapping->host;
int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
handle_t *handle;
int retries = 0;
+   struct page *page;
+   pgoff_t index;
+   unsigned from, to;
+
+   index = pos >> PAGE_CACHE_SHIFT;
+   from = pos & (PAGE_CACHE_SIZE - 1);
+   to = from + len;
 
 retry:
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
+
handle = ext3_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
+   unlock_page(page);
+   page_cache_release(page);
ret = PTR_ERR(handle);
goto out;
}
-   if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
-   ret = nobh_prepare_write(page, from, to, ext3_get_block);
-   else
-   ret = block_prepare_write(page, from, to, ext3_get_block);
+   ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ext3_get_block);
if (ret)
-   goto prepare_write_failed;
+   goto write_begin_failed;
 
if (ext3_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, do_journal_get_write_access);
}
-prepare_write_failed:
-   if (ret)
+write_begin_failed:
+   if (ret) {
ext3_journal_stop(handle);
+   unlock_page(page);
+   page_cache_release(page);
+   }
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, retries))
goto retry;
 out:
return ret;
 }
 
+
 int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
 {
int err = journal_dirty_data(handle, bh);
if (err)
ext3_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-   bh, handle,err);
+   bh, handle, err);
return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
if (!buffer_mapped(bh) || buffer_freed(bh))
return 0;
@@ -1206,78 +1223,100 @@ static int commit_write_fn(handle_t *han
  * ext3 never places buffers on inode-i_mapping-private_list.  metadata
  * buffers are managed internally.
  */
-static int ext3_ordered_commit_write(struct file *file, struct page *page,
-unsigned from, unsigned to)
+static int ext3_ordered_write_end(struct file *file,
+   struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
handle_t *handle = ext3_journal_current_handle();
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = file->f_mapping->host;
+   unsigned from, to;
int ret = 0, ret2;
 
+   from = pos & (PAGE_CACHE_SIZE - 1);
+   to = from + len;
+
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, ext3_journal_dirty_data);
 
if (ret == 0) {
/*
-* generic_commit_write() will run mark_inode_dirty() if i_size
+* generic_write_end() will run mark_inode_dirty() if i_size
 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 * into that.
 */
loff_t new_i_size;
 
-   new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+   new_i_size = pos + copied;
if (new_i_size > EXT3_I(inode)->i_disksize)
  

[patch 26/41] bfs convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/bfs/file.c |   12 
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/bfs/file.c
===
--- linux-2.6.orig/fs/bfs/file.c
+++ linux-2.6/fs/bfs/file.c
@@ -145,9 +145,13 @@ static int bfs_readpage(struct file *fil
return block_read_full_page(page, bfs_get_block);
 }
 
-static int bfs_prepare_write(struct file *file, struct page *page, unsigned 
from, unsigned to)
+static int bfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page, from, to, bfs_get_block);
+   *pagep = NULL;
+   return block_write_begin(file, mapping, pos, len, flags,
+   pagep, fsdata, bfs_get_block);
 }
 
 static sector_t bfs_bmap(struct address_space *mapping, sector_t block)
@@ -159,8 +163,8 @@ const struct address_space_operations bf
.readpage   = bfs_readpage,
.writepage  = bfs_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = bfs_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= bfs_write_begin,
+   .write_end  = generic_write_end,
.bmap   = bfs_bmap,
 };
 

-- 



[patch 38/41] udf convert to new aops.

2007-05-25 Thread npiggin
Convert udf to new aops. This also seems to have fixed pagecache corruption in
udf_adinicb_commit_write -- the page was marked uptodate when it was not. Also
fixed the silly setup where prepare_write was doing a kmap to be used in
commit_write: just do kmap_atomic in write_end. Use libfs helpers to make
this easier.

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/udf/file.c  |   32 +---
 fs/udf/inode.c |   11 +++
 2 files changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/udf/file.c
===
--- linux-2.6.orig/fs/udf/file.c
+++ linux-2.6/fs/udf/file.c
@@ -74,34 +74,28 @@ static int udf_adinicb_writepage(struct 
return 0;
 }
 
-static int udf_adinicb_prepare_write(struct file *file, struct page *page, 
unsigned offset, unsigned to)
+static int udf_adinicb_write_end(struct file *file, struct address_space 
*mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   kmap(page);
-   return 0;
-}
-
-static int udf_adinicb_commit_write(struct file *file, struct page *page, 
unsigned offset, unsigned to)
-{
-   struct inode *inode = page->mapping->host;
-   char *kaddr = page_address(page);
+   struct inode *inode = mapping->host;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+   char *kaddr;
 
+   kaddr = kmap_atomic(page, KM_USER0);
memcpy(UDF_I_DATA(inode) + UDF_I_LENEATTR(inode) + offset,
-   kaddr + offset, to - offset);
-   mark_inode_dirty(inode);
-   SetPageUptodate(page);
-   kunmap(page);
-   /* only one page here */
-   if (to > inode->i_size)
-   inode->i_size = to;
-   return 0;
+   kaddr + offset, copied);
+   kunmap_atomic(kaddr, KM_USER0);
+
+   return simple_write_end(file, mapping, pos, len, copied, page, fsdata);
 }
 
 const struct address_space_operations udf_adinicb_aops = {
.readpage   = udf_adinicb_readpage,
.writepage  = udf_adinicb_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = udf_adinicb_prepare_write,
-   .commit_write   = udf_adinicb_commit_write,
+   .write_begin= simple_write_begin,
+   .write_end  = udf_adinicb_write_end,
 };
 
 static ssize_t udf_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
Index: linux-2.6/fs/udf/inode.c
===
--- linux-2.6.orig/fs/udf/inode.c
+++ linux-2.6/fs/udf/inode.c
@@ -123,9 +123,12 @@ static int udf_readpage(struct file *fil
return block_read_full_page(page, udf_get_block);
 }
 
-static int udf_prepare_write(struct file *file, struct page *page, unsigned 
from, unsigned to)
+static int udf_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page, from, to, udf_get_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   udf_get_block);
 }
 
 static sector_t udf_bmap(struct address_space *mapping, sector_t block)
@@ -137,8 +140,8 @@ const struct address_space_operations ud
.readpage   = udf_readpage,
.writepage  = udf_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = udf_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= udf_write_begin,
+   .write_end  = generic_write_end,
.bmap   = udf_bmap,
 };
 



[patch 37/41] ufs convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/ufs/dir.c   |   56 +---
 fs/ufs/inode.c |   23 +++
 2 files changed, 56 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/ufs/inode.c
===
--- linux-2.6.orig/fs/ufs/inode.c
+++ linux-2.6/fs/ufs/inode.c
@@ -558,24 +558,39 @@ static int ufs_writepage(struct page *pa
 {
return block_write_full_page(page,ufs_getfrag_block,wbc);
 }
+
 static int ufs_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,ufs_getfrag_block);
 }
-static int ufs_prepare_write(struct file *file, struct page *page, unsigned 
from, unsigned to)
+
+int __ufs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page,from,to,ufs_getfrag_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ufs_getfrag_block);
 }
+
+static int ufs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return __ufs_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t ufs_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,ufs_getfrag_block);
 }
+
 const struct address_space_operations ufs_aops = {
.readpage = ufs_readpage,
.writepage = ufs_writepage,
.sync_page = block_sync_page,
-   .prepare_write = ufs_prepare_write,
-   .commit_write = generic_commit_write,
+   .write_begin = ufs_write_begin,
+   .write_end = generic_write_end,
.bmap = ufs_bmap
 };
 
Index: linux-2.6/fs/ufs/dir.c
===
--- linux-2.6.orig/fs/ufs/dir.c
+++ linux-2.6/fs/ufs/dir.c
@@ -19,6 +19,7 @@
 #include linux/time.h
 #include linux/fs.h
 #include linux/ufs_fs.h
+#include linux/swap.h
 
 #include swab.h
 #include util.h
@@ -38,16 +39,23 @@ static inline int ufs_match(struct super
return !memcmp(name, de-d_name, len);
 }
 
-static int ufs_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-   struct inode *dir = page-mapping-host;
+   struct address_space *mapping = page-mapping;
+   struct inode *dir = mapping-host;
int err = 0;
+
dir-i_version++;
-   page-mapping-a_ops-commit_write(NULL, page, from, to);
+   block_write_end(NULL, mapping, pos, len, len, page, NULL);
+   if (pos+len  dir-i_size) {
+   i_size_write(dir, pos+len);
+   mark_inode_dirty(dir);
+   }
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
unlock_page(page);
+   mark_page_accessed(page);
return err;
 }
 
@@ -81,16 +89,20 @@ ino_t ufs_inode_by_name(struct inode *di
 void ufs_set_link(struct inode *dir, struct ufs_dir_entry *de,
  struct page *page, struct inode *inode)
 {
-   unsigned from = (char *) de - (char *) page_address(page);
-   unsigned to = from + fs16_to_cpu(dir-i_sb, de-d_reclen);
+   loff_t pos = (page-index  PAGE_CACHE_SHIFT) +
+   (char *) de - (char *) page_address(page);
+   unsigned len = fs16_to_cpu(dir-i_sb, de-d_reclen);
int err;
 
lock_page(page);
-   err = page-mapping-a_ops-prepare_write(NULL, page, from, to);
+   err = __ufs_write_begin(NULL, page-mapping, pos, len,
+   AOP_FLAG_UNINTERRUPTIBLE, page, NULL);
BUG_ON(err);
+
de-d_ino = cpu_to_fs32(dir-i_sb, inode-i_ino);
ufs_set_de_type(dir-i_sb, de, inode-i_mode);
-   err = ufs_commit_chunk(page, from, to);
+
+   err = ufs_commit_chunk(page, pos, len);
ufs_put_page(page);
dir-i_mtime = dir-i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(dir);
@@ -312,7 +324,7 @@ int ufs_add_link(struct dentry *dentry, 
unsigned long npages = ufs_dir_pages(dir);
unsigned long n;
char *kaddr;
-   unsigned from, to;
+   loff_t pos;
int err;
 
UFSD(ENTER, name %s, namelen %u\n, name, namelen);
@@ -367,9 +379,10 @@ int ufs_add_link(struct dentry *dentry, 
return -EINVAL;
 
 got_it:
-   from = (char*)de - (char*)page_address(page);
-   to = from + rec_len;
-   err = page-mapping-a_ops-prepare_write(NULL, page, from, to);
+   pos = (page-index  PAGE_CACHE_SHIFT) +
+   (char*)de - (char*)page_address(page);
+  

[patch 18/41] ext4 convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Convert ext4 to use write_begin()/write_end() methods.

Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

 fs/ext4/inode.c |  147 +++-
 1 file changed, 93 insertions(+), 54 deletions(-)

Index: linux-2.6/fs/ext4/inode.c
===
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -1146,34 +1146,50 @@ static int do_journal_get_write_access(h
return ext4_journal_get_write_access(handle, bh);
 }
 
-static int ext4_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ext4_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   struct inode *inode = page-mapping-host;
+   struct inode *inode = mapping-host;
int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
handle_t *handle;
int retries = 0;
+   struct page *page;
+   pgoff_t index;
+   unsigned from, to;
+
+   index = pos  PAGE_CACHE_SHIFT;
+   from = pos  (PAGE_CACHE_SIZE - 1);
+   to = from + len;
 
 retry:
-   handle = ext4_journal_start(inode, needed_blocks);
-   if (IS_ERR(handle)) {
-   ret = PTR_ERR(handle);
-   goto out;
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
+
+   handle = ext4_journal_start(inode, needed_blocks);
+   if (IS_ERR(handle)) {
+   unlock_page(page);
+   page_cache_release(page);
+   ret = PTR_ERR(handle);
+   goto out;
}
-   if (test_opt(inode-i_sb, NOBH)  ext4_should_writeback_data(inode))
-   ret = nobh_prepare_write(page, from, to, ext4_get_block);
-   else
-   ret = block_prepare_write(page, from, to, ext4_get_block);
-   if (ret)
-   goto prepare_write_failed;
 
-   if (ext4_should_journal_data(inode)) {
+   ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ext4_get_block);
+
+   if (!ret  ext4_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, do_journal_get_write_access);
}
-prepare_write_failed:
-   if (ret)
+
+   if (ret) {
ext4_journal_stop(handle);
+   unlock_page(page);
+   page_cache_release(page);
+   }
+
if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, retries))
goto retry;
 out:
@@ -1185,12 +1201,12 @@ int ext4_journal_dirty_data(handle_t *ha
int err = jbd2_journal_dirty_data(handle, bh);
if (err)
ext4_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-   bh, handle,err);
+   bh, handle, err);
return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
if (!buffer_mapped(bh) || buffer_freed(bh))
return 0;
@@ -1205,78 +1221,100 @@ static int commit_write_fn(handle_t *han
  * ext4 never places buffers on inode-i_mapping-private_list.  metadata
  * buffers are managed internally.
  */
-static int ext4_ordered_commit_write(struct file *file, struct page *page,
-unsigned from, unsigned to)
+static int ext4_ordered_write_end(struct file *file,
+   struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
handle_t *handle = ext4_journal_current_handle();
-   struct inode *inode = page-mapping-host;
+   struct inode *inode = file-f_mapping-host;
+   unsigned from, to;
int ret = 0, ret2;
 
+   from = pos  (PAGE_CACHE_SIZE - 1);
+   to = from + len;
+
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, ext4_journal_dirty_data);
 
if (ret == 0) {
/*
-* generic_commit_write() will run mark_inode_dirty() if i_size
+* generic_write_end() will run mark_inode_dirty() if i_size
 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 * into that.
 */
loff_t new_i_size;
 
-   new_i_size = ((loff_t)page-index  PAGE_CACHE_SHIFT) + to;
+ 

[patch 04/41] mm: clean up buffered write code

2007-05-25 Thread npiggin
From: Andrew Morton [EMAIL PROTECTED]

Rename some variables and fix some types.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   35 ++-
 1 file changed, 18 insertions(+), 17 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1912,16 +1912,15 @@ generic_file_buffered_write(struct kiocb
size_t count, ssize_t written)
 {
struct file *file = iocb-ki_filp;
-   struct address_space * mapping = file-f_mapping;
+   struct address_space *mapping = file-f_mapping;
const struct address_space_operations *a_ops = mapping-a_ops;
struct inode*inode = mapping-host;
longstatus = 0;
struct page *page;
struct page *cached_page = NULL;
-   size_t  bytes;
struct pagevec  lru_pvec;
const struct iovec *cur_iov = iov; /* current iovec */
-   size_t  iov_base = 0;  /* offset in the current iovec */
+   size_t  iov_offset = 0;/* offset in the current iovec */
char __user *buf;
 
pagevec_init(lru_pvec, 0);
@@ -1932,31 +1931,33 @@ generic_file_buffered_write(struct kiocb
if (likely(nr_segs == 1))
buf = iov-iov_base + written;
else {
-   filemap_set_next_iovec(cur_iov, iov_base, written);
-   buf = cur_iov-iov_base + iov_base;
+   filemap_set_next_iovec(cur_iov, iov_offset, written);
+   buf = cur_iov-iov_base + iov_offset;
}
 
do {
-   unsigned long index;
-   unsigned long offset;
-   unsigned long maxlen;
-   size_t copied;
+   pgoff_t index;  /* Pagecache index for current page */
+   unsigned long offset;   /* Offset into pagecache page */
+   unsigned long maxlen;   /* Bytes remaining in current iovec */
+   size_t bytes;   /* Bytes to write to page */
+   size_t copied;  /* Bytes copied from user */
 
-   offset = (pos  (PAGE_CACHE_SIZE -1)); /* Within page */
+   offset = (pos  (PAGE_CACHE_SIZE - 1));
index = pos  PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
if (bytes  count)
bytes = count;
 
+   maxlen = cur_iov-iov_len - iov_offset;
+   if (maxlen  bytes)
+   maxlen = bytes;
+
/*
 * Bring in the user page that we will copy from _first_.
 * Otherwise there's a nasty deadlock on copying from the
 * same page as we're writing to, without it being marked
 * up-to-date.
 */
-   maxlen = cur_iov-iov_len - iov_base;
-   if (maxlen  bytes)
-   maxlen = bytes;
fault_in_pages_readable(buf, maxlen);
 
page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
@@ -1987,7 +1988,7 @@ generic_file_buffered_write(struct kiocb
buf, bytes);
else
copied = filemap_copy_from_user_iovec(page, offset,
-   cur_iov, iov_base, bytes);
+   cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops-commit_write(file, page, offset, offset+bytes);
if (status == AOP_TRUNCATED_PAGE) {
@@ -2005,12 +2006,12 @@ generic_file_buffered_write(struct kiocb
buf += status;
if (unlikely(nr_segs  1)) {
filemap_set_next_iovec(cur_iov,
-   iov_base, status);
+   iov_offset, status);
if (count)
buf = cur_iov-iov_base +
-   iov_base;
+   iov_offset;
} else {
-   iov_base += status;
+   iov_offset += status;
}
}
}



[patch 29/41] reiserfs convert to new aops.

2007-05-25 Thread npiggin
From: Vladimir Saveliev [EMAIL PROTECTED]

Convert reiserfs to new aops

Signed-off-by: Vladimir Saveliev [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

---
 fs/reiserfs/inode.c |  177 +---
 fs/reiserfs/ioctl.c |   10 +-
 fs/reiserfs/xattr.c |   16 +++-
 3 files changed, 184 insertions(+), 19 deletions(-)

Index: linux-2.6/fs/reiserfs/inode.c
===
--- linux-2.6.orig/fs/reiserfs/inode.c
+++ linux-2.6/fs/reiserfs/inode.c
@@ -17,11 +17,12 @@
 #include linux/mpage.h
 #include linux/writeback.h
 #include linux/quotaops.h
+#include linux/swap.h
 
-static int reiserfs_commit_write(struct file *f, struct page *page,
-unsigned from, unsigned to);
-static int reiserfs_prepare_write(struct file *f, struct page *page,
- unsigned from, unsigned to);
+int reiserfs_commit_write(struct file *f, struct page *page,
+ unsigned from, unsigned to);
+int reiserfs_prepare_write(struct file *f, struct page *page,
+  unsigned from, unsigned to);
 
 void reiserfs_delete_inode(struct inode *inode)
 {
@@ -2550,8 +2551,71 @@ static int reiserfs_writepage(struct pag
return reiserfs_write_full_page(page, wbc);
 }
 
-static int reiserfs_prepare_write(struct file *f, struct page *page,
- unsigned from, unsigned to)
+static int reiserfs_write_begin(struct file *file,
+   struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   struct inode *inode;
+   struct page *page;
+   pgoff_t index;
+   int ret;
+   int old_ref = 0;
+
+   index = pos  PAGE_CACHE_SHIFT;
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
+
+   inode = mapping-host;
+   reiserfs_wait_on_write_block(inode-i_sb);
+   fix_tail_page_for_writing(page);
+   if (reiserfs_transaction_running(inode-i_sb)) {
+   struct reiserfs_transaction_handle *th;
+   th = (struct reiserfs_transaction_handle *)current-
+   journal_info;
+   BUG_ON(!th-t_refcount);
+   BUG_ON(!th-t_trans_id);
+   old_ref = th-t_refcount;
+   th-t_refcount++;
+   }
+   ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   reiserfs_get_block);
+   if (ret  reiserfs_transaction_running(inode-i_sb)) {
+   struct reiserfs_transaction_handle *th = current-journal_info;
+   /* this gets a little ugly.  If reiserfs_get_block returned an
+    * error and left a transaction running, we've got to close it,
+    * and we've got to free the handle if it was a persistent
+    * transaction.
+    *
+    * But, if we had nested into an existing transaction, we need
+    * to just drop the ref count on the handle.
+    *
+    * If old_ref == 0, the transaction is from reiserfs_get_block,
+    * and it was a persistent trans.  Otherwise, it was nested
+    * above.
+    */
+   if (th-t_refcount  old_ref) {
+   if (old_ref)
+   th-t_refcount--;
+   else {
+   int err;
+   reiserfs_write_lock(inode-i_sb);
+   err = reiserfs_end_persistent_transaction(th);
+   reiserfs_write_unlock(inode-i_sb);
+   if (err)
+   ret = err;
+   }
+   }
+   }
+   if (ret) {
+   unlock_page(page);
+   page_cache_release(page);
+   }
+   return ret;
+}
+
+int reiserfs_prepare_write(struct file *f, struct page *page,
+  unsigned from, unsigned to)
 {
struct inode *inode = page-mapping-host;
int ret;
@@ -2604,8 +2668,101 @@ static sector_t reiserfs_aop_bmap(struct
return generic_block_bmap(as, block, reiserfs_bmap);
 }
 
-static int reiserfs_commit_write(struct file *f, struct page *page,
-unsigned from, unsigned to)
+static int reiserfs_write_end(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+   struct inode *inode = page-mapping-host;
+   int ret = 0;
+   int update_sd = 0;
+   struct reiserfs_transaction_handle *th;
+   unsigned start;
+
+
+   reiserfs_wait_on_write_block(inode-i_sb);
+   

[patch 33/41] smb convert to new aops.

2007-05-25 Thread npiggin
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/smbfs/file.c |   34 +-
 1 file changed, 25 insertions(+), 9 deletions(-)

Index: linux-2.6/fs/smbfs/file.c
===
--- linux-2.6.orig/fs/smbfs/file.c
+++ linux-2.6/fs/smbfs/file.c
@@ -291,29 +291,45 @@ out:
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
  */
-static int smb_prepare_write(struct file *file, struct page *page, 
-unsigned offset, unsigned to)
-{
+static int smb_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   pgoff_t index = pos  PAGE_CACHE_SHIFT;
+   *pagep = __grab_cache_page(mapping, index);
+   if (!*pagep)
+   return -ENOMEM;
return 0;
 }
 
-static int smb_commit_write(struct file *file, struct page *page,
-   unsigned offset, unsigned to)
+static int smb_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
int status;
+   unsigned offset = pos  (PAGE_CACHE_SIZE - 1);
 
-   status = -EFAULT;
lock_kernel();
-   status = smb_updatepage(file, page, offset, to-offset);
+   status = smb_updatepage(file, page, offset, copied);
unlock_kernel();
+
+   if (!status) {
+   if (!PageUptodate(page)  copied == PAGE_CACHE_SIZE)
+   SetPageUptodate(page);
+   status = copied;
+   }
+
+   unlock_page(page);
+   page_cache_release(page);
+
return status;
 }
 
 const struct address_space_operations smb_file_aops = {
.readpage = smb_readpage,
.writepage = smb_writepage,
-   .prepare_write = smb_prepare_write,
-   .commit_write = smb_commit_write
+   .write_begin = smb_write_begin,
+   .write_end = smb_write_end,
 };
 
 /* 



[patch 10/41] mm: buffered write iterator

2007-05-25 Thread npiggin

Add an iterator data structure to operate over an iovec. Add usercopy
operators needed by generic_file_buffered_write, and convert that function
over.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
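
The intended calling convention is sketched below. This is illustrative only
(the real user is generic_file_buffered_write, converted further down);
'page', 'offset' and 'bytes' stand for the current locked pagecache page and
the range being written, and error handling is omitted.

        struct iov_iter i;
        size_t copied;

        /* iterate over 'count' bytes, of which 'written' are already done */
        iov_iter_init(&i, iov, nr_segs, count, written);

        while (iov_iter_count(&i)) {
                /* pre-fault the source so the atomic copy is likely to work */
                if (iov_iter_fault_in_readable(&i))
                        break;          /* caller reports -EFAULT */

                /* the page is locked, so copy without taking page faults */
                copied = iov_iter_copy_from_user_atomic(page, &i, offset, bytes);

                /* advances iov/iov_offset and decreases the remaining count */
                iov_iter_advance(&i, copied);

                /* ... commit 'copied' bytes, move to the next page ... */
        }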

 include/linux/fs.h |   33 
 mm/filemap.c   |  144 +++--
 mm/filemap.h   |  103 -
 3 files changed, 150 insertions(+), 130 deletions(-)

Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -416,6 +416,39 @@ struct page;
 struct address_space;
 struct writeback_control;
 
+struct iov_iter {
+   const struct iovec *iov;
+   unsigned long nr_segs;
+   size_t iov_offset;
+   size_t count;
+};
+
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes);
+size_t iov_iter_copy_from_user(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes);
+void iov_iter_advance(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i);
+size_t iov_iter_single_seg_count(struct iov_iter *i);
+
+static inline void iov_iter_init(struct iov_iter *i,
+   const struct iovec *iov, unsigned long nr_segs,
+   size_t count, size_t written)
+{
+   i-iov = iov;
+   i-nr_segs = nr_segs;
+   i-iov_offset = 0;
+   i-count = count + written;
+
+   iov_iter_advance(i, written);
+}
+
+static inline size_t iov_iter_count(struct iov_iter *i)
+{
+   return i-count;
+}
+
+
 struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -30,7 +30,7 @@
 #include linux/security.h
 #include linux/syscalls.h
 #include linux/cpuset.h
-#include filemap.h
+#include linux/hardirq.h /* for BUG_ON(!in_atomic()) only */
 #include internal.h
 
 /*
@@ -1707,8 +1707,7 @@ int remove_suid(struct dentry *dentry)
 }
 EXPORT_SYMBOL(remove_suid);
 
-size_t
-__filemap_copy_from_user_iovec_inatomic(char *vaddr,
+static size_t __iovec_copy_from_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
 {
size_t copied = 0, left = 0;
@@ -1731,6 +1730,110 @@ __filemap_copy_from_user_iovec_inatomic(
 }
 
 /*
+ * Copy as much as we can into the page and return the number of bytes which
+ * were successfully copied.  If a fault is encountered then return the
+ * number of bytes which were copied.
+ */
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+   char *kaddr;
+   size_t copied;
+
+   BUG_ON(!in_atomic());
+   kaddr = kmap_atomic(page, KM_USER0);
+   if (likely(i-nr_segs == 1)) {
+   int left;
+   char __user *buf = i-iov-iov_base + i-iov_offset;
+   left = __copy_from_user_inatomic_nocache(kaddr + offset,
+   buf, bytes);
+   copied = bytes - left;
+   } else {
+   copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+   i-iov, i-iov_offset, bytes);
+   }
+   kunmap_atomic(kaddr, KM_USER0);
+
+   return copied;
+}
+
+/*
+ * This has the same sideeffects and return value as
+ * iov_iter_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
+ * Page must not be locked.
+ */
+size_t iov_iter_copy_from_user(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+   char *kaddr;
+   size_t copied;
+
+   kaddr = kmap(page);
+   if (likely(i-nr_segs == 1)) {
+   int left;
+   char __user *buf = i-iov-iov_base + i-iov_offset;
+   left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+   copied = bytes - left;
+   } else {
+   copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+   i-iov, i-iov_offset, bytes);
+   }
+   kunmap(page);
+   return copied;
+}
+
+static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
+{
+   if (likely(i-nr_segs == 1)) {
+   i-iov_offset += bytes;
+   } else {
+   const struct iovec *iov = i-iov;
+   size_t base = i-iov_offset;
+
+   while (bytes) {
+   int copy = min(bytes, iov-iov_len - base);
+
+   bytes -= copy;
+  

[patch 31/41] With reiserfs no longer using the weird generic_cont_expand, remove it completely.

2007-05-25 Thread npiggin
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

---
 fs/buffer.c |   20 
 include/linux/buffer_head.h |1 -
 2 files changed, 21 deletions(-)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2189,25 +2189,6 @@ out:
return err;
 }
 
-int generic_cont_expand(struct inode *inode, loff_t size)
-{
-   unsigned int offset;
-
-   offset = (size  (PAGE_CACHE_SIZE - 1)); /* Within page */
-
-   /* ugh.  in prepare/commit_write, if from==to==start of block, we
-* skip the prepare.  make sure we never send an offset for the start
-* of a block.
-* XXX: actually, this should be handled in those filesystems by
-* checking for the AOP_FLAG_CONT_EXPAND flag.
-*/
-   if ((offset  (inode-i_sb-s_blocksize - 1)) == 0) {
-   /* caller must handle this extra byte. */
-   size++;
-   }
-   return generic_cont_expand_simple(inode, size);
-}
-
 int cont_expand_zero(struct file *file, struct address_space *mapping,
loff_t pos, loff_t *bytes)
 {
@@ -3135,7 +3116,6 @@ EXPORT_SYMBOL(file_fsync);
 EXPORT_SYMBOL(fsync_bdev);
 EXPORT_SYMBOL(generic_block_bmap);
 EXPORT_SYMBOL(generic_commit_write);
-EXPORT_SYMBOL(generic_cont_expand);
 EXPORT_SYMBOL(generic_cont_expand_simple);
 EXPORT_SYMBOL(init_buffer);
 EXPORT_SYMBOL(invalidate_bdev);
Index: linux-2.6/include/linux/buffer_head.h
===
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -217,7 +217,6 @@ int block_prepare_write(struct page*, un
 int cont_write_begin(struct file *, struct address_space *, loff_t,
unsigned, unsigned, struct page **, void **,
get_block_t *, loff_t *);
-int generic_cont_expand(struct inode *inode, loff_t size);
 int generic_cont_expand_simple(struct inode *inode, loff_t size);
 int block_commit_write(struct page *page, unsigned from, unsigned to);
 void block_sync_page(struct page *);



[patch 09/41] mm: fix pagecache write deadlocks

2007-05-25 Thread npiggin

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.prepare_write
4. unlock_page+vmtruncate
5. copy_from_user
6.  mmap_sem(r)
7.   handle_mm_fault
8.lock_page (filemap_nopage)
9.commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.get_user_pages
e. handle_mm_fault
f.  lock_page (filemap_nopage)

2,8 - recursive deadlock if page is same
2,8;2,8 - ABBA deadlock if page is different
2,6;b,f - ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
atomic usercopies which do not take pagefaults and do not zero the uncopied
tail of the destination. The destination is already uptodate, so we can
commit_write the full length even if there was a partial copy: it does not
matter that the tail was not modified, because if it is dirtied and written
back to disk it will not cause any problems (uptodate *means* that the
destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping and the lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
made slightly more optimal), then allocate a temporary page to copy the
source data into. Relock the destination page and continue with the copy.
However, instead of a usercopy (which might take a fault), copy the data
from the pinned temporary page via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

This increases the CPU/memory copy cost by almost 50% on the affected
workloads. That will be solved by introducing a new set of pagecache write
aops in a subsequent patch.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
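
The resulting copy step looks roughly like the sketch below. This is
illustrative only, not the exact code in the hunks: atomic_usercopy(),
copy_into_temp_page() and copy_from_temp_page() are hypothetical stand-ins
for the non-faulting usercopy and the bounce-buffer helpers.

        if (PageUptodate(page)) {
                /*
                 * Destination already uptodate: a non-faulting (atomic)
                 * usercopy may come up short, but a short copy into an
                 * uptodate page leaves no stale data behind, so it can
                 * simply be retried with fewer bytes.
                 */
                copied = atomic_usercopy(page, offset, buf, bytes);
        } else {
                /*
                 * Destination not uptodate: we must not fault with the page
                 * locked, so unlock it, copy the user data into a pinned
                 * temporary page, then relock and copy via the kernel
                 * mapping, which cannot fault.
                 */
                unlock_page(page);
                copied = copy_into_temp_page(src_page, buf, bytes);
                lock_page(page);
                copy_from_temp_page(page, offset, src_page, copied);
        }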

 include/linux/pagemap.h |   11 +++-
 mm/filemap.c|  122 
 2 files changed, 112 insertions(+), 21 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1900,11 +1900,12 @@ generic_file_buffered_write(struct kiocb
filemap_set_next_iovec(cur_iov, nr_segs, iov_offset, written);
 
do {
+   struct page *src_page;
struct page *page;
pgoff_t index;  /* Pagecache index for current page */
unsigned long offset;   /* Offset into pagecache page */
-   unsigned long maxlen;   /* Bytes remaining in current iovec */
-   size_t bytes;   /* Bytes to write to page */
+   unsigned long seglen;   /* Bytes remaining in current iovec */
+   unsigned long bytes;/* Bytes to write to page */
size_t copied;  /* Bytes copied from user */
 
buf = cur_iov-iov_base + iov_offset;
@@ -1914,20 +1915,30 @@ generic_file_buffered_write(struct kiocb
if (bytes  count)
bytes = count;
 
-   maxlen = cur_iov-iov_len - iov_offset;
-   if (maxlen  bytes)
-   maxlen = bytes;
+   /*
+* a non-NULL src_page indicates that we're doing the
+* copy via get_user_pages and kmap.
+*/
+   src_page = NULL;
+
+   seglen = cur_iov-iov_len - iov_offset;
+   if (seglen  bytes)
+   seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
/*
 * Bring in the user page that we will copy from _first_.
 * Otherwise there's a nasty deadlock on copying from the
 * same page as we're writing to, without it being marked
 * up-to-date.
+*
+* Not only is this an optimisation, but it is also required
+* to check that the address is actually valid, when atomic
+* usercopies are used, below.
 */
-   fault_in_pages_readable(buf, maxlen);
-#endif
-
+   if (unlikely(fault_in_pages_readable(buf, seglen))) {
+   status = -EFAULT;
+   break;
+   }
 
page = __grab_cache_page(mapping, index);
if (!page) {
@@ -1935,32 +1946,104 @@ generic_file_buffered_write(struct kiocb
break;
}
 
+   /*
+* non-uptodate pages 

[patch 07/41] mm: buffered write cleanup

2007-05-25 Thread npiggin

Quite a bit of code is used in maintaining these cached pages, which are
probably pretty unlikely to ever get used: it would require a narrow race in
which the page is inserted concurrently while this process is allocating a
page, in order to create the spare page, and then a multi-page write into an
uncached part of the file to actually make use of it.

Next, the buffered write path (and others) uses its own LRU pagevec when it
should just be using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).

[this gets rid of some cond_resched() calls in readahead.c and mpage.c due
 to clashes in -mm. What put them there, and why? ]

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/mpage.c |   10 ---
 mm/filemap.c   |  145 ++---
 mm/readahead.c |   24 +++--
 3 files changed, 66 insertions(+), 113 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -666,27 +666,22 @@ EXPORT_SYMBOL(find_lock_page);
 struct page *find_or_create_page(struct address_space *mapping,
unsigned long index, gfp_t gfp_mask)
 {
-   struct page *page, *cached_page = NULL;
+   struct page *page;
int err;
 repeat:
page = find_lock_page(mapping, index);
if (!page) {
-   if (!cached_page) {
-   cached_page =
-   __page_cache_alloc(gfp_mask);
-   if (!cached_page)
-   return NULL;
-   }
-   err = add_to_page_cache_lru(cached_page, mapping,
-   index, gfp_mask);
-   if (!err) {
-   page = cached_page;
-   cached_page = NULL;
-   } else if (err == -EEXIST)
-   goto repeat;
+   page = __page_cache_alloc(gfp_mask);
+   if (!page)
+   return NULL;
+   err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+   if (unlikely(err)) {
+   page_cache_release(page);
+   page = NULL;
+   if (err == -EEXIST)
+   goto repeat;
+   }
}
-   if (cached_page)
-   page_cache_release(cached_page);
return page;
 }
 EXPORT_SYMBOL(find_or_create_page);
@@ -883,11 +878,9 @@ void do_generic_mapping_read(struct addr
unsigned long prev_index;
unsigned int prev_offset;
loff_t isize;
-   struct page *cached_page;
int error;
struct file_ra_state ra = *_ra;
 
-   cached_page = NULL;
index = *ppos  PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_index;
@@ -1059,23 +1052,20 @@ no_cached_page:
 * Ok, it wasn't cached, so we need to create a new
 * page..
 */
-   if (!cached_page) {
-   cached_page = page_cache_alloc_cold(mapping);
-   if (!cached_page) {
-   desc-error = -ENOMEM;
-   goto out;
-   }
+   page = page_cache_alloc_cold(mapping);
+   if (!page) {
+   desc-error = -ENOMEM;
+   goto out;
}
-   error = add_to_page_cache_lru(cached_page, mapping,
+   error = add_to_page_cache_lru(page, mapping,
index, GFP_KERNEL);
if (error) {
+   page_cache_release(page);
if (error == -EEXIST)
goto find_page;
desc-error = error;
goto out;
}
-   page = cached_page;
-   cached_page = NULL;
goto readpage;
}
 
@@ -1084,8 +1074,6 @@ out:
_ra-prev_index = prev_index;
 
*ppos = ((loff_t) index  PAGE_CACHE_SHIFT) + offset;
-   if (cached_page)
-   page_cache_release(cached_page);
if (filp)
file_accessed(filp);
 }
@@ -1573,35 +1561,28 @@ static struct page *__read_cache_page(st
int (*filler)(void *,struct page*),
void *data)
 {
-   struct page *page, *cached_page = NULL;
+   struct page *page;
int err;
 repeat:

[patch 28/41] reiserfs use generic write.

2007-05-25 Thread npiggin
From: Vladimir Saveliev [EMAIL PROTECTED]

Make reiserfs write via the generic routines.
The original reiserfs write path, which was optimized for big writes, is
deadlock prone.

Signed-off-by: Vladimir Saveliev [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

---
 fs/reiserfs/file.c | 1240 -
 1 file changed, 1 insertion(+), 1239 deletions(-)

Index: linux-2.6/fs/reiserfs/file.c
===
--- linux-2.6.orig/fs/reiserfs/file.c
+++ linux-2.6/fs/reiserfs/file.c
@@ -153,608 +153,6 @@ static int reiserfs_sync_file(struct fil
return (n_err  0) ? -EIO : 0;
 }
 
-/* I really do not want to play with memory shortage right now, so
-   to simplify the code, we are not going to write more than this much pages at
-   a time. This still should considerably improve performance compared to 4k
-   at a time case. This is 32 pages of 4k size. */
-#define REISERFS_WRITE_PAGES_AT_A_TIME (128 * 1024) / PAGE_CACHE_SIZE
-
-/* Allocates blocks for a file to fulfil write request.
-   Maps all unmapped but prepared pages from the list.
-   Updates metadata with newly allocated blocknumbers as needed */
-static int reiserfs_allocate_blocks_for_region(struct 
reiserfs_transaction_handle *th, struct inode *inode,/* Inode we work with 
*/
-  loff_t pos,  /* Writing 
position */
-  int num_pages,   /* number of 
pages write going
-  to touch */
-  int write_bytes, /* amount of 
bytes to write */
-  struct page **prepared_pages,
/* array of
-   
   prepared pages
-   
 */
-  int blocks_to_allocate   /* 
Amount of blocks we
-  need 
to allocate to
-  fit 
the data into file
-*/
-)
-{
-   struct cpu_key key; // cpu key of item that we are going to deal 
with
-   struct item_head *ih;   // pointer to item head that we are going to 
deal with
-   struct buffer_head *bh; // Buffer head that contains items that we are 
going to deal with
-   __le32 *item;   // pointer to item we are going to deal with
-   INITIALIZE_PATH(path);  // path to item, that we are going to deal with.
-   b_blocknr_t *allocated_blocks;  // Pointer to a place where allocated 
blocknumbers would be stored.
-   reiserfs_blocknr_hint_t hint;   // hint structure for block allocator.
-   size_t res; // return value of various functions that we 
call.
-   int curr_block; // current block used to keep track of unmapped 
blocks.
-   int i;  // loop counter
-   int itempos;// position in item
-   unsigned int from = (pos  (PAGE_CACHE_SIZE - 1));  // writing 
position in
-   // first page
-   unsigned int to = ((pos + write_bytes - 1)  (PAGE_CACHE_SIZE - 1)) + 
1;/* last modified byte offset in last page */
-   __u64 hole_size;// amount of blocks for a file hole, if it 
needed to be created.
-   int modifying_this_item = 0;// Flag for items traversal code to 
keep track
-   // of the fact that we already prepared
-   // current block for journal
-   int will_prealloc = 0;
-   RFALSE(!blocks_to_allocate,
-  green-9004: tried to allocate zero blocks?);
-
-   /* only preallocate if this is a small write */
-   if (REISERFS_I(inode)-i_prealloc_count ||
-   (!(write_bytes  (inode-i_sb-s_blocksize - 1)) 
-blocks_to_allocate 
-REISERFS_SB(inode-i_sb)-s_alloc_options.preallocsize))
-   will_prealloc =
-   REISERFS_SB(inode-i_sb)-s_alloc_options.preallocsize;
-
-   allocated_blocks = kmalloc((blocks_to_allocate + will_prealloc) *
-  sizeof(b_blocknr_t), GFP_NOFS);
-   if (!allocated_blocks)
-   return -ENOMEM;
-
-   /* First we compose a key to point at the writing position, we want to 
do
-  that outside of any locking region. */
-   make_cpu_key(key, inode, pos + 1, TYPE_ANY, 3 /*key length */ );
-
-   /* If we came here, it means we absolutely need to open a transaction,
-  since we need to allocate some blocks */
-   reiserfs_write_lock(inode-i_sb);   // Journaling stuff and we need 
that.
-   res = journal_begin(th, inode-i_sb, JOURNAL_PER_BALANCE_CNT * 3 + 1 + 
2 * 

[patch 12/41] fs: introduce write_begin, write_end, and perform_write aops

2007-05-25 Thread npiggin
These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

API design contributions, code review and fixes. 

Signed-off-by: Mark Fasheh [EMAIL PROTECTED]
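
For a plain block-backed filesystem the conversion usually reduces to the
pattern below (a sketch using hypothetical myfs names, not code from this
patch). In-kernel callers such as loop and splice go through
pagecache_write_begin()/pagecache_write_end() rather than calling the
address_space operations directly.

static int myfs_write_begin(struct file *file, struct address_space *mapping,
                loff_t pos, unsigned len, unsigned flags,
                struct page **pagep, void **fsdata)
{
        *pagep = NULL;  /* let block_write_begin find and create the page */
        return block_write_begin(file, mapping, pos, len, flags,
                                 pagep, fsdata, myfs_get_block);
}

static const struct address_space_operations myfs_aops = {
        .readpage       = myfs_readpage,        /* unchanged */
        .writepage      = myfs_writepage,       /* unchanged */
        .write_begin    = myfs_write_begin,
        .write_end      = generic_write_end,
};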

 Documentation/filesystems/Locking |9 -
 Documentation/filesystems/vfs.txt |   45 +++
 drivers/block/loop.c  |   75 ---
 fs/buffer.c   |  198 ++-
 fs/libfs.c|   44 +++
 fs/namei.c|   47 +--
 fs/revoked_inode.c|   14 ++
 fs/splice.c   |   70 +--
 include/linux/buffer_head.h   |   10 +
 include/linux/fs.h|   30 
 include/linux/pagemap.h   |2 
 mm/filemap.c  |  238 +-
 12 files changed, 576 insertions(+), 206 deletions(-)

Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -409,6 +409,8 @@ enum positive_aop_returns {
AOP_TRUNCATED_PAGE  = 0x80001,
 };
 
+#define AOP_FLAG_UNINTERRUPTIBLE   0x0001 /* will not do a short write */
+
 /*
  * oh the beauties of C type declarations.
  */
@@ -428,7 +430,7 @@ size_t iov_iter_copy_from_user_atomic(st
 size_t iov_iter_copy_from_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i);
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(struct iov_iter *i);
 
 static inline void iov_iter_init(struct iov_iter *i,
@@ -469,6 +471,14 @@ struct address_space_operations {
 */
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+
+   int (*write_begin)(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata);
+   int (*write_end)(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata);
+
/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned long);
@@ -483,6 +493,18 @@ struct address_space_operations {
int (*launder_page) (struct page *);
 };
 
+/*
+ * pagecache_write_begin/pagecache_write_end must be used by general code
+ * to write into the pagecache.
+ */
+int pagecache_write_begin(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata);
+
+int pagecache_write_end(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata);
+
 struct backing_dev_info;
 struct address_space {
struct inode*host;  /* owner: inode, block_device */
@@ -1894,6 +1916,12 @@ extern int simple_prepare_write(struct f
unsigned offset, unsigned to);
 extern int simple_commit_write(struct file *file, struct page *page,
unsigned offset, unsigned to);
+extern int simple_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata);
+extern int simple_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata);
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct 
nameidata *);
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t 
*);
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1814,11 +1814,10 @@ void iov_iter_advance(struct iov_iter *i
i-count -= bytes;
 }
 
-int iov_iter_fault_in_readable(struct iov_iter *i)
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 {
-   size_t seglen = min(i-iov-iov_len - i-iov_offset, i-count);
char __user *buf = i-iov-iov_base + 

[patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6

2007-05-25 Thread npiggin
From: Andrew Morton [EMAIL PROTECTED]

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |9 +
 mm/filemap.h |4 ++--
 2 files changed, 3 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1969,12 +1969,6 @@ generic_file_buffered_write(struct kiocb
break;
}
 
-   if (unlikely(bytes == 0)) {
-   status = 0;
-   copied = 0;
-   goto zero_length_segment;
-   }
-
status = a_ops-prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
loff_t isize = i_size_read(inode);
@@ -2004,8 +1998,7 @@ generic_file_buffered_write(struct kiocb
page_cache_release(page);
continue;
}
-zero_length_segment:
-   if (likely(copied = 0)) {
+   if (likely(copied  0)) {
if (!status)
status = copied;
 
Index: linux-2.6/mm/filemap.h
===
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
const struct iovec *iov = *iovp;
size_t base = *basep;
 
-   do {
+   while (bytes) {
int copy = min(bytes, iov-iov_len - base);
 
bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
iov++;
base = 0;
}
-   } while (bytes);
+   }
*iovp = iov;
*basep = base;
 }



[patch 11/41] fs: fix data-loss on error

2007-05-25 Thread npiggin

New buffers against uptodate pages are simply marked uptodate, while the
buffer_new bit remains set. This causes the error-case code to zero out parts
of those buffers because it thinks they contain stale data: wrong, they are
actually uptodate, so this is a data-loss situation.

Fix this by actually clearing buffer_new and marking the buffer dirty. It
makes sense to always clear buffer_new before setting a buffer uptodate.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
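
The resulting check in __block_prepare_write, consolidated into one sketch
(the two-line hunk below is the actual change):

        if (buffer_new(bh) && PageUptodate(page)) {
                /*
                 * The page already holds valid data for this block, so the
                 * buffer is not really "new": clear the flag so error paths
                 * will not zero it, mark it uptodate, and dirty it so the
                 * valid data reaches the newly allocated block on disk.
                 */
                clear_buffer_new(bh);
                set_buffer_uptodate(bh);
                mark_buffer_dirty(bh);
        }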

 fs/buffer.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1816,7 +1816,9 @@ static int __block_prepare_write(struct 
unmap_underlying_metadata(bh-b_bdev,
bh-b_blocknr);
if (PageUptodate(page)) {
+   clear_buffer_new(bh);
set_buffer_uptodate(bh);
+   mark_buffer_dirty(bh);
continue;
}
if (block_end  to || block_start  from) {



[patch 27/41] qnx4 convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/qnx4/inode.c |   21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/qnx4/inode.c
===
--- linux-2.6.orig/fs/qnx4/inode.c
+++ linux-2.6/fs/qnx4/inode.c
@@ -433,16 +433,21 @@ static int qnx4_writepage(struct page *p
 {
return block_write_full_page(page,qnx4_get_block, wbc);
 }
+
 static int qnx4_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,qnx4_get_block);
 }
-static int qnx4_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
-{
-   struct qnx4_inode_info *qnx4_inode = qnx4_i(page-mapping-host);
-   return cont_prepare_write(page, from, to, qnx4_get_block,
- qnx4_inode-mmu_private);
+
+static int qnx4_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   struct qnx4_inode_info *qnx4_inode = qnx4_i(mapping-host);
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   qnx4_get_block,
+   qnx4_inode-mmu_private);
 }
 static sector_t qnx4_bmap(struct address_space *mapping, sector_t block)
 {
@@ -452,8 +457,8 @@ static const struct address_space_operat
.readpage   = qnx4_readpage,
.writepage  = qnx4_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = qnx4_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= qnx4_write_begin,
+   .write_end  = generic_write_end,
.bmap   = qnx4_bmap
 };
 



[patch 06/41] mm: trim more holes

2007-05-25 Thread npiggin

If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
we may have failed the write operation despite prepare_write having
instantiated blocks past i_size. Fix this, and consolidate the trimming into
one place.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
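
The consolidated error path has roughly the following shape (an illustrative
sketch of the idea; the hunks below show the real code):

fs_write_aop_error:
        if (status != AOP_TRUNCATED_PAGE)
                unlock_page(page);
        page_cache_release(page);

        /*
         * prepare_write() may have instantiated a few blocks outside
         * i_size; trim them off again no matter which aop call failed.
         */
        if (pos + bytes > inode->i_size)
                vmtruncate(inode, inode->i_size);

        if (status == AOP_TRUNCATED_PAGE)
                continue;       /* retry this iteration of the write loop */
        break;                  /* give up; 'written' reflects progress */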

 mm/filemap.c |   80 +--
 1 file changed, 40 insertions(+), 40 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1969,22 +1969,9 @@ generic_file_buffered_write(struct kiocb
}
 
status = a_ops-prepare_write(file, page, offset, offset+bytes);
-   if (unlikely(status)) {
-   loff_t isize = i_size_read(inode);
+   if (unlikely(status))
+   goto fs_write_aop_error;
 
-   if (status != AOP_TRUNCATED_PAGE)
-   unlock_page(page);
-   page_cache_release(page);
-   if (status == AOP_TRUNCATED_PAGE)
-   continue;
-   /*
-* prepare_write() may have instantiated a few blocks
-* outside i_size.  Trim these off again.
-*/
-   if (pos + bytes  isize)
-   vmtruncate(inode, isize);
-   break;
-   }
if (likely(nr_segs == 1))
copied = filemap_copy_from_user(page, offset,
buf, bytes);
@@ -1993,40 +1980,53 @@ generic_file_buffered_write(struct kiocb
cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops-commit_write(file, page, offset, offset+bytes);
-   if (status == AOP_TRUNCATED_PAGE) {
-   page_cache_release(page);
-   continue;
+   if (unlikely(status  0 || status == AOP_TRUNCATED_PAGE))
+   goto fs_write_aop_error;
+   if (unlikely(copied != bytes)) {
+   status = -EFAULT;
+   goto fs_write_aop_error;
}
-   if (likely(copied  0)) {
-   if (!status)
-   status = copied;
+   if (unlikely(status  0)) /* filesystem did partial write */
+   copied = status;
 
-   if (status = 0) {
-   written += status;
-   count -= status;
-   pos += status;
-   buf += status;
-   if (unlikely(nr_segs  1)) {
-   filemap_set_next_iovec(cur_iov,
-   iov_offset, status);
-   if (count)
-   buf = cur_iov-iov_base +
-   iov_offset;
-   } else {
-   iov_offset += status;
-   }
+   if (likely(copied  0)) {
+   written += copied;
+   count -= copied;
+   pos += copied;
+   buf += copied;
+   if (unlikely(nr_segs  1)) {
+   filemap_set_next_iovec(cur_iov,
+   iov_offset, copied);
+   if (count)
+   buf = cur_iov-iov_base + iov_offset;
+   } else {
+   iov_offset += copied;
}
}
-   if (unlikely(copied != bytes))
-   if (status = 0)
-   status = -EFAULT;
unlock_page(page);
mark_page_accessed(page);
page_cache_release(page);
-   if (status  0)
-   break;
balance_dirty_pages_ratelimited(mapping);
cond_resched();
+   continue;
+
+fs_write_aop_error:
+   if (status != AOP_TRUNCATED_PAGE)
+   unlock_page(page);
+   page_cache_release(page);
+
+   /*
+* prepare_write() may have instantiated a few blocks
+* outside i_size.  Trim these off again. Don't need
+* i_size_read because we hold i_mutex.
+*/
+   

[patch 05/41] mm: debug write deadlocks

2007-05-25 Thread npiggin

Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
Makes the race much easier to hit.

This is useful for demonstration and testing purposes, but is removed in a
subsequent patch.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1952,6 +1952,7 @@ generic_file_buffered_write(struct kiocb
if (maxlen  bytes)
maxlen = bytes;
 
+#ifndef CONFIG_DEBUG_VM
/*
 * Bring in the user page that we will copy from _first_.
 * Otherwise there's a nasty deadlock on copying from the
@@ -1959,6 +1960,7 @@ generic_file_buffered_write(struct kiocb
 * up-to-date.
 */
fault_in_pages_readable(buf, maxlen);
+#endif
 
page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
if (!page) {



[patch 24/41] hfsplus convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hfsplus/extents.c |   21 +
 fs/hfsplus/inode.c   |   20 
 2 files changed, 21 insertions(+), 20 deletions(-)

Index: linux-2.6/fs/hfsplus/inode.c
===
--- linux-2.6.orig/fs/hfsplus/inode.c
+++ linux-2.6/fs/hfsplus/inode.c
@@ -27,10 +27,14 @@ static int hfsplus_writepage(struct page
return block_write_full_page(page, hfsplus_get_block, wbc);
 }
 
-static int hfsplus_prepare_write(struct file *file, struct page *page, 
unsigned from, unsigned to)
-{
-   return cont_prepare_write(page, from, to, hfsplus_get_block,
-   HFSPLUS_I(page-mapping-host).phys_size);
+static int hfsplus_write_begin(struct file *file, struct address_space 
*mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   hfsplus_get_block,
+   HFSPLUS_I(mapping-host).phys_size);
 }
 
 static sector_t hfsplus_bmap(struct address_space *mapping, sector_t block)
@@ -114,8 +118,8 @@ const struct address_space_operations hf
.readpage   = hfsplus_readpage,
.writepage  = hfsplus_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfsplus_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfsplus_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfsplus_bmap,
.releasepage= hfsplus_releasepage,
 };
@@ -124,8 +128,8 @@ const struct address_space_operations hf
.readpage   = hfsplus_readpage,
.writepage  = hfsplus_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfsplus_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfsplus_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfsplus_bmap,
.direct_IO  = hfsplus_direct_IO,
.writepages = hfsplus_writepages,
Index: linux-2.6/fs/hfsplus/extents.c
===
--- linux-2.6.orig/fs/hfsplus/extents.c
+++ linux-2.6/fs/hfsplus/extents.c
@@ -443,21 +443,18 @@ void hfsplus_file_truncate(struct inode 
if (inode-i_size  HFSPLUS_I(inode).phys_size) {
struct address_space *mapping = inode-i_mapping;
struct page *page;
-   u32 size = inode-i_size - 1;
+   void *fsdata;
+   u32 size = inode-i_size;
int res;
 
-   page = grab_cache_page(mapping, size  PAGE_CACHE_SHIFT);
-   if (!page)
-   return;
-   size = PAGE_CACHE_SIZE - 1;
-   size++;
-   res = mapping-a_ops-prepare_write(NULL, page, size, size);
-   if (!res)
-   res = mapping-a_ops-commit_write(NULL, page, size, 
size);
+   res = pagecache_write_begin(NULL, mapping, size, 0,
+   AOP_FLAG_UNINTERRUPTIBLE,
+   page, fsdata);
if (res)
-   inode-i_size = HFSPLUS_I(inode).phys_size;
-   unlock_page(page);
-   page_cache_release(page);
+   return;
+   res = pagecache_write_end(NULL, mapping, size, 0, 0, page, 
fsdata);
+   if (res  0)
+   return;
mark_inode_dirty(inode);
return;
} else if (inode-i_size == HFSPLUS_I(inode).phys_size)



[patch 03/41] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

2007-05-25 Thread npiggin
From: Andrew Morton [EMAIL PROTECTED]

This patch fixed the following bug:

  When prefaulting in the pages in generic_file_buffered_write(), we only
  faulted in the pages for the first segment of the iovec.  If the second or a
  subsequent segment described an mmapping of the page into which we're
  write()ing, and that page is not up-to-date, the fault handler tries to lock
  the already-locked page (to bring it up to date) and deadlocks.

  An exploit for this bug is in writev-deadlock-demo.c, in
  http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

  (These demos assume blocksize < PAGE_CACHE_SIZE).

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment.  So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once. The other problem is that the fix doesn't solve all the locking
problems.


insert numbers obtained via ext3-tools's writev-speed.c here

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1939,21 +1939,14 @@ generic_file_buffered_write(struct kiocb
do {
unsigned long index;
unsigned long offset;
+   unsigned long maxlen;
size_t copied;
 
offset = (pos  (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos  PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
-
-   /* Limit the size of the copy to the caller's write size */
-   bytes = min(bytes, count);
-
-   /*
-* Limit the size of the copy to that of the current segment,
-* because fault_in_pages_readable() doesn't know how to walk
-* segments.
-*/
-   bytes = min(bytes, cur_iov-iov_len - iov_base);
+   if (bytes  count)
+   bytes = count;
 
/*
 * Bring in the user page that we will copy from _first_.
@@ -1961,7 +1954,10 @@ generic_file_buffered_write(struct kiocb
 * same page as we're writing to, without it being marked
 * up-to-date.
 */
-   fault_in_pages_readable(buf, bytes);
+   maxlen = cur_iov-iov_len - iov_base;
+   if (maxlen  bytes)
+   maxlen = bytes;
+   fault_in_pages_readable(buf, maxlen);
 
page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
if (!page) {



[patch 23/41] hfs convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hfs/extent.c |   19 ---
 fs/hfs/inode.c  |   20 
 2 files changed, 20 insertions(+), 19 deletions(-)

Index: linux-2.6/fs/hfs/inode.c
===
--- linux-2.6.orig/fs/hfs/inode.c
+++ linux-2.6/fs/hfs/inode.c
@@ -35,10 +35,14 @@ static int hfs_readpage(struct file *fil
return block_read_full_page(page, hfs_get_block);
 }
 
-static int hfs_prepare_write(struct file *file, struct page *page, unsigned 
from, unsigned to)
-{
-   return cont_prepare_write(page, from, to, hfs_get_block,
- HFS_I(page->mapping->host)->phys_size);
+static int hfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   hfs_get_block,
+   HFS_I(mapping->host)->phys_size);
 }
 
 static sector_t hfs_bmap(struct address_space *mapping, sector_t block)
@@ -119,8 +123,8 @@ const struct address_space_operations hf
.readpage   = hfs_readpage,
.writepage  = hfs_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfs_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfs_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfs_bmap,
.releasepage= hfs_releasepage,
 };
@@ -129,8 +133,8 @@ const struct address_space_operations hf
.readpage   = hfs_readpage,
.writepage  = hfs_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfs_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfs_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfs_bmap,
.direct_IO  = hfs_direct_IO,
.writepages = hfs_writepages,
Index: linux-2.6/fs/hfs/extent.c
===
--- linux-2.6.orig/fs/hfs/extent.c
+++ linux-2.6/fs/hfs/extent.c
@@ -464,23 +464,20 @@ void hfs_file_truncate(struct inode *ino
   (long long)HFS_I(inode)->phys_size, inode->i_size);
if (inode->i_size > HFS_I(inode)->phys_size) {
struct address_space *mapping = inode-i_mapping;
+   void *fsdata;
struct page *page;
int res;
 
+   /* XXX: Can use generic_cont_expand? */
size = inode->i_size - 1;
-   page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-   if (!page)
-   return;
-   size &= PAGE_CACHE_SIZE - 1;
-   size++;
-   res = mapping->a_ops->prepare_write(NULL, page, size, size);
-   if (!res)
-   res = mapping->a_ops->commit_write(NULL, page, size, size);
+   res = pagecache_write_begin(NULL, mapping, size+1, 0,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+   if (!res) {
+   res = pagecache_write_end(NULL, mapping, size+1, 0, 0,
+   page, fsdata);
+   }
if (res)
inode->i_size = HFS_I(inode)->phys_size;
-   unlock_page(page);
-   page_cache_release(page);
-   mark_inode_dirty(inode);
return;
} else if (inode->i_size == HFS_I(inode)->phys_size)
return;

-- 



[patch 35/41] hostfs convert to new aops.

2007-05-25 Thread npiggin
This also gets rid of a lot of useless read_file stuff, and optimises the
full-page write case by marking a !uptodate page uptodate.

Cc: Jeff Dike [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hostfs/hostfs_kern.c |   70 +++-
 1 file changed, 28 insertions(+), 42 deletions(-)

Index: linux-2.6/fs/hostfs/hostfs_kern.c
===
--- linux-2.6.orig/fs/hostfs/hostfs_kern.c
+++ linux-2.6/fs/hostfs/hostfs_kern.c
@@ -466,56 +466,42 @@ int hostfs_readpage(struct file *file, s
return err;
 }
 
-int hostfs_prepare_write(struct file *file, struct page *page,
-unsigned int from, unsigned int to)
+int hostfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   char *buffer;
-   long long start, tmp;
-   int err;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 
-   start = (long long) page-index  PAGE_CACHE_SHIFT;
-   buffer = kmap(page);
-   if(from != 0){
-   tmp = start;
-   err = read_file(FILE_HOSTFS_I(file)-fd, tmp, buffer,
-   from);
-   if(err  0) goto out;
-   }
-   if(to != PAGE_CACHE_SIZE){
-   start += to;
-   err = read_file(FILE_HOSTFS_I(file)-fd, start, buffer + to,
-   PAGE_CACHE_SIZE - to);
-   if(err  0) goto out;
-   }
-   err = 0;
- out:
-   kunmap(page);
-   return err;
+   *pagep = __grab_cache_page(mapping, index);
+   if (!*pagep)
+   return -ENOMEM;
+   return 0;
 }
 
-int hostfs_commit_write(struct file *file, struct page *page, unsigned from,
-unsigned to)
+int hostfs_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   struct address_space *mapping = page-mapping;
struct inode *inode = mapping-host;
-   char *buffer;
-   long long start;
-   int err = 0;
+   void *buffer;
+   unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+   int err;
 
-   start = (((long long) page-index)  PAGE_CACHE_SHIFT) + from;
buffer = kmap(page);
-   err = write_file(FILE_HOSTFS_I(file)-fd, start, buffer + from,
-to - from);
-   if(err  0) err = 0;
-
-   /* Actually, if !err, write_file has added to-from to start, so, despite
-* the appearance, we are comparing i_size against the _last_ written
-* location, as we should. */
+   err = write_file(FILE_HOSTFS_I(file)->fd, &pos, buffer + from, copied);
+   kunmap(page);
 
-   if(!err && (start > inode->i_size))
-   inode->i_size = start;
+   if (!PageUptodate(page) && err == PAGE_CACHE_SIZE)
+   SetPageUptodate(page);
+   unlock_page(page);
+   page_cache_release(page);
+
+   /* If err > 0, write_file has added err to pos, so we are comparing
+* i_size against the last byte written.
+*/
+   if (err > 0 && (pos > inode->i_size))
+   inode->i_size = pos;
 
-   kunmap(page);
return err;
 }
 
@@ -523,8 +509,8 @@ static const struct address_space_operat
.writepage  = hostfs_writepage,
.readpage   = hostfs_readpage,
.set_page_dirty = __set_page_dirty_nobuffers,
-   .prepare_write  = hostfs_prepare_write,
-   .commit_write   = hostfs_commit_write
+   .write_begin= hostfs_write_begin,
+   .write_end  = hostfs_write_end,
 };
 
 static int init_inode(struct inode *inode, struct dentry *dentry)

-- 



[patch 22/41] adfs convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/adfs/inode.c |   14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6/fs/adfs/inode.c
===
--- linux-2.6.orig/fs/adfs/inode.c
+++ linux-2.6/fs/adfs/inode.c
@@ -61,10 +61,14 @@ static int adfs_readpage(struct file *fi
return block_read_full_page(page, adfs_get_block);
 }
 
-static int adfs_prepare_write(struct file *file, struct page *page, unsigned 
int from, unsigned int to)
+static int adfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return cont_prepare_write(page, from, to, adfs_get_block,
-   ADFS_I(page->mapping->host)->mmu_private);
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   adfs_get_block,
+   ADFS_I(mapping->host)->mmu_private);
 }
 
 static sector_t _adfs_bmap(struct address_space *mapping, sector_t block)
@@ -76,8 +80,8 @@ static const struct address_space_operat
.readpage   = adfs_readpage,
.writepage  = adfs_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = adfs_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= adfs_write_begin,
+   .write_end  = generic_write_end,
.bmap   = _adfs_bmap
 };
 

-- 



[patch 39/41] sysv convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/sysv/dir.c   |   51 +++
 fs/sysv/itree.c |   23 +++
 2 files changed, 50 insertions(+), 24 deletions(-)

Index: linux-2.6/fs/sysv/itree.c
===
--- linux-2.6.orig/fs/sysv/itree.c
+++ linux-2.6/fs/sysv/itree.c
@@ -453,23 +453,38 @@ static int sysv_writepage(struct page *p
 {
return block_write_full_page(page,get_block,wbc);
 }
+
 static int sysv_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,get_block);
 }
-static int sysv_prepare_write(struct file *file, struct page *page, unsigned 
from, unsigned to)
+
+int __sysv_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page,from,to,get_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   get_block);
 }
+
+static int sysv_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return __sysv_write_begin(file, mapping, pos, len, flags, pagep, 
fsdata);
+}
+
 static sector_t sysv_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,get_block);
 }
+
 const struct address_space_operations sysv_aops = {
.readpage = sysv_readpage,
.writepage = sysv_writepage,
.sync_page = block_sync_page,
-   .prepare_write = sysv_prepare_write,
-   .commit_write = generic_commit_write,
+   .write_begin = sysv_write_begin,
+   .write_end = generic_write_end,
.bmap = sysv_bmap
 };
Index: linux-2.6/fs/sysv/dir.c
===
--- linux-2.6.orig/fs/sysv/dir.c
+++ linux-2.6/fs/sysv/dir.c
@@ -16,6 +16,7 @@
 #include linux/pagemap.h
 #include linux/highmem.h
 #include linux/smp_lock.h
+#include linux/swap.h
 #include sysv.h
 
 static int sysv_readdir(struct file *, void *, filldir_t);
@@ -37,16 +38,22 @@ static inline unsigned long dir_pages(st
return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
 }
 
-static int dir_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-   struct inode *dir = (struct inode *)page->mapping->host;
+   struct address_space *mapping = page->mapping;
+   struct inode *dir = mapping->host;
int err = 0;
 
-   page->mapping->a_ops->commit_write(NULL, page, from, to);
+   block_write_end(NULL, mapping, pos, len, len, page, NULL);
+   if (pos+len > dir->i_size) {
+   i_size_write(dir, pos+len);
+   mark_inode_dirty(dir);
+   }
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
unlock_page(page);
+   mark_page_accessed(page);
return err;
 }
 
@@ -186,7 +193,7 @@ int sysv_add_link(struct dentry *dentry,
unsigned long npages = dir_pages(dir);
unsigned long n;
char *kaddr;
-   unsigned from, to;
+   loff_t pos;
int err;
 
/* We take care of directory expansion in the same loop */
@@ -212,16 +219,17 @@ int sysv_add_link(struct dentry *dentry,
return -EINVAL;
 
 got_it:
-   from = (char*)de - (char*)page_address(page);
-   to = from + SYSV_DIRSIZE;
+   pos = (page->index << PAGE_CACHE_SHIFT) +
+   (char*)de - (char*)page_address(page);
lock_page(page);
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __sysv_write_begin(NULL, page->mapping, pos, SYSV_DIRSIZE,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
if (err)
goto out_unlock;
memcpy (de->name, name, namelen);
memset (de->name + namelen, 0, SYSV_DIRSIZE - namelen - 2);
de->inode = cpu_to_fs16(SYSV_SB(inode->i_sb), inode->i_ino);
-   err = dir_commit_chunk(page, from, to);
+   err = dir_commit_chunk(page, pos, SYSV_DIRSIZE);
dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(dir);
 out_page:
@@ -238,15 +246,15 @@ int sysv_delete_entry(struct sysv_dir_en
struct address_space *mapping = page->mapping;
struct inode *inode = (struct inode*)mapping->host;
char *kaddr = (char*)page_address(page);
-   unsigned from = (char*)de - kaddr;
-   unsigned to = from + SYSV_DIRSIZE;
+   loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char *)de - kaddr;
int err;
 
lock_page(page);
-   err = 

[patch 36/41] jffs2 convert to new aops.

2007-05-25 Thread npiggin
Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/jffs2/file.c |  105 +++-
 1 file changed, 66 insertions(+), 39 deletions(-)

Index: linux-2.6/fs/jffs2/file.c
===
--- linux-2.6.orig/fs/jffs2/file.c
+++ linux-2.6/fs/jffs2/file.c
@@ -19,10 +19,12 @@
 #include linux/jffs2.h
 #include nodelist.h
 
-static int jffs2_commit_write (struct file *filp, struct page *pg,
-  unsigned start, unsigned end);
-static int jffs2_prepare_write (struct file *filp, struct page *pg,
-   unsigned start, unsigned end);
+static int jffs2_write_end(struct file *filp, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *pg, void *fsdata);
+static int jffs2_write_begin(struct file *filp, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata);
 static int jffs2_readpage (struct file *filp, struct page *pg);
 
 int jffs2_fsync(struct file *filp, struct dentry *dentry, int datasync)
@@ -65,8 +67,8 @@ const struct inode_operations jffs2_file
 const struct address_space_operations jffs2_file_address_operations =
 {
.readpage = jffs2_readpage,
-   .prepare_write =jffs2_prepare_write,
-   .commit_write = jffs2_commit_write
+   .write_begin =  jffs2_write_begin,
+   .write_end =jffs2_write_end,
 };
 
 static int jffs2_do_readpage_nolock (struct inode *inode, struct page *pg)
@@ -119,15 +121,23 @@ static int jffs2_readpage (struct file *
return ret;
 }
 
-static int jffs2_prepare_write (struct file *filp, struct page *pg,
-   unsigned start, unsigned end)
+static int jffs2_write_begin(struct file *filp, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   struct inode *inode = pg->mapping->host;
+   struct page *pg;
+   struct inode *inode = mapping-host;
struct jffs2_inode_info *f = JFFS2_INODE_INFO(inode);
-   uint32_t pageofs = pg->index << PAGE_CACHE_SHIFT;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   uint32_t pageofs = pos  (PAGE_CACHE_SIZE - 1);
int ret = 0;
 
-   D1(printk(KERN_DEBUG jffs2_prepare_write()\n));
+   pg = __grab_cache_page(mapping, index);
+   if (!pg)
+   return -ENOMEM;
+   *pagep = pg;
+
+   D1(printk(KERN_DEBUG jffs2_write_begin()\n));
 
if (pageofs > inode->i_size) {
/* Make new hole frag from old EOF to new page */
@@ -142,7 +152,7 @@ static int jffs2_prepare_write (struct f
ret = jffs2_reserve_space(c, sizeof(ri), alloc_len,
  ALLOC_NORMAL, 
JFFS2_SUMMARY_INODE_SIZE);
if (ret)
-   return ret;
+   goto out_page;
 
down(f-sem);
memset(ri, 0, sizeof(ri));
@@ -172,7 +182,7 @@ static int jffs2_prepare_write (struct f
ret = PTR_ERR(fn);
jffs2_complete_reservation(c);
up(f-sem);
-   return ret;
+   goto out_page;
}
ret = jffs2_add_full_dnode_to_inode(c, f, fn);
if (f-metadata) {
@@ -181,65 +191,79 @@ static int jffs2_prepare_write (struct f
f-metadata = NULL;
}
if (ret) {
-   D1(printk(KERN_DEBUG Eep. add_full_dnode_to_inode() 
failed in prepare_write, returned %d\n, ret));
+   D1(printk(KERN_DEBUG Eep. add_full_dnode_to_inode() 
failed in write_begin, returned %d\n, ret));
jffs2_mark_node_obsolete(c, fn-raw);
jffs2_free_full_dnode(fn);
jffs2_complete_reservation(c);
up(f-sem);
-   return ret;
+   goto out_page;
}
jffs2_complete_reservation(c);
inode-i_size = pageofs;
up(f-sem);
}
 
-   /* Read in the page if it wasn't already present, unless it's a whole 
page */
-   if (!PageUptodate(pg)  (start || end  PAGE_CACHE_SIZE)) {
+   /*
+* Read in the page if it wasn't already present. Cannot optimize away
+* the whole page write case until jffs2_write_end can handle the
+* case of a short-copy.
+*/
+   if (!PageUptodate(pg)) {
down(f-sem);
ret = jffs2_do_readpage_nolock(inode, pg);
up(f-sem);
+   if (ret)
+

Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Stefan Bader

2007/5/25, Neil Brown [EMAIL PROTECTED]:


HOW DO MD or DM USE THIS


1/ striping devices.
 This includes md/raid0 md/linear dm-linear dm-stripe and probably
 others.

   These devices can easily support blkdev_issue_flush by simply
   calling blkdev_issue_flush on all component devices.



This ensures that all of the previous requests have been processed, but
does this guarantee they were successful? This might be too paranoid,
but if I understood the concept correctly the success of a barrier
request should indicate success of all previous requests between this
barrier and the last one.


   These devices would find it very hard to support BIO_RW_BARRIER.
   Doing this would require keeping track of all in-flight requests
   (which some, possibly all, of the above don't) and then:
 When a BIO_RW_BARRIER request arrives:
wait for all pending writes to complete
call blkdev_issue_flush on all devices
issue the barrier write to the target device(s)
   as BIO_RW_BARRIER,
if that is -EOPNOTSUP, re-issue, wait, flush.



I guess just keeping a count of submitted requests and errors since the
last barrier might be enough. As long as all of the underlying devices
at least support a flush, the dm device could pretend to support
BIO_RW_BARRIER, roughly as sketched below.
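
Roughly, that pretend-to-support sequence would look like the following.
Everything here (types and helper names) is made up for illustration; it is not
md or dm code, just the wait / flush / barrier-or-downgrade control flow written
out as runnable userspace C.

#include <errno.h>
#include <stdio.h>

struct fake_component { const char *name; int supports_barrier; };
struct fake_dev { struct fake_component *parts; int nparts; int inflight_writes; };

static void wait_for_inflight(struct fake_dev *d)
{
	/* stand-in for waiting on all previously submitted writes */
	printf("waiting for %d in-flight writes\n", d->inflight_writes);
	d->inflight_writes = 0;
}

static int flush_all(struct fake_dev *d)
{
	for (int i = 0; i < d->nparts; i++)
		printf("flush %s\n", d->parts[i].name); /* blkdev_issue_flush analogue */
	return 0;
}

static int submit_write(struct fake_dev *d, int barrier)
{
	if (barrier) {
		for (int i = 0; i < d->nparts; i++)
			if (!d->parts[i].supports_barrier)
				return -EOPNOTSUPP;
		printf("barrier write submitted\n");
	} else {
		printf("plain write submitted\n");
	}
	return 0;
}

static int emulate_barrier(struct fake_dev *d)
{
	int err;

	wait_for_inflight(d);
	flush_all(d);
	err = submit_write(d, 1);
	if (err == -EOPNOTSUPP) {		/* downgrade: write, wait, flush */
		err = submit_write(d, 0);
		if (!err) {
			wait_for_inflight(d);
			err = flush_all(d);
		}
	}
	return err;
}

int main(void)
{
	struct fake_component parts[] = {
		{ "sda", 1 },
		{ "sdb", 0 },	/* one member without barrier support */
	};
	struct fake_dev d = { parts, 2, 3 };

	return emulate_barrier(&d) ? 1 : 0;
}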



dm-linear and dm-stripe simply pass the BIO_RW_BARRIER flag down,
 which means data may not be flushed correctly:  the commit block
 might be written to one device before a preceding block is
 written to another device.


Hm, even worse: if the barrier requests accidentally end up on a
device that does support barriers and another one on the map doesn't.
Would any layer/fs above care to issue a flush call?


   I think the best approach for this class of devices is to return
   -EOPNOSUP.  If the filesystem does the wait (which they all do
   already) and the blkdev_issue_flush (which is easy to add), they
   don't need to support BIO_RW_BARRIER.


Without any additional code these really should report -EOPNOTSUPP. If
disaster strikes there is no way to make assumptions on the real state
on disk.


2/ Mirror devices.  This includes md/raid1 and dm-raid1.

   These device can trivially implement blkdev_issue_flush much like
   the striping devices, and can support BIO_RW_BARRIER to some
   extent.
   md/raid1 currently tries.  I'm not sure about dm-raid1.


I fear this is more broken than with linear and stripe. There is no code
to check the features of underlying devices, and the request itself
isn't sent forward; privately built requests (which do not have the
barrier flag) are sent instead...

3/ Multipath devices

Requests are sent to the same device but on different paths. So at
least with them the chance of one path supporting barriers but not
another one seems small (as long as the paths do not use completely
different transport layers). But passing on a request with the barrier
flag also doesn't seem to be a good idea since previous requests can
arrive at the device later.

IMHO the best way to handle barriers for dm would be to add the
sequence described to the generic mapping layer of dm (before calling
the targets mapping function). There is already some sort of counting
in-flight requests (suspend/resume needs that) and I guess the
downgrade could also be rather simple. If a flush call to the target
(mapped device) fails report -EOPNOTSUPP and stay that way (until next
boot).


So: some questions to help encourage response:




 - Is the approach to barriers taken by md appropriate?  Should dm
do the same?  Who will do that?


If my assumption about barrier semantics is true, then md also has to
somehow make sure all previous requests have _successfully_ completed.
In the mirror case I guess it is valid to report success if the mirror
itself is in a clean state, that is, all previous requests (and the
barrier) were successful on at least one mirror half and this state
can be recovered.

Question to dm-devel: What do people there think of the possible
generic implementation in dm.c?


 - The comment above blkdev_issue_flush says Caller must run
   wait_for_completion() on its own.  What does that mean?


Guess this means it initiates a flush but doesn't wait for completion.
So the caller must wait for the completion of the separate requests on
its own, doesn't it?


 - Are there other bit that we could handle better?
BIO_RW_FAILFAST?  BIO_RW_SYNC?  What exactly do they mean?


BIO_RW_FAILFAST: means the low-level driver shouldn't do much (or any)
error recovery. Mainly used by multipath targets to avoid long SCSI
recovery. This should just be propagated when passing requests on.

BIO_RW_SYNC: means this is a bio of a synchronous request. I don't
know whether there are more uses to it but this at least causes queues
to be flushed immediately instead of waiting for more requests for a
short time. Should also just be passed on. Otherwise performance gets
poor since something above will 

Re: [patch 27/41] qnx4 convert to new aops.

2007-05-25 Thread Anders Larsen
On 2007-05-25 14:22:11, [EMAIL PROTECTED] wrote:
 Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Acked-by: Anders Larsen [EMAIL PROTECTED]

(although we might just as well do away with the 'write' methods completely,
 since write-support is  BROKEN anyway)

Cheers
 Anders

 
  fs/qnx4/inode.c |   21 +
  1 file changed, 13 insertions(+), 8 deletions(-)
 
 Index: linux-2.6/fs/qnx4/inode.c
 ===
 --- linux-2.6.orig/fs/qnx4/inode.c
 +++ linux-2.6/fs/qnx4/inode.c
 @@ -433,16 +433,21 @@ static int qnx4_writepage(struct page *p
  {
   return block_write_full_page(page,qnx4_get_block, wbc);
  }
 +
  static int qnx4_readpage(struct file *file, struct page *page)
  {
   return block_read_full_page(page,qnx4_get_block);
  }
 -static int qnx4_prepare_write(struct file *file, struct page *page,
 -   unsigned from, unsigned to)
 -{
 - struct qnx4_inode_info *qnx4_inode = qnx4_i(page->mapping->host);
 - return cont_prepare_write(page, from, to, qnx4_get_block,
 -   qnx4_inode->mmu_private);
 +
 +static int qnx4_write_begin(struct file *file, struct address_space *mapping,
 + loff_t pos, unsigned len, unsigned flags,
 + struct page **pagep, void **fsdata)
 +{
 + struct qnx4_inode_info *qnx4_inode = qnx4_i(mapping->host);
 + *pagep = NULL;
 + return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
 + qnx4_get_block,
 + qnx4_inode->mmu_private);
  }
  static sector_t qnx4_bmap(struct address_space *mapping, sector_t block)
  {
 @@ -452,8 +457,8 @@ static const struct address_space_operat
   .readpage   = qnx4_readpage,
   .writepage  = qnx4_writepage,
   .sync_page  = block_sync_page,
 - .prepare_write  = qnx4_prepare_write,
 - .commit_write   = generic_commit_write,
 + .write_begin= qnx4_write_begin,
 + .write_end  = generic_write_end,
   .bmap   = qnx4_bmap
  };
  
 
 -- 



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Phillip Susi

Jens Axboe wrote:

A barrier write will include a flush, but it may also use the FUA bit to
ensure data is on platter. So the only situation where a fallback from a
barrier to flush would be valid, is if the device lied and told you it
could do FUA but it could not and that is the reason why the barrier
write failed. If that is the case, the block layer should stop using FUA
and fallback to flush-write-flush. And if it does that, then there's
never a valid reason to switch from using barrier writes to
blkdev_issue_flush() since both methods would either both work or both
fail.


IIRC, the FUA bit only forces THAT request to hit the platter before it 
is completed; it does not flush any previous requests still sitting in 
the write back queue.  Because all io before the barrier must be on the 
platter as well, setting the FUA bit on the barrier request means you 
don't have to follow it with a flush, but you still have to precede it 
with a flush.



It's not block layer breakage, it's a device issue.


How isn't it block layer breakage?  If the device does not support 
barriers, isn't it the job of the block layer ( probably the scheduler ) 
to fall back to flush-write-flush?





Re: [dm-devel] [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Phillip Susi

Neil Brown wrote:

There is no guarantee that a device can support BIO_RW_BARRIER - it is
always possible that a request will fail with EOPNOTSUPP.


Why is it not the job of the block layer to translate for broken devices 
and send them a flush/write/flush?



   These devices would find it very hard to support BIO_RW_BARRIER.
   Doing this would require keeping track of all in-flight requests
   (which some, possibly all, of the above don't) and then:


The device mapper keeps track of in flight requests already.  When 
switching tables it has to hold new requests and wait for in flight 
requests to complete before switching to the new table.  When it gets a 
barrier request it just needs to do the same thing, only not switch 
tables.



   I think the best approach for this class of devices is to return
   -EOPNOSUP.  If the filesystem does the wait (which they all do
   already) and the blkdev_issue_flush (which is easy to add), they
   don't need to support BIO_RW_BARRIER.


Why?  The personalities should just pass the BARRIER flag down to each 
underlying device, and the dm common code should wait for all in flight 
io to complete before sending the barrier to the personality.



For devices that don't support QUEUE_ORDERED_TAG (i.e. commands sent to
the controller can be tagged as barriers), SCSI will use the
SYNCHRONIZE_CACHE command to flush the cache after the barrier
request (a bit like the filesystem calling blkdev_issue_flush, but at


Don't you have to flush the cache BEFORE the barrier to ensure that 
previous IO is committed first, THEN the barrier write?





[patch 1/2] i_version update - vfs part

2007-05-25 Thread Jean noel Cordenner

Concerning the first part of the set, the i_version field of the inode
structure has been reused. The field has been redefined because the counter
has to be 64 bits.


The patch modifies the i_version field of the inode at the VFS layer.
The i_version field becomes a 64-bit counter that is set on inode creation and
that is incremented every time the inode data is modified (similarly to the
ctime time-stamp).
The aim is to fulfill an NFSv4 requirement from RFC 3530:
5.5.  Mandatory Attributes - Definitions
Name     #   Data Type   Access   Description
_____________________________________________
change   3   uint64      READ     A value created by the
         server that the client can use to determine if file
         data, directory contents or attributes of the object
         have been modified.  The server may return the object's
         time_metadata attribute for this attribute's value but
         only if the filesystem object can not be updated more
         frequently than the resolution of time_metadata.
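
To make the intended use of such a counter concrete, here is a toy userspace
illustration (no VFS or NFS code is involved; the struct and helper are
invented for the example): the client caches the value and later compares it,
instead of relying on timestamp resolution.

#include <stdint.h>
#include <stdio.h>

struct toy_inode {
	uint64_t i_version;	/* set to 1 at creation, bumped on each change */
	char data[32];
};

static void toy_write(struct toy_inode *ino, const char *s)
{
	snprintf(ino->data, sizeof(ino->data), "%s", s);
	ino->i_version++;	/* analogous to bumping alongside ctime */
}

int main(void)
{
	struct toy_inode ino = { .i_version = 1 };
	uint64_t cached = ino.i_version;	/* what an NFS client would remember */

	toy_write(&ino, "hello");

	if (ino.i_version != cached)
		printf("change attribute moved from %llu to %llu: cache is stale\n",
		       (unsigned long long)cached,
		       (unsigned long long)ino.i_version);
	return 0;
}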

Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED]

Index: linux-2.6.22-rc2-ext4-1/fs/binfmt_misc.c
===
--- linux-2.6.22-rc2-ext4-1.orig/fs/binfmt_misc.c   2007-05-25 
18:01:51.0 +0200
+++ linux-2.6.22-rc2-ext4-1/fs/binfmt_misc.c2007-05-25 18:01:56.0 
+0200
@@ -508,6 +508,7 @@
inode-i_blocks = 0;
inode-i_atime = inode-i_mtime = inode-i_ctime =
current_fs_time(inode-i_sb);
+   inode-i_version = 1;
}
return inode;
 }
Index: linux-2.6.22-rc2-ext4-1/fs/libfs.c
===
--- linux-2.6.22-rc2-ext4-1.orig/fs/libfs.c 2007-05-25 18:01:51.0 
+0200
+++ linux-2.6.22-rc2-ext4-1/fs/libfs.c  2007-05-25 18:01:56.0 +0200
@@ -232,6 +232,7 @@
root-i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
root-i_uid = root-i_gid = 0;
root-i_atime = root-i_mtime = root-i_ctime = CURRENT_TIME;
+   root-i_version = 1;
dentry = d_alloc(NULL, d_name);
if (!dentry) {
iput(root);
@@ -255,6 +256,8 @@
struct inode *inode = old_dentry-d_inode;
 
inode-i_ctime = dir-i_ctime = dir-i_mtime = CURRENT_TIME;
+   inode-i_version++;
+   dir-i_version++;
inc_nlink(inode);
atomic_inc(inode-i_count);
dget(dentry);
@@ -287,6 +290,8 @@
struct inode *inode = dentry-d_inode;
 
inode-i_ctime = dir-i_ctime = dir-i_mtime = CURRENT_TIME;
+   inode-i_version++;
+   dir-i_version++;
drop_nlink(inode);
dput(dentry);
return 0;
@@ -323,6 +328,8 @@
 
old_dir-i_ctime = old_dir-i_mtime = new_dir-i_ctime =
new_dir-i_mtime = inode-i_ctime = CURRENT_TIME;
+   old_dir-i_version++;
+   new_dir-i_version++;
 
return 0;
 }
@@ -399,6 +406,7 @@
inode-i_uid = inode-i_gid = 0;
inode-i_blocks = 0;
inode-i_atime = inode-i_mtime = inode-i_ctime = CURRENT_TIME;
+   inode-i_version = 1;
inode-i_op = simple_dir_inode_operations;
inode-i_fop = simple_dir_operations;
inode-i_nlink = 2;
@@ -427,6 +435,7 @@
inode-i_uid = inode-i_gid = 0;
inode-i_blocks = 0;
inode-i_atime = inode-i_mtime = inode-i_ctime = CURRENT_TIME;
+   inode-i_version = 1;
inode-i_fop = files-ops;
inode-i_ino = i;
d_add(dentry, inode);
Index: linux-2.6.22-rc2-ext4-1/fs/pipe.c
===
--- linux-2.6.22-rc2-ext4-1.orig/fs/pipe.c  2007-05-25 18:01:51.0 
+0200
+++ linux-2.6.22-rc2-ext4-1/fs/pipe.c   2007-05-25 18:01:56.0 +0200
@@ -882,6 +882,7 @@
inode-i_uid = current-fsuid;
inode-i_gid = current-fsgid;
inode-i_atime = inode-i_mtime = inode-i_ctime = CURRENT_TIME;
+   inode-i_version = 1;
 
return inode;
 
Index: linux-2.6.22-rc2-ext4-1/include/linux/fs.h
===
--- linux-2.6.22-rc2-ext4-1.orig/include/linux/fs.h 2007-05-25 
18:01:51.0 +0200
+++ linux-2.6.22-rc2-ext4-1/include/linux/fs.h  2007-05-25 18:01:56.0 
+0200
@@ -549,7 +549,7 @@
uid_t   i_uid;
gid_t   i_gid;
dev_t   i_rdev;
-   unsigned long   i_version;
+   uint64_t        i_version;
loff_t  i_size;
 #ifdef __NEED_I_SIZE_ORDERED
seqcount_t  i_size_seqcount;


[patch 2/2] i_version update - ext4 part

2007-05-25 Thread Jean noel Cordenner

The patch is on top of the ext4 tree:
http://repo.or.cz/w/ext4-patch-queue.git

In this part, the i_version counter is stored into 2 32bit fields of
the ext4_inode structure osd1.linux1.l_i_version and i_version_hi.

I included the ext4_expand_inode_extra_isize patch, which does part of 
the job, checking if there is enough room for extra fields in the inode 
(i_version_hi). The other patch increments the counter on inode 
modifications and set it on inode creation.
This patch is on top of the nanosecond timestamp and i_version_hi
patches. 

This patch adds 64-bit inode version support to ext4. The lower 32 bits
are stored in the osd1.linux1.l_i_version field while the high 32 bits
are stored in the i_version_hi field newly created in the ext4_inode.

We need to make sure that existing filesystems can also make use of the new
fields that have been added to the inode. We use s_want_extra_isize and
s_min_extra_isize to decide by how much we should expand the inode. If the
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set then we expand by
max(s_want_extra_isize, s_min_extra_isize, sizeof(ext4_inode) -
EXT4_GOOD_OLD_INODE_SIZE) bytes. Actually it is still an open question
whether users should be able to set s_*_extra_isize smaller than
the known fields or not.

This patch also adds the functionality to expand inodes to include the
newly added fields. We start by trying to expand by s_want_extra_isize
bytes and if that fails we try to expand by s_min_extra_isize bytes. This
is done by changing the i_extra_isize if enough space is available in
the inode and no EAs are present. If EAs are present and there is enough
space in the inode then the EAs in the inode are shifted to make space.
If enough space is not available in the inode due to the EAs then 1 or
more EAs are shifted to the external EA block. In the worst case when
even the external EA block does not have enough space we inform the user
that some EA would need to be deleted or s_min_extra_isize would have to
be reduced.

This would be online expansion of inodes. I am also working on adding an
expand_inodes option to e2fsck which will expand all the used inodes.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]

Index: linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c
===
--- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/inode.c	2007-05-25 17:12:37.0 +0200
+++ linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c	2007-05-25 17:12:41.0 +0200
@@ -2709,6 +2709,13 @@
 	EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode);
 	EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
 
+	ei->i_fs_version = le32_to_cpu(raw_inode->i_disk_version);
+	if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
+		if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
+			ei->i_fs_version |=
+			(__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
+	}
+
 	if (S_ISREG(inode-i_mode)) {
 		inode-i_op = ext4_file_inode_operations;
 		inode-i_fop = ext4_file_operations;
@@ -2852,8 +2859,14 @@
 	} else for (block = 0; block  EXT4_N_BLOCKS; block++)
 		raw_inode-i_block[block] = ei-i_data[block];
 
-	if (ei-i_extra_isize)
+	raw_inode->i_disk_version = cpu_to_le32(ei->i_fs_version);
+	if (ei->i_extra_isize) {
+		if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) {
+			raw_inode->i_version_hi =
+				cpu_to_le32(ei->i_fs_version >> 32);
+		}
 		raw_inode-i_extra_isize = cpu_to_le16(ei-i_extra_isize);
+	}
 
 	BUFFER_TRACE(bh, call ext4_journal_dirty_metadata);
 	rc = ext4_journal_dirty_metadata(handle, bh);
@@ -3127,10 +3140,32 @@
 int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode)
 {
 	struct ext4_iloc iloc;
-	int err;
+	int err, ret;
+	static int expand_message;
 
 	might_sleep();
 	err = ext4_reserve_inode_write(handle, inode, iloc);
+	if (EXT4_I(inode)->i_extra_isize <
+	    EXT4_SB(inode->i_sb)->s_want_extra_isize &&
+	    !(EXT4_I(inode)->i_state & EXT4_STATE_NO_EXPAND)) {
+		/* We need extra buffer credits since we may write into EA block
+		 * with this same handle */
+		if ((jbd2_journal_extend(handle,
+			 EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) {
+			ret = ext4_expand_extra_isize(inode,
+  	EXT4_SB(inode-i_sb)-s_want_extra_isize,
+	iloc, handle);
+			if (ret) {
+EXT4_I(inode)->i_state |= EXT4_STATE_NO_EXPAND;
+if (!expand_message) {
+	ext4_warning(inode->i_sb, __FUNCTION__,
+	"Unable to expand inode %lu. Delete some"
+	" EAs or run e2fsck.", inode->i_ino);
+	expand_message = 1;
+}
+			}
+		}
+	}
 	if (!err)
 		err = ext4_mark_iloc_dirty(handle, inode, iloc);
 	return err;
Index: linux-2.6.22-rc2-ext4-1/include/linux/ext4_fs.h
===
--- linux-2.6.22-rc2-ext4-1.orig/include/linux/ext4_fs.h	2007-05-25 17:12:37.0 +0200
+++ linux-2.6.22-rc2-ext4-1/include/linux/ext4_fs.h	2007-05-25 17:12:41.0 +0200
@@ -202,6 +202,7 @@
 #define EXT4_STATE_JDATA		

Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-05-25 Thread Casey Schaufler

--- Jeremy Maitin-Shepard [EMAIL PROTECTED] wrote:

 Casey Schaufler [EMAIL PROTECTED] writes:
 
  On Fedora zcat, gzip and gunzip are all links to the same file.
  I can imagine (although it is a bit of a stretch) allowing a set
  of users access to gunzip but not gzip (or the other way around).
  There are probably more sophisticated programs that have different
  behavior based on the name they're invoked by that would provide
  a more compelling arguement, assuming of course that you buy into
  the behavior-based-on-name scheme. What I think I'm suggesting is
  that AppArmor might be useful in addressing the fact that a file
  with multiple hard links is necessarily constrained to have the
  same access control on each of those names. That assumes one
  believes that such behavior is flawwed, and I'm not going to try
  to argue that. The question was about an example, and there is one.
 
 This doesn't work.  The behavior depends on argv[0], which is not
 necessarily the same as the name of the file.

Sorry, but I don't understand your objection. If AppArmor is configured
to allow everyone access to /bin/gzip but only some people access to
/bin/gunzip and (important detail) the single binary uses argv[0]
as documented and (another important detail) there aren't other links
named gunzip to the binary (ok, that's lots of if's) you should be fine.
I suppose you could make a shell that lies to exec, but the AppArmor
code could certainly check for that in exec by enforcing the argv[0]
convention. It would be perfectly reasonable for a system that is so
dependent on pathnames to require that.


Casey Schaufler
[EMAIL PROTECTED]


Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-05-25 Thread Jeremy Maitin-Shepard
Casey Schaufler [EMAIL PROTECTED] writes:

 --- Jeremy Maitin-Shepard [EMAIL PROTECTED] wrote:
 Casey Schaufler [EMAIL PROTECTED] writes:
 
  On Fedora zcat, gzip and gunzip are all links to the same file.
  I can imagine (although it is a bit of a stretch) allowing a set
  of users access to gunzip but not gzip (or the other way around).
  There are probably more sophisticated programs that have different
  behavior based on the name they're invoked by that would provide
  a more compelling arguement, assuming of course that you buy into
  the behavior-based-on-name scheme. What I think I'm suggesting is
  that AppArmor might be useful in addressing the fact that a file
  with multiple hard links is necessarily constrained to have the
  same access control on each of those names. That assumes one
  believes that such behavior is flawwed, and I'm not going to try
  to argue that. The question was about an example, and there is one.
 
 This doesn't work.  The behavior depends on argv[0], which is not
 necessarily the same as the name of the file.

 Sorry, but I don't understand your objection. If AppArmor is configured
 to allow everyone access to /bin/gzip but only some people access to
 /bin/gunzip and (important detail) the single binary uses argv[0]
 as documented and (another important detail) there aren't other links
 named gunzip to the binary (ok, that's lots of if's) you should be fine.
 I suppose you could make a shell that lies to exec, but the AppArmor
 code could certainly check for that in exec by enforcing the argv[0]
 convention. It would be perfectly reasonable for a system that is so
 dependent on pathnames to require that.

Well, my point was exactly that App Armor doesn't (as far as I know) do
anything to enforce the argv[0] convention, nor would it in general
prevent a confined program from making a symlink or hard link.  Even
disregarding that, it seems very fragile in general to make an suid
program (there would be no point in confining the execution of a
non-suid program) perform essentially access control based on argv[0].

-- 
Jeremy Maitin-Shepard


Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-05-25 Thread Casey Schaufler

--- Jeremy Maitin-Shepard [EMAIL PROTECTED] wrote:

 ...
 Well, my point was exactly that App Armor doesn't (as far as I know) do
 anything to enforce the argv[0] convention,

Sounds like an opportunity for improvement then.

 nor would it in general
 prevent a confined program from making a symlink or hard link.  Even
 disregarding that, it seems very fragile in general to make an suid
 program (there would be no point in confining the execution of a
 non-suid program) perform essentially access control based on argv[0].

I think that you're being generous calling it fragile, but that's
my view, and I've seen much worse. I agree that it would be a Bad Idea,
but the fact that I think it's a bad idea is not going to prevent very
many people from trying it, and for those that do try it name based
access control might seem like just the ticket to complete their
nefarious schemes. Remember that security is a subjective thing, and
using argv[0] and AppArmor together might make some people feel better.



Casey Schaufler
[EMAIL PROTECTED]


Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-05-25 Thread Andreas Gruenbacher
On Friday 25 May 2007 19:43, Casey Schaufler wrote:
 [...] but the AppArmor code could certainly check for that in exec by
 enforcing the argv[0] convention. It would be perfectly reasonable for a
 system that is so dependent on pathnames to require that.

Hmm ... that's a strange idea. AppArmor cannot assume anything about argv[0], 
and it would be a really bad idea to change the well-established semantics of 
argv[0].

There is no actual need for looking at argv[0], though: AppArmor decides based 
on the actual pathname of the executable...

Thanks,
Andreas


Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-05-25 Thread Casey Schaufler

--- Andreas Gruenbacher [EMAIL PROTECTED] wrote:

 On Friday 25 May 2007 19:43, Casey Schaufler wrote:
  [...] but the AppArmor code could certainly check for that in exec by
  enforcing the argv[0] convention. It would be perfectly reasonable for a
  system that is so dependent on pathnames to require that.
 
 Hmm ... that's a strange idea.

Yeah, I get that a lot.

 AppArmor cannot assume anything about argv[0],
 
 and it would be a really bad idea to change the well-established semantics of
 
 argv[0].
 
 There is no actual need for looking at argv[0], though: AppArmor decides
 based 
 on the actual pathname of the executable...

Right. My point was that if you wanted to use the gzip/gunzip
example of a file with two names being treated differently based
on the name accessed as an argument for AppArmor you could. If
you don't want to, that's ok too. Jeremy raised a reasonable objection,
and AppArmor could address it if y'all chose to do so. I seriously
doubt that enforcing the argv[0] convention would break much, and I
also expect that if it did there's a Consultant's Retirement to be
made fixing the security hole it points out.


Casey Schaufler
[EMAIL PROTECTED]


Re: [patch 00/41] Buffered write deadlock fix and new aops for 2.6.22-rc2-mm1

2007-05-25 Thread Mark Fasheh
On Fri, May 25, 2007 at 10:21:44PM +1000, [EMAIL PROTECTED] wrote:
 Still unfortunately missing the OCFS2 and GFS2 conversions, which
 allowed us to remove a lot of code -- I won't ask the maintainers to
 redo them either until the patchset gets somewhere.

Nonetheless, I'll give this a go and try to give you an ocfs2 patch sometime
next week. It should be much easier this time anyway. It turns out that the
write_begin()/write_end() style interface works very well for ocfs2
internally, so I went back and merged most of that work into ocfs2.git for
some of my shared writeable mmap work anyway.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Andreas Dilger
On May 25, 2007  17:58 +1000, Neil Brown wrote:
These devices would find it very hard to support BIO_RW_BARRIER.
Doing this would require keeping track of all in-flight requests
(which some, possibly all, of the above don't) and then:
  When a BIO_RW_BARRIER request arrives:
 wait for all pending writes to complete
 call blkdev_issue_flush on all devices
 issue the barrier write to the target device(s)
as BIO_RW_BARRIER,
 if that is -EOPNOTSUP, re-issue, wait, flush.

We noticed when testing the SLES10 kernel (which has barriers enabled
by default) that ext3 write throughput went from about 170MB/s to about
130MB/s (on high-end RAID storage using no-op scheduler).

The reason (as far as we could tell) is that the barriers are implemented
by flushing and waiting for all previously submitted IOs to finish, but
all that ext3/jbd really care about is that the journal blocks are safely
on disk.

Since the journal blocks are only a small fraction of the total IO in
flight, the barrier + write cache ends up being a lot worse than just
doing synchronous IO with the write cache disabled because no new IO can
be submitted past the barrier, and since that IO is large and contiguous
it might complete much faster than the scattered metadata updates that are
also being checkpointed to disk from the previous transactions.  With jbd
there can be both a running and a committing transaction, and multiple
checkpointing transactions, and the use of barriers breaks this important
optimization.

If ext3 used an external journal this problem would be avoided,
but then there isn't really a need for barriers in the first place, since
the jbd code already will handle the wait for the commit block itself.

We've got a pretty-much complete version of the ext3 journal checksumming
patch that avoids the need to do the pre-commit barrier, since the checksum
can verify at recovery time whether all of the transaction's blocks made
it to disk or not (which is what the commit block is all about in the end).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-05-25 Thread Tetsuo Handa
Hello.

Casey Schaufler wrote:
 Sorry, but I don't understand your objection. If AppArmor is configured
 to allow everyone access to /bin/gzip but only some people access to
 /bin/gunzip and (important detail) the single binary uses argv[0]
 as documented and (another important detail) there aren't other links
 named gunzip to the binary (ok, that's lots of if's) you should be fine.

The argv[0] defines the default behavior of hard linked or symbolic linked 
programs,
but the behavior can be overridden using commandline options.
If you want to allow access to /bin/gzip but deny access to /bin/gunzip ,
you also need to deny access to "/bin/gzip -d", "/bin/gzip --decompress"
and "/bin/gzip --uncompress".
It is impossible to do so because options to override the default behavior
depends on the program's design and you can't know
what programs and what options are there in the system.
Even if you know all programs and all options in the system,
it is too tough a job to find and reject, in kernel space, the options
that override the default behavior. The sketch below illustrates the two routes.
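
A tiny demo of the point (the /bin/gzip path and the file name are illustrative,
and this is not AppArmor code): the same binary can be reached either with a
forged argv[0] or, just as easily, with the option that selects the behaviour.

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	const char *file = argc > 1 ? argv[1] : "somefile.gz";

	/* Route 1: run /bin/gzip but hand it "gunzip" as argv[0]. */
	char *fake_name[] = { "gunzip", (char *)file, NULL };
	execv("/bin/gzip", fake_name);

	/* Only reached if the exec failed.  Route 2 would be one line:
	 * execl("/bin/gzip", "gzip", "-d", file, (char *)NULL);
	 */
	perror("execv /bin/gzip");
	return 1;
}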

  Well, my point was exactly that App Armor doesn't (as far as I know) do
  anything to enforce the argv[0] convention,
 Sounds like an opportunity for improvement then.

There are (I think) three types of program invocation.

(1) Invocation of hard linked programs.

/bin/gzip and /bin/gunzip and /bin/zcat are hard links.

There is no problem because you can know which pathname was requested
using d_namespace_path() with struct linux_binprm->file.

(2) Invocation of symbolic linked programs.

/sbin/pidof is a symbolic link to /sbin/killall .

There is a problem because you can't know which pathname was requested
using d_namespace_path() with struct linux_binprm->file
because the symbolic links were already dereferenced inside open_exec().
To know which pathname was requested, you need to lookup
using struct linux_binprm->filename without LOOKUP_FOLLOW
and then use d_namespace_path().
There is a race condition in that the pathname
the symbolic link struct linux_binprm->filename points to
may change, but it is inevitable because you can't get
the dentry and vfsmount both without the LOOKUP_FOLLOW flag and
with the LOOKUP_FOLLOW flag at the same time.

(3) Invocation of dynamically created programs with random names.

/usr/sbin/logrotate creates files patterned /tmp/logrotate.??
and executes these dynamically created files.

To keep execution of these dynamically created files under control,
you need to aggregate pathnames of these files.
AppArmor can't define profile if the pathname of programs is random, can it?

Usually argv[0] and struct linux_binprm->filename are the same,
but if you want to do something with argv[0], you will need to handle
case (2) to see whether argv[0] and struct linux_binprm->filename
really are the same.

Thanks.


Re: [PATCH] AFS: Implement file locking

2007-05-25 Thread J. Bruce Fields
On Thu, May 24, 2007 at 05:55:54PM +0100, David Howells wrote:
 +/*
 + * initialise the lock manager thread if it isn't already running
 + */
 +static int afs_init_lock_manager(void)
 +{
 + if (!afs_lock_manager) {
 + afs_lock_manager = create_singlethread_workqueue("kafs_lockd");
 + if (!afs_lock_manager)
 + return -ENOMEM;
 + }
 + return 0;

Doesn't this need some locking?
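
For illustration only, the usual shape of a serialised lazy init, sketched with
pthreads and stand-in helpers rather than the real AFS/workqueue code; whether
taking a mutex on this path is acceptable is of course the author's call.

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t manager_lock = PTHREAD_MUTEX_INITIALIZER;
static void *lock_manager;	/* stand-in for the workqueue pointer */

static void *create_manager(void)
{
	return malloc(1);	/* stand-in for create_singlethread_workqueue() */
}

static int init_lock_manager(void)
{
	int ret = 0;

	pthread_mutex_lock(&manager_lock);
	if (!lock_manager) {		/* re-check under the lock */
		lock_manager = create_manager();
		if (!lock_manager)
			ret = -1;	/* -ENOMEM in the kernel version */
	}
	pthread_mutex_unlock(&manager_lock);
	return ret;
}

int main(void)
{
	return init_lock_manager();
}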

 +/*
 + * request a lock on a file on the server
 + */
 +static int afs_do_setlk(struct file *file, struct file_lock *fl)
 +{
 + struct afs_vnode *vnode = AFS_FS_I(file->f_mapping->host);
 + afs_lock_type_t type;
 + struct key *key = file->private_data;
 + int ret;
 +
 + _enter("{%x:%u},%u", vnode->fid.vid, vnode->fid.vnode, fl->fl_type);
 +
 + /* only whole-file locks are supported */
 + if (fl->fl_start != 0 || fl->fl_end != OFFSET_MAX)
 + return -EINVAL;

Do you allow upgrades and downgrades?  (Just curious.)

 + /* if we've already got a readlock on the server and no waiting
 +  * writelocks, then we might be able to instantly grant another

Is that comment correct?  (You don't really test for waiting
writelocks, do you?)

 +  * readlock */
 + if (type == AFS_LOCK_READ &&
 + vnode->flags & (1 << AFS_VNODE_READLOCKED)) {
 + _debug("instant readlock");
 + ASSERTCMP(vnode->flags &
 +   ((1 << AFS_VNODE_LOCKING) |
 +(1 << AFS_VNODE_WRITELOCKED)), ==, 0);
 + ASSERT(!list_empty(&vnode->granted_locks));
 + goto sharing_existing_lock;
 + }
 + }

--b.


Re: [PATCH] AFS: Implement file locking

2007-05-25 Thread Kyle Moffett

On May 25, 2007, at 22:23:42, J. Bruce Fields wrote:

On Thu, May 24, 2007 at 05:55:54PM +0100, David Howells wrote:

+   /* only whole-file locks are supported */
+   if (fl-fl_start != 0 || fl-fl_end != OFFSET_MAX)
+   return -EINVAL;


Do you allow upgrades and downgrades?  (Just curious.)


I was actually under the impression that OpenAFS had support for byte- 
range locking (as well as lock upgrade/downgrade); though IIRC there  
was some secondary protocol.  That's probably why the support is so  
basic at the moment; David's getting the basics in first and the more  
complicated stuff can come later.


Cheers,
Kyle Moffett



Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-05-25 Thread Kyle Moffett

On May 24, 2007, at 14:58:41, Casey Schaufler wrote:
On Fedora zcat, gzip and gunzip are all links to the same file.  I  
can imagine (although it is a bit of a stretch) allowing a set of  
users access to gunzip but not gzip (or the other way around).


That is a COMPLETE straw-man argument.  I can override your check  
with this absolutely trivial perl code:


exec { "/usr/bin/gunzip" } "gzip", "-9", "some/file/to.gz";

Pathname-based checks are pretty fundamentally insecure.  If you want  
to protect a name, then you should tag the name with security  
attributes (IE: AppArmor).  On the other hand, if you actually want  
to protect the _data_, then tagging the _name_ is flawed; tag the  
*DATA* instead.


Cheers,
Kyle Moffett



Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-05-25 Thread Crispin Cowan
Casey Schaufler wrote:
 --- Andreas Gruenbacher [EMAIL PROTECTED] wrote:
   
 AppArmor cannot assume anything about argv[0],

 and it would be a really bad idea to change the well-established semantics of

 argv[0].

 There is no actual need for looking at argv[0], though: AppArmor decides
 based 
 on the actual pathname of the executable...
 
 Right. My point was that if you wanted to use the gzip/gunzip
 example of a file with two names being treated differently based
 on the name accessed as an argument for AppArmor you could.
AppArmor detects the pathname of the file exec'd at the time the parent
exec's it, and not anything inside the child involving argv[0].

As such, AA can detect whether you did exec("gzip") or exec("gunzip")
and apply the policy relevant to the program. It could apply different
policies to each of them, so whether it has access to /tmp/mumble/barf
depends on whether you called it 'gzip' or 'gunzip'. Caveat: it makes no
sense to profile either gzip or gunzip in the AppArmor model, so I won't
defend what kind of policy you would put on them.

Finally, AA doesn't care what the contents of the executable are. We
assume that it is a copy of metasploit or something, and confine it to
access only the resources that the policy says.

  If
 you don't want to, that's ok too. Jeremy raised a reasonable objection,
 and AppArmor could address it if y'all chose to do so. I seriously
 doubt that enforcing the argv[0] convention would break much, and I
 also expect that if it did there's a Consultant's Retirement to be
 made fixing the security hole it points out.
   
AppArmor does address it, and I hope this explains how we detect which
of multiple hard links to a file you used to access the file without
mucking about with argv[0].

Crispin

-- 
Crispin Cowan, Ph.D.   http://crispincowan.com/~crispin/
Director of Software Engineering   http://novell.com
   Security: It's not linear
