date:20070423

Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Karuna sagar K


Hi,

The tool estimates the number of cross-chunk references in an ext2/3 file
system. It treats each block group as one chunk and calculates how many
block groups each file spans. So, the block group size gives
the estimate of chunk size.

The file systems were aged for about 3-4 months on a developer's laptop.

Should have given the background before. Below is the explanation of
the tool. Valh and others came up with this idea.

-
Chunkfs will only work if we have few cross-chunk references.  We
can estimate the effect of chunk size on the number of these
references using an existing ext2/3 file system and treating the block
groups as though they are chunks.  The basic idea is that we figure
out what the block group boundaries are and then find out which files
and directories span two or more block groups.

Step 1:
---

Get a real-world ext2/3 file system. A file system which has been in
use is required. One from a laptop or a server of any sort will do
fine.

Step 2:
---

Figure out where the block group boundaries are on disk. Two things
need to be known:

1. Which inode numbers are in which block group?
2. Which blocks are in which block group?

At the end of this step we should have a list that looks something like:

Block group 1: Inodes 11-343, blocks 1000-2
Block group 2: Inodes 344-576, blocks 2-4
[...]
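
The Step 2 mapping follows from the ext2 on-disk layout: inode numbers
start at 1 and are assigned to groups in runs of s_inodes_per_group, and
blocks are counted from s_first_data_block in runs of s_blocks_per_group.
A minimal sketch, assuming those three superblock values have already been
read; the struct and function names are illustrative and not taken from the
tool. Note that ext2 numbers block groups from 0, while the example list
above starts at 1.

```c
#include <stdint.h>

/* Illustrative subset of the ext2 superblock needed for the mapping. */
struct fs_geometry {
    uint32_t inodes_per_group;  /* s_inodes_per_group */
    uint32_t blocks_per_group;  /* s_blocks_per_group */
    uint32_t first_data_block;  /* s_first_data_block */
};

/* Block group (0-based) holding inode `ino`; inode numbers start at 1. */
uint32_t inode_to_group(const struct fs_geometry *g, uint32_t ino)
{
    return (ino - 1) / g->inodes_per_group;
}

/* Block group (0-based) holding block `blk`. */
uint32_t block_to_group(const struct fs_geometry *g, uint32_t blk)
{
    return (blk - g->first_data_block) / g->blocks_per_group;
}
```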

Step 3:
---

For each file, get the inode number and use the mapping from step 2 to
figure out which block group it is in.  Now use bmap() on each block
in the file, and find out the block number.  Use the mapping from step 2
to figure out which block groups it has data in. For each file, record
the list of all block groups.

For each directory, get the inode number and map that to a block
group. Then get the inode numbers of all entries in the directory
(ignore symlinks) and map them to a block group.  For each directory,
record the list of all block groups.

Step 4:
---

Count the number of cross-chunk references this file system would
need.  This is done by going through each directory and file, and
adding up the number of block groups it uses MINUS one.  So if a file
was in block groups 3, 7, and 24, then you would add 2 to the total
number of cross-chunk references.  If a file was only in block group
2, then you would add 0 to the total.
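
The counting in Steps 3 and 4 can be sketched as follows. This is only an
illustration under stated assumptions: the list of block groups touched by
a file or directory is assumed to have been collected already, and
MAX_GROUPS, distinct_groups() and cross_refs() are hypothetical names, not
part of the attached tool.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_GROUPS 4096  /* assumed upper bound for this sketch */

/* Count distinct block groups in groups[0..n); entries >= MAX_GROUPS are ignored. */
size_t distinct_groups(const uint32_t *groups, size_t n)
{
    uint8_t seen[MAX_GROUPS / 8] = {0};  /* one bit per block group */
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t g = groups[i];
        if (g >= MAX_GROUPS)
            continue;
        if (!(seen[g / 8] & (1u << (g % 8)))) {
            seen[g / 8] |= (uint8_t)(1u << (g % 8));
            count++;
        }
    }
    return count;
}

/* Cross-chunk references contributed by one file or directory:
 * the number of block groups it uses, minus one. */
size_t cross_refs(const uint32_t *groups, size_t n)
{
    size_t d = distinct_groups(groups, n);
    return d ? d - 1 : 0;
}
```

Summing cross_refs() over every file and directory gives the total for the
file system.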


On 4/22/07, Amit Gud [EMAIL PROTECTED] wrote:

Karuna sagar K wrote:
 Hi,

 The attached code contains program to estimate the cross-chunk
 references for ChunkFS file system (idea from Valh). Below are the
 results:


Nice to see some numbers! But would be really nice to know:

- what the chunk size is
- how the files were created or, more vaguely, how 'aged' the fs is
- what is the chunk allocation algorithm


Best,
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud





Thanks,
Karuna
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Testing framework

2007-04-23 Thread Kalpak Shah

On Mon, 2007-04-23 at 02:16 +0530, Karuna sagar K wrote:
 Hi,
 
 For some time I had been working on this file system test framework.
 Now I have an implementation for the same and below is the explanation.
 Any comments are welcome.
 
 Introduction:
 The testing tools and benchmarks available around do not take into
 account the repair and recovery aspects of file systems. The test
 framework described here focuses on repair and recovery capabilities
 of file systems. Since most file systems use 'fsck' to recover from
 file system inconsistencies, the test framework characterizes file
 systems based on outcomes of running 'fsck'.

snip

 Higher level perspective/approach:
 In this approach the file system is viewed as a tree of nodes, where
 nodes are either files or directories. The metadata information
 corresponding to some randomly chosen nodes of the tree are corrupted.
 Nodes which are corrupted are marked or recorded to be able to replay
 later. This file system is called source file system while the file
 system on which we need to replay the corruption is called target file
 system. The assumption is that the target file system contains a set
 of files and directories which is a superset of that in the source
 file system. Hence, to replay the corruption we need to point out which
 nodes were corrupted in the source file
 system and corrupt the corresponding nodes in the target file system.
 
 A major disadvantage with this approach is that on-disk structures
 (like superblocks, block group descriptors, etc.) are not considered
 for corruption.
 
 Lower level perspective/approach:
 The file system is looked upon as a set of blocks (more precisely
 metadata blocks). We randomly choose from this set of blocks to
 corrupt. Hence we would be able to overcome the deficiency of the
 previous approach. However this approach makes it difficult to have a
 replayable corruption. Further thought about this approach has to be
 given.
 

Fill a test filesystem with data and save it. Corrupt it by copying a
chunk of data from random locations A to B. Save positions A and B so
that you can reproduce the corruption. 

Or corrupt random bits (ideally in metadata blocks) and maintain the
list of the bit numbers for reproducing the corruption.
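
A minimal sketch of the bit-flipping variant, operating on an in-memory
copy of the saved image. The function names and the use of rand() are
assumptions made for illustration. Replaying the recorded list reproduces
the corruption only on a byte-identical copy of the saved image.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Flip bit number `bit` (bit 0 = least significant bit of byte 0). */
void flip_bit(uint8_t *img, size_t bit)
{
    img[bit / 8] ^= (uint8_t)(1u << (bit % 8));
}

/* Flip n pseudo-randomly chosen bits and record their positions in bit_log.
 * img_bytes must be non-zero. */
void corrupt_random(uint8_t *img, size_t img_bytes, unsigned seed,
                    size_t *bit_log, size_t n)
{
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        bit_log[i] = (size_t)rand() % (img_bytes * 8);
        flip_bit(img, bit_log[i]);
    }
}

/* Re-apply a recorded corruption to another copy of the same image. */
void replay(uint8_t *img, const size_t *bit_log, size_t n)
{
    for (size_t i = 0; i < n; i++)
        flip_bit(img, bit_log[i]);
}
```

Restricting the chosen positions to metadata blocks would need the block
lists from the file system itself, which this sketch does not attempt.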

 We could have a blend of both the approaches in the program to
 compromise between corruption and replayability.
 
 Repair Phase:
 The corrupted file system is repaired and recovered with 'fsck' or any
 other tools; this phase considers the repair and recovery action on
 the file system as a black box. The time taken to repair by the tool
 is measured.

I see that you are running fsck just once on the test filesystem. It
might be a good idea to run it twice and if second fsck does not find
the filesystem to be completely clean that means it is a bug in fsck.

snip

 Summary Phase:
 This is the final phase in the model. A report file is prepared which
 summarizes the result of this test run. The summary contains:
 
 Average time taken for recovery
 Number of files lost at the end of each iteration
 Number of files with metadata corruption at the end of each iteration
 Number of files with data corruption at the end of each iteration
 Number of files lost and found at the end of each iteration
 
 Putting it all together:
 The Corruption, Repair and Comparison phases could be repeated a
 number of times (each repetition is called an iteration) before the
 summary of that test run is prepared.
 
 TODO:
 Account for files in the lost+found directory during the comparison phase.
 Support for other file systems (only ext2 is supported currently)
 State of either file system is stored, which may be huge, time
 consuming and not necessary. So, we could have better ways of storing
 the state.

Also, people may want to test with different mount options, so something
like mount -t $fstype -o loop,$MOUNT_OPTIONS $imgname $mountpt may be
useful. Similarly it may also be useful to have MKFS_OPTIONS while
formatting the filesystem.

Thanks,
Kalpak.

 
 Comments are welcome!!
 
 Thanks,
 Karuna


Re: Testing framework

2007-04-23 Thread Karuna sagar K


On 4/23/07, Kalpak Shah [EMAIL PROTECTED] wrote:

On Mon, 2007-04-23 at 02:16 +0530, Karuna sagar K wrote:


.

The file system is looked upon as a set of blocks (more precisely
metadata blocks). We randomly choose from this set of blocks to
corrupt. Hence we would be able to overcome the deficiency of the
previous approach. However this approach makes it difficult to have a
replayable corruption. Further thought about this approach has to be
given.


Fill a test filesystem with data and save it. Corrupt it by copying a
chunk of data from random locations A to B. Save positions A and B so
that you can reproduce the corruption.



Hey, that's a nice idea :). But this wouldn't reproduce the same
corruption, right? Because, say, on the first run of the tool there is
metadata stored at locations A and B, and then on the second run there
may be user data present. I mean the allocation may be different.


Or corrupt random bits (ideally in metadata blocks) and maintain the
list of the bit numbers for reproducing the corruption.



.

The corrupted file system is repaired and recovered with 'fsck' or any
other tools; this phase considers the repair and recovery action on
the file system as a black box. The time taken to repair by the tool
is measured


I see that you are running fsck just once on the test filesystem. It
might be a good idea to run it twice and if second fsck does not find
the filesystem to be completely clean that means it is a bug in fsck.


You are right. Will modify that.



snip



..

State of either file system is stored, which may be huge, time
consuming and not necessary. So, we could have better ways of storing
the state.


Also, people may want to test with different mount options, so something
like mount -t $fstype -o loop,$MOUNT_OPTIONS $imgname $mountpt may be
useful. Similarly it may also be useful to have MKFS_OPTIONS while
formatting the filesystem.



Right. I didn't think of that. Will look into it.


Thanks,
Kalpak.


 Comments are welcome!!

 Thanks,
 Karuna




Thanks,
Karuna

[RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Amit Gud



This is an initial implementation of ChunkFS technique, briefly discussed
at: http://lwn.net/Articles/190222 and 
http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf


This implementation is done within ext2 driver. Every chunk is an 
independent ext2 file system. The knowledge about chunks is kept within 
ext2 and 'continuation inodes', which are used to allow files and 
directories to span multiple chunks, are managed within ext2.


At mount time, super blocks for all the chunks are created and linked with 
the global super_blocks list maintained by VFS. This allows independent 
behavior of individual chunks and also helps writebacks to happen 
seamlessly.


Apart from this, chunkfs code in ext2 effectively only provides knowledge of:

- which inode, and which block number within it, to look up for a given 
logical block number of a file

- in which chunk to allocate next inode / block
- number of inodes to scan when a directory is being read

To maintain ext2's inode number uniqueness property, the 8 most significant 
bits of the inode number are used to indicate the chunk in which it resides.
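
As an illustration of that encoding, here is a minimal sketch that assumes
32-bit inode numbers, so that 8 bits select the chunk and the remaining 24
bits hold the inode number within the chunk. The macro and function names
are hypothetical and are not taken from the patch.

```c
#include <stdint.h>

#define CHUNK_BITS 8u                          /* top bits hold the chunk number */
#define LOCAL_BITS (32u - CHUNK_BITS)          /* low bits hold the in-chunk inode */
#define LOCAL_MASK ((1u << LOCAL_BITS) - 1u)

/* Combine a chunk number (0..255) and an in-chunk inode number. */
uint32_t make_ino(uint32_t chunk, uint32_t local_ino)
{
    return (chunk << LOCAL_BITS) | (local_ino & LOCAL_MASK);
}

/* Chunk in which the inode resides. */
uint32_t ino_chunk(uint32_t ino)
{
    return ino >> LOCAL_BITS;
}

/* Inode number within its chunk. */
uint32_t ino_local(uint32_t ino)
{
    return ino & LOCAL_MASK;
}
```

Under these assumptions the scheme allows at most 256 chunks and 2^24
inodes per chunk.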


As said, this is a preliminary implementation and lots of changes are 
expected before this code is even sanely usable. Some known issues and 
obvious optimizations are listed in the TODO file in the chunkfs patch.


http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
- one big patch
- applies to 2.6.18

Attached - ext2-chunkfs-diff.patch.gz
- since the code is a spin-off of ext2, this patch explains better what
  has changed from the ext2.

git://cislinux.cis.ksu.edu/chunkfs-tools
- mkfs, and fsck for chunkfs.

http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
- config file used; tested mostly on UML with loopback file systems.

NOTE: No xattrs and xips yet; CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP 
should be set to 'n' for a clean compile.



Please comment, suggest, criticize. Patches most welcome.


Best,
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

ext2-chunkfs-diff.patch.gz
Description: Binary data

Re: 2.6.21-rc7 new aops patchset

2007-04-23 Thread Miklos Szeredi

Nick,

Thanks for converting fuse, and testing.  Here's a minor update to
fs-fuse-aops.patch.

Miklos


Convert fuse to new aops.

[mszeredi]
 - don't send zero length write requests
 - it is not legal for the filesystem to return with zero written bytes

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Index: linux/fs/fuse/file.c
===
--- linux.orig/fs/fuse/file.c   2007-04-23 12:04:10.0 +0200
+++ linux/fs/fuse/file.c   2007-04-23 13:56:48.0 +0200
@@ -443,22 +443,25 @@ static size_t fuse_send_write(struct fus
return outarg.size;
 }
 
-static int fuse_prepare_write(struct file *file, struct page *page,
- unsigned offset, unsigned to)
-{
-   /* No op */
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+   *pagep = __grab_cache_page(mapping, index);
+   if (!*pagep)
+   return -ENOMEM;
return 0;
 }
 
-static int fuse_commit_write(struct file *file, struct page *page,
-unsigned offset, unsigned to)
+static int fuse_buffered_write(struct file *file, struct inode *inode,
+  loff_t pos, unsigned count, struct page *page)
 {
int err;
size_t nres;
-   unsigned count = to - offset;
-   struct inode *inode = page->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
-   loff_t pos = page_offset(page) + offset;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
struct fuse_req *req;
 
if (is_bad_inode(inode))
@@ -474,20 +477,35 @@ static int fuse_commit_write(struct file
nres = fuse_send_write(req, file, inode, pos, count);
err = req->out.h.error;
fuse_put_request(fc, req);
-   if (!err && nres != count)
+   if (!err && !nres)
err = -EIO;
if (!err) {
-   pos += count;
+   pos += nres;
spin_lock(&fc->lock);
if (pos > inode->i_size)
i_size_write(inode, pos);
spin_unlock(&fc->lock);
 
-   if (offset == 0 && to == PAGE_CACHE_SIZE)
+   if (count == PAGE_CACHE_SIZE)
SetPageUptodate(page);
}
fuse_invalidate_attr(inode);
-   return err;
+   return err ? err : nres;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
+{
+   struct inode *inode = mapping->host;
+   int res = 0;
+
+   if (copied)
+   res = fuse_buffered_write(file, inode, pos, copied, page);
+
+   unlock_page(page);
+   page_cache_release(page);
+   return res;
 }
 
 static void fuse_release_user_pages(struct fuse_req *req, int write)
@@ -817,8 +835,8 @@ static const struct file_operations fuse
 
 static const struct address_space_operations fuse_file_aops  = {
.readpage   = fuse_readpage,
-   .prepare_write  = fuse_prepare_write,
-   .commit_write   = fuse_commit_write,
+   .write_begin= fuse_write_begin,
+   .write_end  = fuse_write_end,
.readpages  = fuse_readpages,
.set_page_dirty = fuse_set_page_dirty,
.bmap   = fuse_bmap,

Re: Testing framework

2007-04-23 Thread Avishay Traeger

On Mon, 2007-04-23 at 02:16 +0530, Karuna sagar K wrote:
 For some time I had been working on this file system test framework.
 Now I have an implementation for the same and below is the explanation.
 Any comments are welcome.

snip

You may want to check out the paper EXPLODE: A Lightweight, General
System for Finding Serious Storage System Errors from OSDI 2006 (if you
haven't already).  The idea sounds very similar to me, although I
haven't read all the details of your proposal.

Avishay


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Suparna Bhattacharya

On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote:
 
 This is an initial implementation of ChunkFS technique, briefly discussed
 at: http://lwn.net/Articles/190222 and 
 http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
 
 This implementation is done within ext2 driver. Every chunk is an 
 independent ext2 file system. The knowledge about chunks is kept within 
 ext2 and 'continuation inodes', which are used to allow files and 
 directories to span multiple chunks, are managed within ext2.
 
 At mount time, super blocks for all the chunks are created and linked with 
 the global super_blocks list maintained by VFS. This allows independent 
 behavior of individual chunks and also helps writebacks to happen 
 seamlessly.
 
 Apart from this, chunkfs code in ext2 effectively only provides knowledge 
 of:
 
 - what inode's which block number to look for, for a given file's logical 
 block number
 - in which chunk to allocate next inode / block
 - number of inodes to scan when a directory is being read
 
 To maintain the ext2's inode number uniqueness property, 8 msb bits of 
 inode number are used to indicate the chunk number in which it resides.
 
 As said, this is a preliminary implementation and lots of changes are 
 expected before this code is even sanely usable. Some known issues and 
 obvious optimizations are listed in the TODO file in the chunkfs patch.
 
 http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
 - one big patch
 - applies to 2.6.18


Could you send this out as a patch to ext2 codebase, so we can just look
at the changes for chunkfs ? That might also make it small enough
to inline your patch in email for review. 

What kind of results are you planning to gather to evaluate/optimize this ?

Regards
Suparna

 
 Attached - ext2-chunkfs-diff.patch.gz
 - since the code is a spin-off of ext2, this patch explains better what
   has changed from the ext2.
 
 git://cislinux.cis.ksu.edu/chunkfs-tools
 - mkfs, and fsck for chunkfs.
 
 http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
 - config file used; tested mostly on UML with loopback file systems.
 
 NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP 
 should be no for clean compile.
 
 
 Please comment, suggest, criticize. Patches most welcome.
 
 
 Best,
 AG
 --
 May the source be with you.
 http://www.cis.ksu.edu/~gud



-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Suparna Bhattacharya

On Mon, Apr 23, 2007 at 09:58:49PM +0530, Suparna Bhattacharya wrote:
 On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote:
  
  This is an initial implementation of ChunkFS technique, briefly discussed
  at: http://lwn.net/Articles/190222 and 
  http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
  
  This implementation is done within ext2 driver. Every chunk is an 
  independent ext2 file system. The knowledge about chunks is kept within 
  ext2 and 'continuation inodes', which are used to allow files and 
  directories to span multiple chunks, are managed within ext2.
  
  At mount time, super blocks for all the chunks are created and linked with 
  the global super_blocks list maintained by VFS. This allows independent 
  behavior of individual chunks and also helps writebacks to happen 
  seamlessly.
  
  Apart from this, chunkfs code in ext2 effectively only provides knowledge 
  of:
  
  - what inode's which block number to look for, for a given file's logical 
  block number
  - in which chunk to allocate next inode / block
  - number of inodes to scan when a directory is being read
  
  To maintain the ext2's inode number uniqueness property, 8 msb bits of 
  inode number are used to indicate the chunk number in which it resides.
  
  As said, this is a preliminary implementation and lots of changes are 
  expected before this code is even sanely usable. Some known issues and 
  obvious optimizations are listed in the TODO file in the chunkfs patch.
  
  http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
  - one big patch
  - applies to 2.6.18
 
 
 Could you send this out as a patch to ext2 codebase, so we can just look
 at the changes for chunkfs ? That might also make it small enough
 to inline your patch in email for review. 

Sorry, I missed the part about ext2-chunkfs-diff below.

Regards
suparna

 
 What kind of results are you planning to gather to evaluate/optimize this ?
 
 Regards
 Suparna
 
  
  Attached - ext2-chunkfs-diff.patch.gz
  - since the code is a spin-off of ext2, this patch explains better what
has changed from the ext2.
  
  git://cislinux.cis.ksu.edu/chunkfs-tools
  - mkfs, and fsck for chunkfs.
  
  http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
  - config file used; tested mostly on UML with loopback file systems.
  
  NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP 
  should be no for clean compile.
  
  
  Please comment, suggest, criticize. Patches most welcome.
  
  
  Best,
  AG
  --
  May the source be with you.
  http://www.cis.ksu.edu/~gud
 
 
 
 -- 
 Suparna Bhattacharya ([EMAIL PROTECTED])
 Linux Technology Center
 IBM Software Lab, India
 

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India


Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Andreas Dilger

On Apr 23, 2007  15:04 +0530, Kalpak Shah wrote:
 On Mon, 2007-04-23 at 12:49 +0530, Karuna sagar K wrote:
  The tool estimates the number of cross-chunk references in an ext2/3 file
  system. It treats each block group as one chunk and calculates how many
  block groups each file spans. So, the block group size gives
  the estimate of chunk size.
  
  The file systems were aged for about 3-4 months on a developer's laptop.
 
 With a blocksize of 4KB, a block group would be 128 MB. In the original
 Chunkfs paper, Valh had mentioned 1GB chunks and I believe it will be
 possible to use 2GB, 4GB or 8GB chunks in the future. As the chunk size
 increases the number of cross-chunk references will reduce and hence it
 might be a good idea to present these statistics considering different
 chunk sizes starting from 512MB up to 2GB.
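
The 128 MB figure follows from the default ext2/3 layout, where one bitmap
block tracks the blocks of a group, so a group has 8 * block_size blocks.
A rough sketch of that arithmetic, plus one possible way to emulate larger
chunks by treating several consecutive block groups as a single chunk. That
grouping is an assumption for illustration, not something the tool is known
to do, and the function names are hypothetical.

```c
#include <stdint.h>

/* Default block group size in bytes: 8 * block_size blocks of block_size bytes. */
uint64_t group_bytes(uint32_t block_size)
{
    return (uint64_t)8 * block_size * block_size;
}

/* Chunk index when groups_per_chunk consecutive block groups form one chunk. */
uint32_t group_to_chunk(uint32_t group, uint32_t groups_per_chunk)
{
    return group / groups_per_chunk;
}
```

With 4KB blocks, 4 groups per chunk would emulate 512MB chunks and 16
groups per chunk would emulate 2GB chunks.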

Also, given that cross-chunk references will be more expensive to fix, I
can imagine the allocation policy for chunkfs will try to avoid this if
possible, further reducing the number of cross-chunk inodes.  I guess it
should be more clear whether the cross-chunk references are due to inode
block references, or because of e.g. directories referencing inodes in
another chunk.

Also, is it considered a cross-chunk reference if a directory entry is
referencing an inode in another group?  Should there be a continuation
inode in the local group, or is the directory entry itself enough?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Amit Gud


Suparna Bhattacharya wrote:

Could you send this out as a patch to ext2 codebase, so we can just look
at the changes for chunkfs ? That might also make it small enough
to inline your patch in email for review. 


What kind of results are you planning to gather to evaluate/optimize this ?



Mainly I'm trying to gather the following:

- Graph of continuation inodes vs. the file system fragmentation (or 
aging) factor with varying configurations of chunk sizes


- Graph of wall clock time vs. disk size + data on the disk with both 
chunkfs and native ext2, and/or other file systems



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud


Re: Testing framework

2007-04-23 Thread Ric Wheeler


Avishay Traeger wrote:

On Mon, 2007-04-23 at 02:16 +0530, Karuna sagar K wrote:

For some time I had been working on this file system test framework.
Now I have an implementation for the same and below is the explanation.
Any comments are welcome.


snip

You may want to check out the paper EXPLODE: A Lightweight, General
System for Finding Serious Storage System Errors from OSDI 2006 (if you
haven't already).  The idea sounds very similar to me, although I
haven't read all the details of your proposal.

Avishay



It would also be interesting to use the disk error injection patches 
that Mark Lord sent out recently to introduce real sector level 
corruption.  When your file systems are large enough and old enough, 
getting bad sectors and IO errors during an fsck stresses things in 
interesting ways ;-)


ric

Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Theodore Tso

On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:
  With a blocksize of 4KB, a block group would be 128 MB. In the original
  Chunkfs paper, Valh had mentioned 1GB chunks and I believe it will be
  possible to use 2GB, 4GB or 8GB chunks in the future. As the chunk size
  increases the number of cross-chunk references will reduce and hence it
  might be a good idea to present these statistics considering different
  chunk sizes starting from 512MB up to 2GB.
 
 Also, given that cross-chunk references will be more expensive to fix, I
 can imagine the allocation policy for chunkfs will try to avoid this if
 possible, further reducing the number of cross-chunk inodes.  I guess it
 should be more clear whether the cross-chunk references are due to inode
 block references, or because of e.g. directories referencing inodes in
 another chunk.

It would also be good to distinguish between directories referencing
files in another chunk, and directories referencing subdirectories in
another chunk (which would be simpler to handle, given the topological
restrictions on directories, as compared to files and hard links).

There may also be special things we will need to do to handle
scenarios such as BackupPC, where if it looks like a directory
contains a huge number of hard links to a particular chunk, we'll need
to make sure that directory is either created in the right chunk
(possibly with hints from the application) or migrated to the right
chunk (but this might cause the inode number of the directory to
change --- maybe we allow this as long as the directory has never been
stat'ed, so that the inode number has never been observed).

The other thing which we should consider is that chunkfs really
requires a 64-bit inode number space, which means either we only allow
it on 64-bit systems, or we need to consider a migration so that even
on 32-bit platforms, stat() behaves like stat64(), insofar as it
uses a stat structure which returns a 64-bit ino_t.

- Ted

Re: 2.6.21-rc7 new aops patchset

2007-04-23 Thread Nick Piggin

On Mon, Apr 23, 2007 at 02:17:55PM +0200, Miklos Szeredi wrote:
 Nick,
 
 Thanks for converting fuse, and testing.  Here's a minor update to
 fs-fuse-aops.patch.
 
 Miklos
 
 
 Convert fuse to new aops.
 
 [mszeredi]
  - don't send zero length write requests
  - it is not legal for the filesystem to return with zero written bytes
 
 Signed-off-by: Nick Piggin [EMAIL PROTECTED]
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Thanks, applied.

Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Arjan van de Ven


 The other thing which we should consider is that chunkfs really
 requires a 64-bit inode number space, which means either we only allow

does it?
I'd think it needs a chunk space number and a 32 bit local inode
number ;) (same for blocks)
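
One reading of this suggestion, sketched under the assumption that an
inode is identified by a (chunk number, 32-bit local inode number) pair
which is combined into a 64-bit value only where a single number is needed.
The struct and function names are hypothetical and come from no posted
patch.

```c
#include <stdint.h>

/* Inode identity as a pair: which chunk, and which inode within it. */
struct chunk_ino {
    uint32_t chunk;  /* chunk the inode lives in */
    uint32_t local;  /* inode number within that chunk */
};

/* Combine the pair into one 64-bit number: chunk in the high 32 bits. */
uint64_t pack_ino(struct chunk_ino id)
{
    return ((uint64_t)id.chunk << 32) | id.local;
}

/* Split a 64-bit number back into the pair. */
struct chunk_ino unpack_ino(uint64_t ino)
{
    struct chunk_ino id = { (uint32_t)(ino >> 32), (uint32_t)ino };
    return id;
}
```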


Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Amit Gud


On Mon, 23 Apr 2007, Arjan van de Ven wrote:




The other thing which we should consider is that chunkfs really
requires a 64-bit inode number space, which means either we only allow


does it?
I'd think it needs a chunk space number and a 32 bit local inode
number ;) (same for blocks)



For inodes, yes: either a 64-bit inode number or some field for the id of 
the chunk in which the inode resides. But for block numbers you don't, 
because individual chunks manage their part of the whole file system 
independently. Their block bitmaps start at an offset. Inode bitmaps, 
however, remain the same.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Amit Gud


On Mon, 23 Apr 2007, Amit Gud wrote:


On Mon, 23 Apr 2007, Arjan van de Ven wrote:



  The other thing which we should consider is that chunkfs really
  requires a 64-bit inode number space, which means either we only allow

 does it?
 I'd think it needs a chunk space number and a 32 bit local inode
 number ;) (same for blocks)



For inodes, yes: either a 64-bit inode number or some field for the id of the 
chunk in which the inode resides. But for block numbers you don't, because 
individual chunks manage their part of the whole file system independently. 
Their block bitmaps start at an offset. Inode bitmaps, however, remain the same.




In that sense, we can also do without encoding the chunk identifier 
into the inode number, and chunkfs would still be fine with it. But we would 
then lose the inode uniqueness property, which could well be OK, as it is with 
other file systems in which the inode number is not sufficient for unique 
identification of an inode.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

[patch 00/44] Buffered write deadlock fix and new aops for 2.6.21-rc6-mm1

2007-04-23 Thread Nick Piggin

Hi, these patches are against 2.6.21-rc6-mm1. Aside from OCFS2, there
were no major clashes between -mm and mainline diffs, which is nice.

These patches aim to solve the long standing buffered write deadlocks,
and then go on to introduce a pair of new write a_op methods which
allow the deadlock to be solved without taking the performance hit of
the backwards compatible solutions using the old APIs.

Reiserfs (and Reiser4, in -mm) are the only filesystems left unconverted,
although there are a number of less common ones still untested.

Thanks,
Nick


[patch 05/44] mm: debug write deadlocks

2007-04-23 Thread Nick Piggin


Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
Makes the race much easier to hit.

This is useful for demonstration and testing purposes, but is removed in a
subsequent patch.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1984,6 +1984,7 @@ generic_file_buffered_write(struct kiocb
if (maxlen > bytes)
maxlen = bytes;
 
+#ifndef CONFIG_DEBUG_VM
/*
 * Bring in the user page that we will copy from _first_.
 * Otherwise there's a nasty deadlock on copying from the
@@ -1991,6 +1992,7 @@ generic_file_buffered_write(struct kiocb
 * up-to-date.
 */
fault_in_pages_readable(buf, maxlen);
+#endif
 
page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
if (!page) {

-- 


[patch 02/44] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6

2007-04-23 Thread Nick Piggin

From: Andrew Morton [EMAIL PROTECTED]

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |9 +
 mm/filemap.h |4 ++--
 2 files changed, 3 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2001,12 +2001,6 @@ generic_file_buffered_write(struct kiocb
break;
}
 
-   if (unlikely(bytes == 0)) {
-   status = 0;
-   copied = 0;
-   goto zero_length_segment;
-   }
-
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
loff_t isize = i_size_read(inode);
@@ -2036,8 +2030,7 @@ generic_file_buffered_write(struct kiocb
page_cache_release(page);
continue;
}
-zero_length_segment:
-   if (likely(copied >= 0)) {
+   if (likely(copied > 0)) {
if (!status)
status = copied;
 
Index: linux-2.6/mm/filemap.h
===
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
const struct iovec *iov = *iovp;
size_t base = *basep;
 
-   do {
+   while (bytes) {
int copy = min(bytes, iov->iov_len - base);
 
bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
iov++;
base = 0;
}
-   } while (bytes);
+   }
*iovp = iov;
*basep = base;
 }

-- 


[patch 01/44] mm: revert KERNEL_DS buffered write optimisation

2007-04-23 Thread Nick Piggin


Revert the patch from Neil Brown to optimise NFSD writev handling.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Cc: Neil Brown [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   32 +---
 1 file changed, 13 insertions(+), 19 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1980,27 +1980,21 @@ generic_file_buffered_write(struct kiocb
/* Limit the size of the copy to the caller's write size */
bytes = min(bytes, count);
 
-   /* We only need to worry about prefaulting when writes are from
-* user-space.  NFSd uses vfs_writev with several non-aligned
-* segments in the vector, and limiting to one segment a time is
-* a noticeable performance for re-write
+   /*
+* Limit the size of the copy to that of the current segment,
+* because fault_in_pages_readable() doesn't know how to walk
+* segments.
 */
-   if (!segment_eq(get_fs(), KERNEL_DS)) {
-   /*
-* Limit the size of the copy to that of the current
-* segment, because fault_in_pages_readable() doesn't
-* know how to walk segments.
-*/
-   bytes = min(bytes, cur_iov->iov_len - iov_base);
+   bytes = min(bytes, cur_iov->iov_len - iov_base);
+
+   /*
+* Bring in the user page that we will copy from _first_.
+* Otherwise there's a nasty deadlock on copying from the
+* same page as we're writing to, without it being marked
+* up-to-date.
+*/
+   fault_in_pages_readable(buf, bytes);
 
-   /*
-* Bring in the user page that we will copy from
-* _first_.  Otherwise there's a nasty deadlock on
-* copying from the same page as we're writing to,
-* without it being marked up-to-date.
-*/
-   fault_in_pages_readable(buf, bytes);
-   }
page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
if (!page) {
status = -ENOMEM;

-- 


[patch 04/44] mm: clean up buffered write code

2007-04-23 Thread Nick Piggin

From: Andrew Morton [EMAIL PROTECTED]

Rename some variables and fix some types.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   35 ++-
 1 file changed, 18 insertions(+), 17 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1944,16 +1944,15 @@ generic_file_buffered_write(struct kiocb
size_t count, ssize_t written)
 {
struct file *file = iocb->ki_filp;
-   struct address_space * mapping = file->f_mapping;
+   struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
struct page *page;
struct page *cached_page = NULL;
-   size_t  bytes;
struct pagevec  lru_pvec;
const struct iovec *cur_iov = iov; /* current iovec */
-   size_t  iov_base = 0;  /* offset in the current iovec */
+   size_t  iov_offset = 0;/* offset in the current iovec */
char __user *buf;
 
pagevec_init(&lru_pvec, 0);
@@ -1964,31 +1963,33 @@ generic_file_buffered_write(struct kiocb
if (likely(nr_segs == 1))
buf = iov-iov_base + written;
else {
-   filemap_set_next_iovec(&cur_iov, &iov_base, written);
-   buf = cur_iov->iov_base + iov_base;
+   filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+   buf = cur_iov->iov_base + iov_offset;
}
 
do {
-   unsigned long index;
-   unsigned long offset;
-   unsigned long maxlen;
-   size_t copied;
+   pgoff_t index;  /* Pagecache index for current page */
+   unsigned long offset;   /* Offset into pagecache page */
+   unsigned long maxlen;   /* Bytes remaining in current iovec */
+   size_t bytes;   /* Bytes to write to page */
+   size_t copied;  /* Bytes copied from user */
 
-   offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+   offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
if (bytes > count)
bytes = count;
 
+   maxlen = cur_iov->iov_len - iov_offset;
+   if (maxlen > bytes)
+   maxlen = bytes;
+
/*
 * Bring in the user page that we will copy from _first_.
 * Otherwise there's a nasty deadlock on copying from the
 * same page as we're writing to, without it being marked
 * up-to-date.
 */
-   maxlen = cur_iov->iov_len - iov_base;
-   if (maxlen > bytes)
-   maxlen = bytes;
fault_in_pages_readable(buf, maxlen);
 
page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
@@ -2019,7 +2020,7 @@ generic_file_buffered_write(struct kiocb
buf, bytes);
else
copied = filemap_copy_from_user_iovec(page, offset,
-   cur_iov, iov_base, bytes);
+   cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (status == AOP_TRUNCATED_PAGE) {
@@ -2037,12 +2038,12 @@ generic_file_buffered_write(struct kiocb
buf += status;
if (unlikely(nr_segs > 1)) {
filemap_set_next_iovec(&cur_iov,
-   &iov_base, status);
+   &iov_offset, status);
if (count)
buf = cur_iov->iov_base +
-   iov_base;
+   iov_offset;
} else {
-   iov_base += status;
+   iov_offset += status;
}
}
}

-- 


[patch 06/44] mm: trim more holes

2007-04-23 Thread Nick Piggin


If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
we may have failed the write operation despite prepare_write having
instantiated blocks past i_size. Fix this, and consolidate the trimming into
one place.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   80 +--
 1 file changed, 40 insertions(+), 40 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2001,22 +2001,9 @@ generic_file_buffered_write(struct kiocb
}
 
status = a_ops->prepare_write(file, page, offset, offset+bytes);
-   if (unlikely(status)) {
-   loff_t isize = i_size_read(inode);
+   if (unlikely(status))
+   goto fs_write_aop_error;
 
-   if (status != AOP_TRUNCATED_PAGE)
-   unlock_page(page);
-   page_cache_release(page);
-   if (status == AOP_TRUNCATED_PAGE)
-   continue;
-   /*
-* prepare_write() may have instantiated a few blocks
-* outside i_size.  Trim these off again.
-*/
-   if (pos + bytes > isize)
-   vmtruncate(inode, isize);
-   break;
-   }
if (likely(nr_segs == 1))
copied = filemap_copy_from_user(page, offset,
buf, bytes);
@@ -2025,40 +2012,53 @@ generic_file_buffered_write(struct kiocb
cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
-   if (status == AOP_TRUNCATED_PAGE) {
-   page_cache_release(page);
-   continue;
+   if (unlikely(status < 0))
+   goto fs_write_aop_error;
+   if (unlikely(copied != bytes)) {
+   status = -EFAULT;
+   goto fs_write_aop_error;
}
-   if (likely(copied  0)) {
-   if (!status)
-   status = copied;
+   if (unlikely(status > 0)) /* filesystem did partial write */
+   copied = status;
 
-   if (status >= 0) {
-   written += status;
-   count -= status;
-   pos += status;
-   buf += status;
-   if (unlikely(nr_segs > 1)) {
-   filemap_set_next_iovec(&cur_iov,
-   &iov_offset, status);
-   if (count)
-   buf = cur_iov->iov_base +
-   iov_offset;
-   } else {
-   iov_offset += status;
-   }
+   if (likely(copied > 0)) {
+   written += copied;
+   count -= copied;
+   pos += copied;
+   buf += copied;
+   if (unlikely(nr_segs > 1)) {
+   filemap_set_next_iovec(&cur_iov,
+   &iov_offset, copied);
+   if (count)
+   buf = cur_iov->iov_base + iov_offset;
+   } else {
+   iov_offset += copied;
}
}
-   if (unlikely(copied != bytes))
-   if (status >= 0)
-   status = -EFAULT;
unlock_page(page);
mark_page_accessed(page);
page_cache_release(page);
-   if (status < 0)
-   break;
balance_dirty_pages_ratelimited(mapping);
cond_resched();
+   continue;
+
+fs_write_aop_error:
+   if (status != AOP_TRUNCATED_PAGE)
+   unlock_page(page);
+   page_cache_release(page);
+
+   /*
+* prepare_write() may have instantiated a few blocks
+* outside i_size.  Trim these off again. Don't need
+* i_size_read because we hold i_mutex.
+*/
+   if (pos + bytes > inode->i_size)
+

[patch 03/44] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

2007-04-23 Thread Nick Piggin

From: Andrew Morton [EMAIL PROTECTED]

This patch fixed the following bug:

  When prefaulting in the pages in generic_file_buffered_write(), we only
  faulted in the pages for the first segment of the iovec.  If the second or
  a successive segment described a mmapping of the page into which we're
  write()ing, and that page is not up-to-date, the fault handler tries to lock
  the already-locked page (to bring it up to date) and deadlocks.

  An exploit for this bug is in writev-deadlock-demo.c, in
  http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

  (These demos assume blocksize < PAGE_CACHE_SIZE).

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment.  So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once. The other problem with the fix is that it didn't fix all the locking problems.
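
To make the worst case concrete, here is a small user-space sketch that
counts how many prepare_write+commit_write rounds each strategy would
need. It is only a model of the chunking arithmetic, not kernel code; the
helper names rounds_per_page() and rounds_per_segment() are invented for
this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rounds needed when each copy is limited only by the page boundary,
 * i.e. a single round may span several iovec segments.
 */
static size_t rounds_per_page(size_t pos, size_t total, size_t page_size)
{
	size_t rounds = 0;

	assert(page_size > 0);
	while (total) {
		size_t bytes = page_size - (pos % page_size);

		if (bytes > total)
			bytes = total;
		pos += bytes;
		total -= bytes;
		rounds++;
	}
	return rounds;
}

/*
 * Rounds needed when each copy is additionally limited to the current
 * iovec segment, which is what the reverted fix did.
 */
static size_t rounds_per_segment(size_t pos, const size_t *seg_len,
				 size_t nr_segs, size_t page_size)
{
	size_t rounds = 0;

	assert(page_size > 0);
	for (size_t s = 0; s < nr_segs; s++) {
		size_t left = seg_len[s];

		while (left) {
			size_t bytes = page_size - (pos % page_size);

			if (bytes > left)
				bytes = left;
			pos += bytes;
			left -= bytes;
			rounds++;
		}
	}
	return rounds;
}
```

For 1024 segments of 4 bytes each written into one 4096-byte page, the
first helper reports a single round and the second reports 1024.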


<insert numbers obtained via ext3-tools's writev-speed.c here>

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1971,21 +1971,14 @@ generic_file_buffered_write(struct kiocb
do {
unsigned long index;
unsigned long offset;
+   unsigned long maxlen;
size_t copied;
 
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
-
-   /* Limit the size of the copy to the caller's write size */
-   bytes = min(bytes, count);
-
-   /*
-* Limit the size of the copy to that of the current segment,
-* because fault_in_pages_readable() doesn't know how to walk
-* segments.
-*/
-   bytes = min(bytes, cur_iov->iov_len - iov_base);
+   if (bytes > count)
+   bytes = count;
 
/*
 * Bring in the user page that we will copy from _first_.
@@ -1993,7 +1986,10 @@ generic_file_buffered_write(struct kiocb
 * same page as we're writing to, without it being marked
 * up-to-date.
 */
-   fault_in_pages_readable(buf, bytes);
+   maxlen = cur_iov->iov_len - iov_base;
+   if (maxlen > bytes)
+   maxlen = bytes;
+   fault_in_pages_readable(buf, maxlen);
 
page = __grab_cache_page(mapping,index,cached_page,lru_pvec);
if (!page) {

-- 


[patch 07/44] mm: buffered write cleanup

2007-04-23 Thread Nick Piggin


Quite a bit of code is used in maintaining these cached pages that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page. Then a multi-page write into an uncached
part of the file, to make use of it.

Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).

[this gets rid of some cond_resched() calls in readahead.c and mpage.c due
 to clashes in -mm. What put them there, and why? ]

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/mpage.c |   12 
 mm/filemap.c   |  144 ++---
 mm/readahead.c |   28 +++
 3 files changed, 66 insertions(+), 118 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -689,26 +689,22 @@ EXPORT_SYMBOL(probe_page);
 struct page *find_or_create_page(struct address_space *mapping,
unsigned long index, gfp_t gfp_mask)
 {
-   struct page *page, *cached_page = NULL;
+   struct page *page;
int err;
 repeat:
page = find_lock_page(mapping, index);
if (!page) {
-   if (!cached_page) {
-   cached_page = alloc_page(gfp_mask);
-   if (!cached_page)
-   return NULL;
-   }
-   err = add_to_page_cache_lru(cached_page, mapping,
-   index, gfp_mask);
-   if (!err) {
-   page = cached_page;
-   cached_page = NULL;
-   } else if (err == -EEXIST)
-   goto repeat;
+   page = alloc_page(gfp_mask);
+   if (!page)
+   return NULL;
+   err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+   if (unlikely(err)) {
+   page_cache_release(page);
+   page = NULL;
+   if (err == -EEXIST)
+   goto repeat;
+   }
}
-   if (cached_page)
-   page_cache_release(cached_page);
return page;
 }
 EXPORT_SYMBOL(find_or_create_page);
@@ -903,11 +899,9 @@ void do_generic_mapping_read(struct addr
unsigned long next_index;
unsigned long prev_index;
loff_t isize;
-   struct page *cached_page;
int error;
struct file_ra_state ra = *_ra;
 
-   cached_page = NULL;
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_page;
@@ -1084,23 +1078,20 @@ no_cached_page:
 * Ok, it wasn't cached, so we need to create a new
 * page..
 */
-   if (!cached_page) {
-   cached_page = page_cache_alloc_cold(mapping);
-   if (!cached_page) {
-   desc->error = -ENOMEM;
-   goto out;
-   }
+   page = page_cache_alloc_cold(mapping);
+   if (!page) {
+   desc->error = -ENOMEM;
+   goto out;
}
-   error = add_to_page_cache_lru(cached_page, mapping,
+   error = add_to_page_cache_lru(page, mapping,
index, GFP_KERNEL);
if (error) {
+   page_cache_release(page);
if (error == -EEXIST)
goto find_page;
desc->error = error;
goto out;
}
-   page = cached_page;
-   cached_page = NULL;
goto readpage;
}
 
@@ -1110,8 +1101,6 @@ out:
_ra->prev_page = prev_index;
 
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
-   if (cached_page)
-   page_cache_release(cached_page);
if (filp)
file_accessed(filp);
 }
@@ -1605,35 +1594,28 @@ static struct page *__read_cache_page(st
int (*filler)(void *,struct page*),
void *data)
 {
-   struct page *page, *cached_page = NULL;
+   struct page *page;
int err;
 repeat:
page = find_get_page(mapping, index);

[patch 10/44] mm: buffered write iterator

2007-04-23 Thread Nick Piggin


Add an iterator data structure to operate over an iovec. Add usercopy
operators needed by generic_file_buffered_write, and convert that function
over.
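
For readers who want to play with the idea outside the kernel, below is a
minimal user-space analogue of such an iterator. It is a sketch under
assumptions, not the code from this patch: struct iov_cursor and the
iov_cursor_*() helpers are names invented here, and the caller is assumed
to pass a count no larger than the sum of the segment lengths.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* User-space stand-in for an iovec iterator (hypothetical names). */
struct iov_cursor {
	const struct iovec *iov;	/* current segment */
	unsigned long nr_segs;		/* segments left, including current */
	size_t iov_offset;		/* offset into the current segment */
	size_t count;			/* total bytes left to consume */
};

static void iov_cursor_init(struct iov_cursor *c, const struct iovec *iov,
			    unsigned long nr_segs, size_t count)
{
	c->iov = iov;
	c->nr_segs = nr_segs;
	c->iov_offset = 0;
	c->count = count;
}

/* Consume 'bytes' bytes, stepping over segment boundaries as needed. */
static void iov_cursor_advance(struct iov_cursor *c, size_t bytes)
{
	assert(bytes <= c->count);
	c->count -= bytes;

	while (bytes) {
		size_t left = c->iov->iov_len - c->iov_offset;
		size_t step = bytes < left ? bytes : left;

		bytes -= step;
		c->iov_offset += step;
		if (c->iov_offset == c->iov->iov_len) {
			/* Current segment exhausted: move to the next one. */
			c->iov++;
			c->nr_segs--;
			c->iov_offset = 0;
		}
	}
}

/* Bytes that can be taken from the current segment without crossing it. */
static size_t iov_cursor_single_seg_count(const struct iov_cursor *c)
{
	size_t left;

	if (c->count == 0)
		return 0;
	left = c->iov->iov_len - c->iov_offset;
	return left < c->count ? left : c->count;
}
```

The point of keeping all four fields in one structure is that callers no
longer have to carry cur_iov, iov_offset and count around as separate
variables and keep them in sync by hand.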

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 include/linux/fs.h |   33 
 mm/filemap.c   |  144 +++--
 mm/filemap.h   |  103 -
 3 files changed, 150 insertions(+), 130 deletions(-)

Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -398,6 +398,39 @@ struct page;
 struct address_space;
 struct writeback_control;
 
+struct iov_iter {
+   const struct iovec *iov;
+   unsigned long nr_segs;
+   size_t iov_offset;
+   size_t count;
+};
+
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes);
+size_t iov_iter_copy_from_user(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes);
+void iov_iter_advance(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i);
+size_t iov_iter_single_seg_count(struct iov_iter *i);
+
+static inline void iov_iter_init(struct iov_iter *i,
+   const struct iovec *iov, unsigned long nr_segs,
+   size_t count, size_t written)
+{
+   i->iov = iov;
+   i->nr_segs = nr_segs;
+   i->iov_offset = 0;
+   i->count = count + written;
+
+   iov_iter_advance(i, written);
+}
+
+static inline size_t iov_iter_count(struct iov_iter *i)
+{
+   return i->count;
+}
+
+
 struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -30,7 +30,7 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/cpuset.h>
-#include "filemap.h"
+#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include "internal.h"
 
 /*
@@ -1740,8 +1740,7 @@ int remove_suid(struct dentry *dentry)
 }
 EXPORT_SYMBOL(remove_suid);
 
-size_t
-__filemap_copy_from_user_iovec_inatomic(char *vaddr,
+static size_t __iovec_copy_from_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
 {
size_t copied = 0, left = 0;
@@ -1764,6 +1763,110 @@ __filemap_copy_from_user_iovec_inatomic(
 }
 
 /*
+ * Copy as much as we can into the page and return the number of bytes which
+ * were successfully copied.  If a fault is encountered then return the number of
+ * bytes which were copied.
+ */
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+   char *kaddr;
+   size_t copied;
+
+   BUG_ON(!in_atomic());
+   kaddr = kmap_atomic(page, KM_USER0);
+   if (likely(i->nr_segs == 1)) {
+   int left;
+   char __user *buf = i->iov->iov_base + i->iov_offset;
+   left = __copy_from_user_inatomic_nocache(kaddr + offset,
+   buf, bytes);
+   copied = bytes - left;
+   } else {
+   copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+   i->iov, i->iov_offset, bytes);
+   }
+   kunmap_atomic(kaddr, KM_USER0);
+
+   return copied;
+}
+
+/*
+ * This has the same sideeffects and return value as
+ * iov_iter_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
+ * Page must not be locked.
+ */
+size_t iov_iter_copy_from_user(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+   char *kaddr;
+   size_t copied;
+
+   kaddr = kmap(page);
+   if (likely(i->nr_segs == 1)) {
+   int left;
+   char __user *buf = i->iov->iov_base + i->iov_offset;
+   left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+   copied = bytes - left;
+   } else {
+   copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+   i->iov, i->iov_offset, bytes);
+   }
+   kunmap(page);
+   return copied;
+}
+
+static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
+{
+   if (likely(i->nr_segs == 1)) {
+   i->iov_offset += bytes;
+   } else {
+   const struct iovec *iov = i->iov;
+   size_t base = i->iov_offset;
+
+   while (bytes) {
+   int copy = min(bytes, iov->iov_len - base);
+
+   bytes -= copy;
+

[patch 18/44] ext3 convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]


Various fixes and improvements

Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

 fs/ext3/inode.c |  136 
 1 file changed, 88 insertions(+), 48 deletions(-)

Index: linux-2.6/fs/ext3/inode.c
===
--- linux-2.6.orig/fs/ext3/inode.c
+++ linux-2.6/fs/ext3/inode.c
@@ -1147,51 +1147,68 @@ static int do_journal_get_write_access(h
return ext3_journal_get_write_access(handle, bh);
 }
 
-static int ext3_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ext3_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = mapping->host;
int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
handle_t *handle;
int retries = 0;
+   struct page *page;
+   pgoff_t index;
+   unsigned from, to;
+
+   index = pos >> PAGE_CACHE_SHIFT;
+   from = pos & (PAGE_CACHE_SIZE - 1);
+   to = from + len;
 
 retry:
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
+
handle = ext3_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
+   unlock_page(page);
+   page_cache_release(page);
ret = PTR_ERR(handle);
goto out;
}
-   if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
-   ret = nobh_prepare_write(page, from, to, ext3_get_block);
-   else
-   ret = block_prepare_write(page, from, to, ext3_get_block);
+   ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ext3_get_block);
if (ret)
-   goto prepare_write_failed;
+   goto write_begin_failed;
 
if (ext3_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, do_journal_get_write_access);
}
-prepare_write_failed:
-   if (ret)
+write_begin_failed:
+   if (ret) {
ext3_journal_stop(handle);
+   unlock_page(page);
+   page_cache_release(page);
+   }
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
 out:
return ret;
 }
 
+
 int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
 {
int err = journal_dirty_data(handle, bh);
if (err)
ext3_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-   bh, handle,err);
+   bh, handle, err);
return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
if (!buffer_mapped(bh) || buffer_freed(bh))
return 0;
@@ -1206,78 +1223,100 @@ static int commit_write_fn(handle_t *han
  * ext3 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext3_ordered_commit_write(struct file *file, struct page *page,
-unsigned from, unsigned to)
+static int ext3_ordered_write_end(struct file *file,
+   struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
handle_t *handle = ext3_journal_current_handle();
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = file->f_mapping->host;
+   unsigned from, to;
int ret = 0, ret2;
 
+   from = pos & (PAGE_CACHE_SIZE - 1);
+   to = from + len;
+
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, ext3_journal_dirty_data);
 
if (ret == 0) {
/*
-* generic_commit_write() will run mark_inode_dirty() if i_size
+* generic_write_end() will run mark_inode_dirty() if i_size
 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 * into that.
 */
loff_t new_i_size;
 
-   new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+   new_i_size = pos + copied;
if (new_i_size > EXT3_I(inode)->i_disksize)

[patch 19/44] ext4 convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Convert ext4 to use write_begin()/write_end() methods.

Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

 fs/ext4/inode.c |  147 +++-
 1 file changed, 93 insertions(+), 54 deletions(-)

Index: linux-2.6/fs/ext4/inode.c
===
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -1146,34 +1146,50 @@ static int do_journal_get_write_access(h
return ext4_journal_get_write_access(handle, bh);
 }
 
-static int ext4_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ext4_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = mapping->host;
int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
handle_t *handle;
int retries = 0;
+   struct page *page;
+   pgoff_t index;
+   unsigned from, to;
+
+   index = pos >> PAGE_CACHE_SHIFT;
+   from = pos & (PAGE_CACHE_SIZE - 1);
+   to = from + len;
 
 retry:
-   handle = ext4_journal_start(inode, needed_blocks);
-   if (IS_ERR(handle)) {
-   ret = PTR_ERR(handle);
-   goto out;
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
+
+   handle = ext4_journal_start(inode, needed_blocks);
+   if (IS_ERR(handle)) {
+   unlock_page(page);
+   page_cache_release(page);
+   ret = PTR_ERR(handle);
+   goto out;
}
-   if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
-   ret = nobh_prepare_write(page, from, to, ext4_get_block);
-   else
-   ret = block_prepare_write(page, from, to, ext4_get_block);
-   if (ret)
-   goto prepare_write_failed;
 
-   if (ext4_should_journal_data(inode)) {
+   ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ext4_get_block);
+
+   if (!ret && ext4_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, do_journal_get_write_access);
}
-prepare_write_failed:
-   if (ret)
+
+   if (ret) {
ext4_journal_stop(handle);
+   unlock_page(page);
+   page_cache_release(page);
+   }
+
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
 out:
@@ -1185,12 +1201,12 @@ int ext4_journal_dirty_data(handle_t *ha
int err = jbd2_journal_dirty_data(handle, bh);
if (err)
ext4_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-   bh, handle,err);
+   bh, handle, err);
return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
if (!buffer_mapped(bh) || buffer_freed(bh))
return 0;
@@ -1205,78 +1221,100 @@ static int commit_write_fn(handle_t *han
  * ext4 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext4_ordered_commit_write(struct file *file, struct page *page,
-unsigned from, unsigned to)
+static int ext4_ordered_write_end(struct file *file,
+   struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
handle_t *handle = ext4_journal_current_handle();
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = file->f_mapping->host;
+   unsigned from, to;
int ret = 0, ret2;
 
+   from = pos & (PAGE_CACHE_SIZE - 1);
+   to = from + len;
+
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, ext4_journal_dirty_data);
 
if (ret == 0) {
/*
-* generic_commit_write() will run mark_inode_dirty() if i_size
+* generic_write_end() will run mark_inode_dirty() if i_size
 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 * into that.
 */
loff_t new_i_size;
 
-   new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+

[patch 17/44] ext2 convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/ext2/dir.c   |   47 +--
 fs/ext2/ext2.h  |3 +++
 fs/ext2/inode.c |   24 +---
 3 files changed, 45 insertions(+), 29 deletions(-)

Index: linux-2.6/fs/ext2/inode.c
===
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -726,18 +726,21 @@ ext2_readpages(struct file *file, struct
return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
 }
 
-static int
-ext2_prepare_write(struct file *file, struct page *page,
-   unsigned from, unsigned to)
+int __ext2_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page,from,to,ext2_get_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ext2_get_block);
 }
 
 static int
-ext2_nobh_prepare_write(struct file *file, struct page *page,
-   unsigned from, unsigned to)
+ext2_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return nobh_prepare_write(page,from,to,ext2_get_block);
+   *pagep = NULL;
+   return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
 }
 
 static int ext2_nobh_writepage(struct page *page,
@@ -773,8 +776,8 @@ const struct address_space_operations ex
.readpages  = ext2_readpages,
.writepage  = ext2_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = ext2_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= ext2_write_begin,
+   .write_end  = generic_write_end,
.bmap   = ext2_bmap,
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
@@ -791,8 +794,7 @@ const struct address_space_operations ex
.readpages  = ext2_readpages,
.writepage  = ext2_nobh_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = ext2_nobh_prepare_write,
-   .commit_write   = nobh_commit_write,
+   /* XXX: todo */
.bmap   = ext2_bmap,
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
Index: linux-2.6/fs/ext2/dir.c
===
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -22,6 +22,7 @@
  */
 
 #include "ext2.h"
+#include <linux/buffer_head.h>
 #include <linux/pagemap.h>
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
@@ -61,12 +62,14 @@ ext2_last_byte(struct inode *inode, unsi
return last_byte;
 }
 
-static int ext2_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-   struct inode *dir = page->mapping->host;
+   struct address_space *mapping = page->mapping;
+   struct inode *dir = mapping->host;
int err = 0;
+
dir->i_version++;
-   page->mapping->a_ops->commit_write(NULL, page, from, to);
+   block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
@@ -412,16 +415,18 @@ ino_t ext2_inode_by_name(struct inode * 
 void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
struct page *page, struct inode *inode)
 {
-   unsigned from = (char *) de - (char *) page_address(page);
-   unsigned to = from + le16_to_cpu(de->rec_len);
+   loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+   (char *) de - (char *) page_address(page);
+   unsigned len = le16_to_cpu(de->rec_len);
int err;
 
lock_page(page);
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __ext2_write_begin(NULL, page->mapping, pos, len,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
BUG_ON(err);
de->inode = cpu_to_le32(inode->i_ino);
-   ext2_set_de_type (de, inode);
-   err = ext2_commit_chunk(page, from, to);
+   ext2_set_de_type(de, inode);
+   err = ext2_commit_chunk(page, pos, len);
ext2_put_page(page);
dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
@@ -444,7 +449,7 @@ int ext2_add_link (struct dentry *dentry
unsigned long npages = dir_pages(dir);
unsigned long n;
char *kaddr;
-

[patch 16/44] rd convert to new aops

2007-04-23 Thread Nick Piggin

Also clean up various little things.

I've got rid of the comment from akpm, because now that make_page_uptodate
is only called from 2 places, it is pretty easy to see that the buffers
are in an uptodate state at the time of the call. Actually, it was OK before
my patch as well, because the memset is equivalent to reading from disk
of course... however it is more explicit where the updates come from now.
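
As a rough userspace sketch of "where the updates come from" on the write
side (a model, not the kernel code itself): when a page is not yet uptodate
and the copy covered less than the whole page, the bytes on either side of
the copied range are zeroed before the page is marked uptodate. The 4 KiB
page size and the sketch_ names are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/*
 * Zero every byte of a not-yet-uptodate page except the range that was
 * just copied in. 'pos' is the file offset of the write, 'copied' the
 * number of bytes that actually landed in the page.
 */
static void sketch_zero_around_copy(unsigned char *page,
                                    unsigned long long pos, unsigned copied)
{
    unsigned from = (unsigned)(pos & (SKETCH_PAGE_SIZE - 1));
    unsigned to = from + copied;

    assert(to <= SKETCH_PAGE_SIZE);
    memset(page, 0, from);                        /* head before the copy */
    memset(page + to, 0, SKETCH_PAGE_SIZE - to);  /* tail after the copy */
}
```

Only the offset within the page matters here; the page index is already
fixed by the time write_end runs.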

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 drivers/block/rd.c |  125 ++---
 1 file changed, 73 insertions(+), 52 deletions(-)

Index: linux-2.6/drivers/block/rd.c
===
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -104,50 +104,60 @@ static void make_page_uptodate(struct pa
struct buffer_head *head = bh;
 
do {
-   if (!buffer_uptodate(bh)) {
-   memset(bh->b_data, 0, bh->b_size);
-   /*
-* akpm: I'm totally undecided about this.  The
-* buffer has just been magically brought up to
-* date, but nobody should want to be reading
-* it anyway, because it hasn't been used for
-* anything yet.  It is still in a "not read
-* from disk yet" state.
-*
-* But non-uptodate buffers against an uptodate
-* page are against the rules.  So do it anyway.
-*/
+   if (!buffer_uptodate(bh))
 set_buffer_uptodate(bh);
-   }
} while ((bh = bh->b_this_page) != head);
-   } else {
-   memset(page_address(page), 0, PAGE_CACHE_SIZE);
}
-   flush_dcache_page(page);
SetPageUptodate(page);
 }
 
 static int ramdisk_readpage(struct file *file, struct page *page)
 {
-   if (!PageUptodate(page))
+   if (!PageUptodate(page)) {
+   memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
make_page_uptodate(page);
+   }
unlock_page(page);
return 0;
 }
 
-static int ramdisk_prepare_write(struct file *file, struct page *page,
-   unsigned offset, unsigned to)
-{
-   if (!PageUptodate(page))
-   make_page_uptodate(page);
+static int ramdisk_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   struct page *page;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
return 0;
 }
 
-static int ramdisk_commit_write(struct file *file, struct page *page,
-   unsigned offset, unsigned to)
-{
+static int ramdisk_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
+{
+   if (!PageUptodate(page)) {
+   if (copied != PAGE_CACHE_SIZE) {
+   void *dst;
+   unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+   unsigned to = from + copied;
+
+   dst = kmap_atomic(page, KM_USER0);
+   memset(dst, 0, from);
+   memset(dst + to, 0, PAGE_CACHE_SIZE - to);
+   flush_dcache_page(page);
+   kunmap_atomic(dst, KM_USER0);
+   }
+   make_page_uptodate(page);
+   }
+
set_page_dirty(page);
-   return 0;
+   unlock_page(page);
+   page_cache_release(page);
+   return copied;
 }
 
 /*
@@ -191,8 +201,8 @@ static int ramdisk_set_page_dirty(struct
 
 static const struct address_space_operations ramdisk_aops = {
.readpage   = ramdisk_readpage,
-   .prepare_write  = ramdisk_prepare_write,
-   .commit_write   = ramdisk_commit_write,
+   .write_begin= ramdisk_write_begin,
+   .write_end  = ramdisk_write_end,
.writepage  = ramdisk_writepage,
.set_page_dirty = ramdisk_set_page_dirty,
.writepages = ramdisk_writepages,
@@ -201,13 +211,14 @@ static const struct address_space_operat
 static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
struct address_space *mapping)
 {
-   pgoff_t index = sector >> (PAGE_CACHE_SHIFT - 9);
+   loff_t pos = sector << 9;
unsigned int vec_offset =

[patch 20/44] xfs convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/xfs/linux-2.6/xfs_aops.c |   19 ---
 fs/xfs/linux-2.6/xfs_lrw.c  |   35 ---
 2 files changed, 24 insertions(+), 30 deletions(-)

Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
@@ -1414,13 +1414,18 @@ xfs_vm_direct_IO(
 }
 
 STATIC int
-xfs_vm_prepare_write(
+xfs_vm_write_begin(
struct file *file,
-   struct page *page,
-   unsigned intfrom,
-   unsigned intto)
+   struct address_space*mapping,
+   loff_t  pos,
+   unsignedlen,
+   unsignedflags,
+   struct page **pagep,
+   void**fsdata)
 {
-   return block_prepare_write(page, from, to, xfs_get_blocks);
+   *pagep = NULL;
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   xfs_get_blocks);
 }
 
 STATIC sector_t
@@ -1474,8 +1479,8 @@ const struct address_space_operations xf
.sync_page  = block_sync_page,
.releasepage= xfs_vm_releasepage,
.invalidatepage = xfs_vm_invalidatepage,
-   .prepare_write  = xfs_vm_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= xfs_vm_write_begin,
+   .write_end  = generic_write_end,
.bmap   = xfs_vm_bmap,
.direct_IO  = xfs_vm_direct_IO,
.migratepage= buffer_migrate_page,
Index: linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
===
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_lrw.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
@@ -134,45 +134,34 @@ xfs_iozero(
loff_t  pos,/* offset in file   */
size_t  count)  /* size of data to zero */
 {
-   unsignedbytes;
struct page *page;
struct address_space*mapping;
int status;
 
mapping = ip->i_mapping;
do {
-   unsigned long index, offset;
+   unsigned offset, bytes;
+   void *fsdata;
 
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-   index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
if (bytes > count)
bytes = count;
 
-   status = -ENOMEM;
-   page = grab_cache_page(mapping, index);
-   if (!page)
-   break;
-
-   status = mapping->a_ops->prepare_write(NULL, page, offset,
-   offset + bytes);
+   status = pagecache_write_begin(NULL, mapping, pos, bytes,
+   AOP_FLAG_UNINTERRUPTIBLE,
+   &page, &fsdata);
if (status)
-   goto unlock;
+   break;
 
memclear_highpage_flush(page, offset, bytes);
 
-   status = mapping->a_ops->commit_write(NULL, page, offset,
-   offset + bytes);
-   if (!status) {
-   pos += bytes;
-   count -= bytes;
-   }
-
-unlock:
-   unlock_page(page);
-   page_cache_release(page);
-   if (status)
-   break;
+   status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
+   page, fsdata);
+   WARN_ON(status <= 0); /* can't return less than zero! */
+   pos += bytes;
+   count -= bytes;
+   status = 0;
} while (count);
 
return (-status);

-- 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 13/44] mm: restore KERNEL_DS optimisations

2007-04-23 Thread Nick Piggin

Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
path.

This may be a pretty questionable gain in most cases, especially after the
legacy 2copy write path is removed, but it doesn't cost much.

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 mm/filemap.c |   11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2157,7 +2157,7 @@ static ssize_t generic_perform_write_2co
 * cannot take a pagefault with the destination page locked.
 * So pin the source page to copy it.
 */
-   if (!PageUptodate(page)) {
+   if (!PageUptodate(page) && !segment_eq(get_fs(), KERNEL_DS)) {
unlock_page(page);
 
src_page = alloc_page(GFP_KERNEL);
@@ -2282,6 +2282,13 @@ static ssize_t generic_perform_write(str
const struct address_space_operations *a_ops = mapping->a_ops;
long status = 0;
ssize_t written = 0;
+   unsigned int flags = 0;
+
+   /*
+* Copies from kernel address space cannot fail (NFSD is a big user).
+*/
+   if (segment_eq(get_fs(), KERNEL_DS))
+   flags |= AOP_FLAG_UNINTERRUPTIBLE;
 
do {
struct page *page;
@@ -2313,7 +2320,7 @@ again:
break;
}
 
-   status = a_ops->write_begin(file, mapping, pos, bytes, 0,
+   status = a_ops->write_begin(file, mapping, pos, bytes, flags,
&page, &fsdata);
if (unlikely(status))
break;

[patch 22/44] fat convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/fat/inode.c |   27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

Index: linux-2.6/fs/fat/inode.c
===
--- linux-2.6.orig/fs/fat/inode.c
+++ linux-2.6/fs/fat/inode.c
@@ -140,19 +140,24 @@ static int fat_readpages(struct file *fi
return mpage_readpages(mapping, pages, nr_pages, fat_get_block);
 }
 
-static int fat_prepare_write(struct file *file, struct page *page,
-unsigned from, unsigned to)
+static int fat_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return cont_prepare_write(page, from, to, fat_get_block,
- &MSDOS_I(page->mapping->host)->mmu_private);
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   fat_get_block,
+   &MSDOS_I(mapping->host)->mmu_private);
 }
 
-static int fat_commit_write(struct file *file, struct page *page,
-   unsigned from, unsigned to)
+static int fat_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *pagep, void *fsdata)
 {
-   struct inode *inode = page->mapping->host;
-   int err = generic_commit_write(file, page, from, to);
-   if (!err && !(MSDOS_I(inode)->i_attrs & ATTR_ARCH)) {
+   struct inode *inode = mapping->host;
+   int err;
+   err = generic_write_end(file, mapping, pos, len, copied, pagep, fsdata);
+   if (!(err < 0) && !(MSDOS_I(inode)->i_attrs & ATTR_ARCH)) {
inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
MSDOS_I(inode)->i_attrs |= ATTR_ARCH;
mark_inode_dirty(inode);
@@ -201,8 +206,8 @@ static const struct address_space_operat
.writepage  = fat_writepage,
.writepages = fat_writepages,
.sync_page  = block_sync_page,
-   .prepare_write  = fat_prepare_write,
-   .commit_write   = fat_commit_write,
+   .write_begin= fat_write_begin,
+   .write_end  = fat_write_end,
.direct_IO  = fat_direct_IO,
.bmap   = _fat_bmap
 };

[patch 28/44] bfs convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/bfs/file.c |   12 
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/bfs/file.c
===
--- linux-2.6.orig/fs/bfs/file.c
+++ linux-2.6/fs/bfs/file.c
@@ -145,9 +145,13 @@ static int bfs_readpage(struct file *fil
return block_read_full_page(page, bfs_get_block);
 }
 
-static int bfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int bfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page, from, to, bfs_get_block);
+   *pagep = NULL;
+   return block_write_begin(file, mapping, pos, len, flags,
+   pagep, fsdata, bfs_get_block);
 }
 
 static sector_t bfs_bmap(struct address_space *mapping, sector_t block)
@@ -159,8 +163,8 @@ const struct address_space_operations bf
.readpage   = bfs_readpage,
.writepage  = bfs_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = bfs_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= bfs_write_begin,
+   .write_end  = generic_write_end,
.bmap   = bfs_bmap,
 };
 

[patch 26/44] hfsplus convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hfsplus/extents.c |   21 +
 fs/hfsplus/inode.c   |   20 
 2 files changed, 21 insertions(+), 20 deletions(-)

Index: linux-2.6/fs/hfsplus/inode.c
===
--- linux-2.6.orig/fs/hfsplus/inode.c
+++ linux-2.6/fs/hfsplus/inode.c
@@ -26,10 +26,14 @@ static int hfsplus_writepage(struct page
return block_write_full_page(page, hfsplus_get_block, wbc);
 }
 
-static int hfsplus_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-   return cont_prepare_write(page, from, to, hfsplus_get_block,
-   &HFSPLUS_I(page->mapping->host).phys_size);
+static int hfsplus_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   hfsplus_get_block,
+   &HFSPLUS_I(mapping->host).phys_size);
 }
 
 static sector_t hfsplus_bmap(struct address_space *mapping, sector_t block)
@@ -113,8 +117,8 @@ const struct address_space_operations hf
.readpage   = hfsplus_readpage,
.writepage  = hfsplus_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfsplus_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfsplus_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfsplus_bmap,
.releasepage= hfsplus_releasepage,
 };
@@ -123,8 +127,8 @@ const struct address_space_operations hf
.readpage   = hfsplus_readpage,
.writepage  = hfsplus_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfsplus_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfsplus_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfsplus_bmap,
.direct_IO  = hfsplus_direct_IO,
.writepages = hfsplus_writepages,
Index: linux-2.6/fs/hfsplus/extents.c
===
--- linux-2.6.orig/fs/hfsplus/extents.c
+++ linux-2.6/fs/hfsplus/extents.c
@@ -443,21 +443,18 @@ void hfsplus_file_truncate(struct inode 
if (inode->i_size > HFSPLUS_I(inode).phys_size) {
struct address_space *mapping = inode->i_mapping;
struct page *page;
-   u32 size = inode->i_size - 1;
+   void *fsdata;
+   u32 size = inode->i_size;
int res;
 
-   page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-   if (!page)
-   return;
-   size &= PAGE_CACHE_SIZE - 1;
-   size++;
-   res = mapping->a_ops->prepare_write(NULL, page, size, size);
-   if (!res)
-   res = mapping->a_ops->commit_write(NULL, page, size, size);
+   res = pagecache_write_begin(NULL, mapping, size, 0,
+   AOP_FLAG_UNINTERRUPTIBLE,
+   &page, &fsdata);
if (res)
-   inode->i_size = HFSPLUS_I(inode).phys_size;
-   unlock_page(page);
-   page_cache_release(page);
+   return;
+   res = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
+   if (res < 0)
+   return;
mark_inode_dirty(inode);
return;
} else if (inode->i_size == HFSPLUS_I(inode).phys_size)

[patch 29/44] qnx4 convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/qnx4/inode.c |   21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/qnx4/inode.c
===
--- linux-2.6.orig/fs/qnx4/inode.c
+++ linux-2.6/fs/qnx4/inode.c
@@ -433,16 +433,21 @@ static int qnx4_writepage(struct page *p
 {
return block_write_full_page(page,qnx4_get_block, wbc);
 }
+
 static int qnx4_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,qnx4_get_block);
 }
-static int qnx4_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
-{
-   struct qnx4_inode_info *qnx4_inode = qnx4_i(page->mapping->host);
-   return cont_prepare_write(page, from, to, qnx4_get_block,
- &qnx4_inode->mmu_private);
+
+static int qnx4_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   struct qnx4_inode_info *qnx4_inode = qnx4_i(mapping->host);
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   qnx4_get_block,
+   &qnx4_inode->mmu_private);
 }
 static sector_t qnx4_bmap(struct address_space *mapping, sector_t block)
 {
@@ -452,8 +457,8 @@ static const struct address_space_operat
.readpage   = qnx4_readpage,
.writepage  = qnx4_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = qnx4_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= qnx4_write_begin,
+   .write_end  = generic_write_end,
.bmap   = qnx4_bmap
 };
 

[patch 37/44] hostfs convert to new aops

2007-04-23 Thread Nick Piggin

This also gets rid of a lot of useless read_file stuff. And also
optimises the full page write case by marking a !uptodate page uptodate.
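
A small userspace model of the bookkeeping this leaves in write_end, offered
as a sketch under assumptions rather than the hostfs code: the structs, the
4 KiB page size and the sketch_ names are invented for illustration, and
'written' stands in for the return value of the host write.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096  /* assumption: 4 KiB pages */

struct sketch_page  { int uptodate; };
struct sketch_inode { long long i_size; };

/*
 * Model of what happens after the host write returns 'written'
 * (bytes written, or a negative error) for a write at offset 'pos'.
 */
static int sketch_write_end(struct sketch_page *page,
                            struct sketch_inode *inode,
                            long long pos, int written)
{
    assert(written <= SKETCH_PAGE_SIZE);

    /* A full-page write leaves nothing stale, so the page may go uptodate. */
    if (!page->uptodate && written == SKETCH_PAGE_SIZE)
        page->uptodate = 1;

    /* On success the offset has moved past the data; grow i_size if needed. */
    if (written > 0 && pos + written > inode->i_size)
        inode->i_size = pos + written;

    return written;
}
```

A partial write leaves the page !uptodate, so a later read still has to
fetch the rest of the page from the host.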

Cc: Jeff Dike [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hostfs/hostfs_kern.c |   70 +++-
 1 file changed, 28 insertions(+), 42 deletions(-)

Index: linux-2.6/fs/hostfs/hostfs_kern.c
===
--- linux-2.6.orig/fs/hostfs/hostfs_kern.c
+++ linux-2.6/fs/hostfs/hostfs_kern.c
@@ -461,56 +461,42 @@ int hostfs_readpage(struct file *file, s
return(err);
 }
 
-int hostfs_prepare_write(struct file *file, struct page *page,
-unsigned int from, unsigned int to)
+int hostfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   char *buffer;
-   long long start, tmp;
-   int err;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 
-   start = (long long) page->index << PAGE_CACHE_SHIFT;
-   buffer = kmap(page);
-   if(from != 0){
-   tmp = start;
-   err = read_file(FILE_HOSTFS_I(file)->fd, &tmp, buffer,
-   from);
-   if(err < 0) goto out;
-   }
-   if(to != PAGE_CACHE_SIZE){
-   start += to;
-   err = read_file(FILE_HOSTFS_I(file)->fd, &start, buffer + to,
-   PAGE_CACHE_SIZE - to);
-   if(err < 0) goto out;
-   }
-   err = 0;
- out:
-   kunmap(page);
-   return(err);
+   *pagep = __grab_cache_page(mapping, index);
+   if (!*pagep)
+   return -ENOMEM;
+   return 0;
 }
 
-int hostfs_commit_write(struct file *file, struct page *page, unsigned from,
-unsigned to)
+int hostfs_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   struct address_space *mapping = page-mapping;
struct inode *inode = mapping->host;
-   char *buffer;
-   long long start;
-   int err = 0;
+   void *buffer;
+   unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+   int err;
 
-   start = (((long long) page->index) << PAGE_CACHE_SHIFT) + from;
buffer = kmap(page);
-   err = write_file(FILE_HOSTFS_I(file)->fd, &start, buffer + from,
-to - from);
-   if(err > 0) err = 0;
-
-   /* Actually, if !err, write_file has added to-from to start, so, despite
-* the appearance, we are comparing i_size against the _last_ written
-* location, as we should. */
+   err = write_file(FILE_HOSTFS_I(file)->fd, &pos, buffer + from, copied);
+   kunmap(page);
+
+   if (!PageUptodate(page) && err == PAGE_CACHE_SIZE)
+   SetPageUptodate(page);
+   unlock_page(page);
+   page_cache_release(page);
 
-   if(!err && (start > inode->i_size))
-   inode->i_size = start;
+   /* If err > 0, write_file has added err to pos, so we are comparing
+* i_size against the last byte written.
+*/
+   if (err > 0 && (pos > inode->i_size))
+   inode->i_size = pos;
 
-   kunmap(page);
return(err);
 }
 
@@ -518,8 +504,8 @@ static const struct address_space_operat
.writepage  = hostfs_writepage,
.readpage   = hostfs_readpage,
.set_page_dirty = __set_page_dirty_nobuffers,
-   .prepare_write  = hostfs_prepare_write,
-   .commit_write   = hostfs_commit_write
+   .write_begin= hostfs_write_begin,
+   .write_end  = hostfs_write_end,
 };
 
 static int init_inode(struct inode *inode, struct dentry *dentry)

[patch 39/44] cifs convert to new aops

2007-04-23 Thread Nick Piggin

Convert to new aops, and fix security hole where page is set uptodate
before contents are uptodate.
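
To illustrate the ordering problem in userspace terms (a hedged model, not
the cifs code; the function names and the 4 KiB page size are assumptions):
the old prepare_write decided "uptodate" from the requested range before any
data was copied, whereas write_end can decide from the number of bytes that
actually arrived, so a short copy no longer exposes stale page contents.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Old ordering: decided before the copy, from the requested range alone. */
static int sketch_uptodate_before_copy(unsigned from, unsigned to)
{
    return from == 0 && to == SKETCH_PAGE_SIZE;
}

/* New ordering: decided after the copy, from the bytes actually copied. */
static int sketch_uptodate_after_copy(int was_uptodate, unsigned copied)
{
    assert(copied <= SKETCH_PAGE_SIZE);
    return was_uptodate || copied == SKETCH_PAGE_SIZE;
}
```

With a full-page request that only manages to copy 1000 bytes, the first
function has already answered "uptodate" while the second still says no.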

Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/cifs/file.c |   89 -
 1 file changed, 51 insertions(+), 38 deletions(-)

Index: linux-2.6/fs/cifs/file.c
===
--- linux-2.6.orig/fs/cifs/file.c
+++ linux-2.6/fs/cifs/file.c
@@ -103,7 +103,7 @@ static inline int cifs_open_inode_helper
 
/* want handles we can use to read with first
   in the list so we do not have to walk the
-  list to search for one in prepare_write */
+  list to search for one in write_begin */
if ((file->f_flags & O_ACCMODE) == O_WRONLY) {
list_add_tail(&pCifsFile->flist, 
  &pCifsInode->openFileList);
@@ -1358,40 +1358,37 @@ static int cifs_writepage(struct page* p
return rc;
 }
 
-static int cifs_commit_write(struct file *file, struct page *page,
-   unsigned offset, unsigned to)
+static int cifs_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
int xid;
int rc = 0;
-   struct inode *inode = page->mapping->host;
-   loff_t position = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+   struct inode *inode = mapping->host;
+   loff_t position = pos + copied;
char *page_data;
 
xid = GetXid();
-   cFYI(1, ("commit write for page %p up to position %lld for %d", 
-page, position, to));
+   cFYI(1, ("write end for page %p at pos %lld, copied %d",
+page, pos, copied));
spin_lock(&inode->i_lock);
if (position > inode->i_size) {
i_size_write(inode, position);
}
spin_unlock(&inode->i_lock);
+   if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
+   SetPageUptodate(page);
+
if (!PageUptodate(page)) {
-   position =  ((loff_t)page->index << PAGE_CACHE_SHIFT) + offset;
-   /* can not rely on (or let) writepage write this data */
-   if (to < offset) {
-   cFYI(1, ("Illegal offsets, can not copy from %d to %d",
-   offset, to));
-   FreeXid(xid);
-   return rc;
-   }
+   unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);
+
/* this is probably better than directly calling
   partialpage_write since in this function the file handle is
   known which we might as well leverage */
/* BB check if anything else missing out of ppw
   such as updating last write time */
page_data = kmap(page);
-   rc = cifs_write(file, page_data + offset, to-offset,
-   position);
+   rc = cifs_write(file, page_data + offset, copied, pos);
if (rc > 0)
rc = 0;
/* else if (rc < 0) should we set writebehind rc? */
@@ -1399,9 +1396,12 @@ static int cifs_commit_write(struct file
} else {
set_page_dirty(page);
}
-
FreeXid(xid);
-   return rc;
+
+   unlock_page(page);
+   page_cache_release(page);
+
+   return rc < 0 ? rc : copied;
 }
 
 int cifs_fsync(struct file *file, struct dentry *dentry, int datasync)
@@ -1928,34 +1928,47 @@ int is_size_safe_to_change(struct cifsIn
return 1;
 }
 
-static int cifs_prepare_write(struct file *file, struct page *page,
-   unsigned from, unsigned to)
+static int cifs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
int rc = 0;
loff_t i_size;
loff_t offset;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   struct page *page;
+
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
 
-   cFYI(1, ("prepare write for page %p from %d to %d",page,from,to));
+   cFYI(1, ("write begin for page %p at pos %lld, length %d",
+page, pos, len));
if (PageUptodate(page))
return 0;
 
-   /* If we are writing a full page it will be up to date,
-  no need to read from the server */
-   if ((to == PAGE_CACHE_SIZE) && (from == 0)) {
-   SetPageUptodate(page);
+   /* If we are writing a full page it will become up to date,
+  no need to read from the server (although we may encounter a
+  short copy, so write_end has to handle this) */
+   if (len

[patch 43/44] minix convert to new aops

2007-04-23 Thread Nick Piggin

Cc: Andries Brouwer [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/minix/dir.c   |   43 +--
 fs/minix/inode.c |   23 +++
 2 files changed, 44 insertions(+), 22 deletions(-)

Index: linux-2.6/fs/minix/inode.c
===
--- linux-2.6.orig/fs/minix/inode.c
+++ linux-2.6/fs/minix/inode.c
@@ -348,24 +348,39 @@ static int minix_writepage(struct page *
 {
return block_write_full_page(page, minix_get_block, wbc);
 }
+
 static int minix_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,minix_get_block);
 }
-static int minix_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+int __minix_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page,from,to,minix_get_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   minix_get_block);
 }
+
+static int minix_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return __minix_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t minix_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,minix_get_block);
 }
+
 static const struct address_space_operations minix_aops = {
.readpage = minix_readpage,
.writepage = minix_writepage,
.sync_page = block_sync_page,
-   .prepare_write = minix_prepare_write,
-   .commit_write = generic_commit_write,
+   .write_begin = minix_write_begin,
+   .write_end = generic_write_end,
.bmap = minix_bmap
 };
 
Index: linux-2.6/fs/minix/dir.c
===
--- linux-2.6.orig/fs/minix/dir.c
+++ linux-2.6/fs/minix/dir.c
@@ -9,6 +9,7 @@
  */
 
 #include "minix.h"
+#include <linux/buffer_head.h>
 #include <linux/highmem.h>
 #include <linux/smp_lock.h>
 
@@ -48,11 +49,12 @@ static inline unsigned long dir_pages(st
return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
 }
 
-static int dir_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-   struct inode *dir = (struct inode *)page->mapping->host;
+   struct address_space *mapping = page->mapping;
+   struct inode *dir = mapping->host;
int err = 0;
-   page->mapping->a_ops->commit_write(NULL, page, from, to);
+   block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
@@ -220,7 +222,7 @@ int minix_add_link(struct dentry *dentry
char *kaddr, *p;
minix_dirent *de;
minix3_dirent *de3;
-   unsigned from, to;
+   loff_t pos;
int err;
char *namx = NULL;
__u32 inumber;
@@ -272,9 +274,9 @@ int minix_add_link(struct dentry *dentry
return -EINVAL;
 
 got_it:
-   from = p - (char*)page_address(page);
-   to = from + sbi->s_dirsize;
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   pos = (page->index << PAGE_CACHE_SHIFT) + p - (char*)page_address(page);
+   err = __minix_write_begin(NULL, page->mapping, pos, sbi->s_dirsize,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
if (err)
goto out_unlock;
memcpy (namx, name, namelen);
@@ -285,7 +287,7 @@ got_it:
memset (namx + namelen, 0, sbi->s_dirsize - namelen - 2);
de->inode = inode->i_ino;
}
-   err = dir_commit_chunk(page, from, to);
+   err = dir_commit_chunk(page, pos, sbi->s_dirsize);
dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(dir);
 out_put:
@@ -302,15 +304,16 @@ int minix_delete_entry(struct minix_dir_
struct address_space *mapping = page->mapping;
struct inode *inode = (struct inode*)mapping->host;
char *kaddr = page_address(page);
-   unsigned from = (char*)de - kaddr;
-   unsigned to = from + minix_sb(inode->i_sb)->s_dirsize;
+   loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char*)de - kaddr;
+   unsigned len = minix_sb(inode->i_sb)->s_dirsize;
int err;
 
lock_page(page);
-   err = mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __minix_write_begin(NULL, mapping, pos, len,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
if (err == 0) {
de->inode = 0;
-

[patch 24/44] affs convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/affs/file.c |  106 +++--
 1 file changed, 58 insertions(+), 48 deletions(-)

Index: linux-2.6/fs/affs/file.c
===
--- linux-2.6.orig/fs/affs/file.c
+++ linux-2.6/fs/affs/file.c
@@ -395,25 +395,33 @@ static int affs_writepage(struct page *p
 {
return block_write_full_page(page, affs_get_block, wbc);
 }
+
 static int affs_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page, affs_get_block);
 }
-static int affs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-   return cont_prepare_write(page, from, to, affs_get_block,
-   &AFFS_I(page->mapping->host)->mmu_private);
+
+static int affs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   affs_get_block,
+   &AFFS_I(mapping->host)->mmu_private);
 }
+
 static sector_t _affs_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,affs_get_block);
 }
+
 const struct address_space_operations affs_aops = {
.readpage = affs_readpage,
.writepage = affs_writepage,
.sync_page = block_sync_page,
-   .prepare_write = affs_prepare_write,
-   .commit_write = generic_commit_write,
+   .write_begin = affs_write_begin,
+   .write_end = generic_write_end,
.bmap = _affs_bmap
 };
 
@@ -603,58 +611,65 @@ affs_readpage_ofs(struct file *file, str
return err;
 }
 
-static int affs_prepare_write_ofs(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-   struct inode *inode = page->mapping->host;
-   u32 size, offset;
-   u32 tmp;
+static int affs_write_begin_ofs(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   struct inode *inode = mapping->host;
+   struct page *page;
+   pgoff_t index;
int err = 0;
 
-   pr_debug("AFFS: prepare_write(%u, %ld, %d, %d)\n", (u32)inode->i_ino, page->index, from, to);
-   offset = page->index << PAGE_CACHE_SHIFT;
-   if (offset + from > AFFS_I(inode)->mmu_private) {
-   err = affs_extent_file_ofs(inode, offset + from);
+   pr_debug("AFFS: write_begin(%u, %llu, %llu)\n", (u32)inode->i_ino, (unsigned long long)pos, (unsigned long long)pos + len);
+   if (pos > AFFS_I(inode)->mmu_private) {
+   /* XXX: this probably leaves a too-big i_size in case of
+* failure. Should really be updating i_size at write_end time
+*/
+   err = affs_extent_file_ofs(inode, pos);
if (err)
return err;
}
-   size = inode->i_size;
+
+   index = pos >> PAGE_CACHE_SHIFT;
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
 
if (PageUptodate(page))
return 0;
 
-   if (from) {
-   err = affs_do_readpage_ofs(file, page, 0, from);
-   if (err)
-   return err;
-   }
-   if (to < PAGE_CACHE_SIZE) {
-   char *kaddr = kmap_atomic(page, KM_USER0);
-
-   memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-   flush_dcache_page(page);
-   kunmap_atomic(kaddr, KM_USER0);
-   if (size > offset + to) {
-   if (size < offset + PAGE_CACHE_SIZE)
-   tmp = size & ~PAGE_CACHE_MASK;
-   else
-   tmp = PAGE_CACHE_SIZE;
-   err = affs_do_readpage_ofs(file, page, to, tmp);
-   }
+   /* XXX: inefficient but safe in the face of short writes */
+   err = affs_do_readpage_ofs(file, page, 0, PAGE_CACHE_SIZE);
+   if (err) {
+   unlock_page(page);
+   page_cache_release(page);
}
return err;
 }
 
-static int affs_commit_write_ofs(struct file *file, struct page *page, unsigned from, unsigned to)
+static int affs_write_end_ofs(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = mapping->host;
struct super_block *sb = inode->i_sb;
struct buffer_head *bh, *prev_bh;

[patch 25/44] hfs convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hfs/extent.c |   19 ---
 fs/hfs/inode.c  |   20 
 2 files changed, 20 insertions(+), 19 deletions(-)

Index: linux-2.6/fs/hfs/inode.c
===
--- linux-2.6.orig/fs/hfs/inode.c
+++ linux-2.6/fs/hfs/inode.c
@@ -34,10 +34,14 @@ static int hfs_readpage(struct file *fil
return block_read_full_page(page, hfs_get_block);
 }
 
-static int hfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-   return cont_prepare_write(page, from, to, hfs_get_block,
- &HFS_I(page->mapping->host)->phys_size);
+static int hfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   hfs_get_block,
+   &HFS_I(mapping->host)->phys_size);
 }
 
 static sector_t hfs_bmap(struct address_space *mapping, sector_t block)
@@ -118,8 +122,8 @@ const struct address_space_operations hf
.readpage   = hfs_readpage,
.writepage  = hfs_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfs_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfs_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfs_bmap,
.releasepage= hfs_releasepage,
 };
@@ -128,8 +132,8 @@ const struct address_space_operations hf
.readpage   = hfs_readpage,
.writepage  = hfs_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = hfs_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= hfs_write_begin,
+   .write_end  = generic_write_end,
.bmap   = hfs_bmap,
.direct_IO  = hfs_direct_IO,
.writepages = hfs_writepages,
Index: linux-2.6/fs/hfs/extent.c
===
--- linux-2.6.orig/fs/hfs/extent.c
+++ linux-2.6/fs/hfs/extent.c
@@ -464,23 +464,20 @@ void hfs_file_truncate(struct inode *ino
   (long long)HFS_I(inode)->phys_size, inode->i_size);
if (inode->i_size > HFS_I(inode)->phys_size) {
struct address_space *mapping = inode->i_mapping;
+   void *fsdata;
struct page *page;
int res;
 
+   /* XXX: Can use generic_cont_expand? */
size = inode->i_size - 1;
-   page = grab_cache_page(mapping, size >> PAGE_CACHE_SHIFT);
-   if (!page)
-   return;
-   size &= PAGE_CACHE_SIZE - 1;
-   size++;
-   res = mapping->a_ops->prepare_write(NULL, page, size, size);
-   if (!res)
-   res = mapping->a_ops->commit_write(NULL, page, size, size);
+   res = pagecache_write_begin(NULL, mapping, size+1, 0,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+   if (!res) {
+   res = pagecache_write_end(NULL, mapping, size+1, 0, 0,
+   page, fsdata);
+   }
if (res)
inode->i_size = HFS_I(inode)->phys_size;
-   unlock_page(page);
-   page_cache_release(page);
-   mark_inode_dirty(inode);
return;
} else if (inode->i_size == HFS_I(inode)->phys_size)
return;

-- 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 30/44] nfs convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/nfs/file.c |   49 -
 1 file changed, 36 insertions(+), 13 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -282,27 +282,50 @@ nfs_fsync(struct file *file, struct dent
 }
 
 /*
- * This does the real work of the write. The generic routine has
- * allocated the page, locked it, done all the page alignment stuff
- * calculations etc. Now we should just copy the data from user
- * space and write it back to the real medium..
+ * This does the real work of the write. We must allocate and lock the
+ * page to be sent back to the generic routine, which then copies the
+ * data from user space.
  *
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
  */
-static int nfs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
-{
-   return nfs_flush_incompatible(file, page);
+static int nfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   int ret;
+   pgoff_t index;
+   struct page *page;
+   index = pos >> PAGE_CACHE_SHIFT;
+
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
+
+   ret = nfs_flush_incompatible(file, page);
+   if (ret) {
+   unlock_page(page);
+   page_cache_release(page);
+   }
+   return ret;
 }
 
-static int nfs_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int nfs_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   long status;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+   int status;
 
lock_kernel();
-   status = nfs_updatepage(file, page, offset, to-offset);
+   status = nfs_updatepage(file, page, offset, copied);
unlock_kernel();
-   return status;
+
+   unlock_page(page);
+   page_cache_release(page);
+
+   return status < 0 ? status : copied;
 }
 
 static void nfs_invalidate_page(struct page *page, unsigned long offset)
@@ -330,8 +353,8 @@ const struct address_space_operations nf
.set_page_dirty = nfs_set_page_dirty,
.writepage = nfs_writepage,
.writepages = nfs_writepages,
-   .prepare_write = nfs_prepare_write,
-   .commit_write = nfs_commit_write,
+   .write_begin = nfs_write_begin,
+   .write_end = nfs_write_end,
.invalidatepage = nfs_invalidate_page,
.releasepage = nfs_release_page,
 #ifdef CONFIG_NFS_DIRECTIO


[patch 23/44] adfs convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/adfs/inode.c |   14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6/fs/adfs/inode.c
===
--- linux-2.6.orig/fs/adfs/inode.c
+++ linux-2.6/fs/adfs/inode.c
@@ -61,10 +61,14 @@ static int adfs_readpage(struct file *fi
return block_read_full_page(page, adfs_get_block);
 }
 
-static int adfs_prepare_write(struct file *file, struct page *page, unsigned int from, unsigned int to)
+static int adfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return cont_prepare_write(page, from, to, adfs_get_block,
-   &ADFS_I(page->mapping->host)->mmu_private);
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   adfs_get_block,
+   &ADFS_I(mapping->host)->mmu_private);
 }
 
 static sector_t _adfs_bmap(struct address_space *mapping, sector_t block)
@@ -76,8 +80,8 @@ static const struct address_space_operat
.readpage   = adfs_readpage,
.writepage  = adfs_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = adfs_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= adfs_write_begin,
+   .write_end  = generic_write_end,
.bmap   = _adfs_bmap
 };
 


[patch 41/44] udf convert to new aops

2007-04-23 Thread Nick Piggin

Convert udf to new aops. This also appears to fix a pagecache corruption in
udf_adinicb_commit_write, where the page was marked uptodate when it was not.
It also removes the silly setup where prepare_write did a kmap to be used in
commit_write: just do kmap_atomic in write_end. Use libfs helpers to make
this easier.
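
As a rough illustration of the write_end side described above, here is a
userspace sketch, not kernel code. The names toy_inode, TOY_PAGE_SIZE and
toy_adinicb_write_end are hypothetical, the page is a plain buffer, and the
i_size update is only an assumption about what the libfs helper does after
the copy.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u	/* assumed stand-in for PAGE_CACHE_SIZE */

/* Hypothetical inode whose file data is stored inline (as in an ICB). */
struct toy_inode {
	char data[TOY_PAGE_SIZE];
	long long i_size;
};

/*
 * Copy only the bytes that were really written ("copied") from the page
 * buffer into the inline data area, then grow i_size if the write went
 * past the old end of file. Returns the number of bytes committed.
 */
static int toy_adinicb_write_end(struct toy_inode *inode, const char *page,
				 long long pos, unsigned copied)
{
	/* Offset of the write inside the single page. */
	unsigned offset = (unsigned)(pos & (TOY_PAGE_SIZE - 1));

	assert(offset + copied <= TOY_PAGE_SIZE);
	memcpy(inode->data + offset, page + offset, copied);

	/* Assumption: this mirrors the size update done by the libfs helper. */
	if (pos + copied > inode->i_size)
		inode->i_size = pos + copied;
	return (int)copied;
}
```

The point of the sketch is that the length comes from "copied" rather than
from a precomputed "to", so a short copy from user space commits only what
was actually transferred.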

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/udf/file.c  |   32 +---
 fs/udf/inode.c |   11 +++
 2 files changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/udf/file.c
===
--- linux-2.6.orig/fs/udf/file.c
+++ linux-2.6/fs/udf/file.c
@@ -73,34 +73,28 @@ static int udf_adinicb_writepage(struct 
return 0;
 }
 
-static int udf_adinicb_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int udf_adinicb_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   kmap(page);
-   return 0;
-}
-
-static int udf_adinicb_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
-{
-   struct inode *inode = page->mapping->host;
-   char *kaddr = page_address(page);
+   struct inode *inode = mapping->host;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+   char *kaddr;
 
+   kaddr = kmap_atomic(page, KM_USER0);
memcpy(UDF_I_DATA(inode) + UDF_I_LENEATTR(inode) + offset,
-   kaddr + offset, to - offset);
-   mark_inode_dirty(inode);
-   SetPageUptodate(page);
-   kunmap(page);
-   /* only one page here */
-   if (to > inode->i_size)
-   inode->i_size = to;
-   return 0;
+   kaddr + offset, copied);
+   kunmap_atomic(kaddr, KM_USER0);
+
+   return simple_write_end(file, mapping, pos, len, copied, page, fsdata);
 }
 
 const struct address_space_operations udf_adinicb_aops = {
.readpage   = udf_adinicb_readpage,
.writepage  = udf_adinicb_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = udf_adinicb_prepare_write,
-   .commit_write   = udf_adinicb_commit_write,
+   .write_begin= simple_write_begin,
+   .write_end  = udf_adinicb_write_end,
 };
 
 static ssize_t udf_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
Index: linux-2.6/fs/udf/inode.c
===
--- linux-2.6.orig/fs/udf/inode.c
+++ linux-2.6/fs/udf/inode.c
@@ -122,9 +122,12 @@ static int udf_readpage(struct file *fil
return block_read_full_page(page, udf_get_block);
 }
 
-static int udf_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int udf_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page, from, to, udf_get_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   udf_get_block);
 }
 
 static sector_t udf_bmap(struct address_space *mapping, sector_t block)
@@ -136,8 +139,8 @@ const struct address_space_operations ud
.readpage   = udf_readpage,
.writepage  = udf_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = udf_prepare_write,
-   .commit_write   = generic_commit_write,
+   .write_begin= udf_write_begin,
+   .write_end  = generic_write_end,
.bmap   = udf_bmap,
 };
 


[patch 35/44] ecryptfs convert to new aops

2007-04-23 Thread Nick Piggin

Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/ecryptfs/crypto.c  |   32 +++---
 fs/ecryptfs/ecryptfs_kernel.h |4 
 fs/ecryptfs/mmap.c|  213 +++---
 3 files changed, 119 insertions(+), 130 deletions(-)

Index: linux-2.6/fs/ecryptfs/mmap.c
===
--- linux-2.6.orig/fs/ecryptfs/mmap.c
+++ linux-2.6/fs/ecryptfs/mmap.c
@@ -36,26 +36,6 @@
 
 struct kmem_cache *ecryptfs_lower_page_cache;
 
-/**
- * ecryptfs_get1page
- *
- * Get one page from cache or lower f/s, return error otherwise.
- *
- * Returns unlocked and up-to-date page (if ok), with increased
- * refcnt.
- */
-static struct page *ecryptfs_get1page(struct file *file, int index)
-{
-   struct dentry *dentry;
-   struct inode *inode;
-   struct address_space *mapping;
-
-   dentry = file->f_path.dentry;
-   inode = dentry->d_inode;
-   mapping = inode->i_mapping;
-   return read_mapping_page(mapping, index, (void *)file);
-}
-
 static
 int write_zeros(struct file *file, pgoff_t index, int start, int num_zeros);
 
@@ -360,17 +340,14 @@ out:
 /**
  * Called with lower inode mutex held.
  */
-static int fill_zeros_to_end_of_page(struct page *page, unsigned int to)
+static int fill_zeros_to_end_of_page(struct page *page, loff_t new_isize)
 {
-   struct inode *inode = page->mapping->host;
int end_byte_in_page;
char *page_virt;
 
-   if ((i_size_read(inode) / PAGE_CACHE_SIZE) != page->index)
+   if ((new_isize >> PAGE_CACHE_SHIFT) != page->index)
goto out;
-   end_byte_in_page = i_size_read(inode) % PAGE_CACHE_SIZE;
-   if (to > end_byte_in_page)
-   end_byte_in_page = to;
+   end_byte_in_page = new_isize % PAGE_CACHE_SIZE;
page_virt = kmap_atomic(page, KM_USER0);
memset((page_virt + end_byte_in_page), 0,
   (PAGE_CACHE_SIZE - end_byte_in_page));
@@ -380,16 +357,35 @@ out:
return 0;
 }
 
-static int ecryptfs_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ecryptfs_write_begin(struct file *file,struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
+   struct page *page;
+   pgoff_t index;
int rc = 0;
 
-   if (from == 0 && to == PAGE_CACHE_SIZE)
-   goto out;   /* If we are writing a full page, it will be
-  up to date. */
-   if (!PageUptodate(page))
-   rc = ecryptfs_do_readpage(file, page, page->index);
+   index = pos >> PAGE_CACHE_SHIFT;
+   page = __grab_cache_page(mapping, index);
+   if (!page) {
+   rc = -ENOMEM;
+   goto out;
+   }
+
+   /*
+* If we are writing a full page (with no possibility of a short
+* write), it will be guaranteed to end up being uptodate at
+* write_end-time
+*/
+   if (flags & AOP_FLAG_UNINTERRUPTIBLE && len == PAGE_CACHE_SIZE)
+   goto out;
+   if (!PageUptodate(page)) {
+   rc = ecryptfs_do_readpage(file, page, index);
+   if (rc) {
+   unlock_page(page);
+   page_cache_release(page);
+   }
+   }
 out:
return rc;
 }
@@ -412,12 +408,6 @@ out:
return rc;
 }
 
-static void ecryptfs_release_lower_page(struct page *lower_page)
-{
-   unlock_page(lower_page);
-   page_cache_release(lower_page);
-}
-
 /**
  * ecryptfs_write_inode_size_to_header
  *
@@ -431,23 +421,17 @@ static int ecryptfs_write_inode_size_to_
 {
int rc = 0;
struct page *header_page;
+   void *fsdata;
char *header_virt;
-   const struct address_space_operations *lower_a_ops;
+   struct address_space *lower_mapping = lower_inode->i_mapping;
u64 file_size;
 
-   header_page = grab_cache_page(lower_inode->i_mapping, 0);
-   if (!header_page) {
-   ecryptfs_printk(KERN_ERR, "grab_cache_page for "
-   "lower_page_index 0 failed\n");
-   rc = -EINVAL;
-   goto out;
-   }
-   lower_a_ops = lower_inode->i_mapping->a_ops;
-   rc = lower_a_ops->prepare_write(lower_file, header_page, 0, 8);
-   if (rc) {
-   ecryptfs_release_lower_page(header_page);
+   rc = pagecache_write_begin(lower_file, lower_mapping, 0, sizeof(u64),
+   AOP_FLAG_UNINTERRUPTIBLE,
+   &header_page, &fsdata);
+   if (rc)
goto out;
-   }
+
file_size = (u64)i_size_read(inode);
ecryptfs_printk(KERN_DEBUG, "Writing size: [0x%.16x]\n",

[patch 32/44] ocfs2: convert to new aops

2007-04-23 Thread Nick Piggin

From: Mark Fasheh [EMAIL PROTECTED]

Fix up ocfs2 to use ->write_begin and ->write_end. This lets us dump a large
amount of code which was implementing our own write path while preserving
the nice locking rules that were gained by moving away from ->prepare_write.

It makes use of the context back pointer to store information related to the
write which the vfs normally doesn't know about. Most importantly this is an
array of zero'd pages which might have to be written out for an allocating
write. Of note is that I also stick the journal handle on there. Ocfs2 could
use current->journal_info for that, but I think it's much cleaner to just pass
that around as a file system specific context.

I tested this on a couple nodes and things seem to be running smoothly.

A couple of notes:

* The ocfs2 write context is probably a bit big. I'm much more concerned
  with readability though as Ocfs2 has much more baggage to carry around
  than other file systems.

* A ton of code was deleted :) The patch adds a bunch too, but that's mostly
  getting the old stuff to flow with ->write_begin. Some assumptions about
  the size of the write that were made with my previous implementation were
  no longer true (this is good).

* I could probably clean this up some more, but I'd be fine if the patch
  went upstream as-is. Diff seems to have mangled this patch file enough
  that it's probably much easier to read once applied.

* This doesn't use ->perform_write (yet), so stuff is still being copied one
  page at a time. I _think_ things are pretty reasonably set up to allow
  larger writes though...
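
The context back pointer mentioned above can be pictured with a small
userspace sketch. Everything here is hypothetical (toy_write_ctxt,
toy_write_begin, toy_write_end, the handle field); it only shows the shape
of the idea, namely that write_begin allocates per-write state and hands it
back through fsdata so that write_end can pick it up again. It is not the
ocfs2 implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-write context, loosely in the spirit of a write ctxt. */
struct toy_write_ctxt {
	long long pos;		/* file position of the write */
	unsigned len;		/* requested length */
	void *handle;		/* stand-in for a journal handle */
};

/* Allocate the context and return it to the caller through *fsdata. */
static int toy_write_begin(long long pos, unsigned len, void **fsdata)
{
	struct toy_write_ctxt *wc = calloc(1, sizeof(*wc));

	if (!wc)
		return -1;	/* kernel code would return -ENOMEM here */
	wc->pos = pos;
	wc->len = len;
	*fsdata = wc;
	return 0;
}

/* Receive the same context back, finish the write, and free the context. */
static int toy_write_end(long long pos, unsigned len, unsigned copied,
			 void *fsdata)
{
	struct toy_write_ctxt *wc = fsdata;

	assert(wc && wc->pos == pos && wc->len == len);
	assert(copied <= len);
	free(wc);
	return (int)copied;	/* bytes actually committed */
}
```

Passing the state explicitly, rather than parking it in a per-task field,
keeps the begin/end pairing visible at the call site, which is the
readability argument made above.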

Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Mark Fasheh [EMAIL PROTECTED]

 fs/ocfs2/aops.c |  779 +++-
 fs/ocfs2/aops.h |   52 ---
 fs/ocfs2/file.c |  246 -
 3 files changed, 453 insertions(+), 624 deletions(-)

Index: linux-2.6/fs/ocfs2/aops.c
===
--- linux-2.6.orig/fs/ocfs2/aops.c
+++ linux-2.6/fs/ocfs2/aops.c
@@ -677,6 +677,8 @@ int ocfs2_map_page_blocks(struct page *p
 bh = bh->b_this_page, block_start += bsize) {
block_end = block_start + bsize;
 
+   clear_buffer_new(bh);
+
/*
 * Ignore blocks outside of our i/o range -
 * they may belong to unallocated clusters.
@@ -691,9 +693,8 @@ int ocfs2_map_page_blocks(struct page *p
 * For an allocating write with cluster size >= page
 * size, we always write the entire page.
 */
-
-   if (buffer_new(bh))
-   clear_buffer_new(bh);
+   if (new)
+   set_buffer_new(bh);
 
if (!buffer_mapped(bh)) {
map_bh(bh, inode->i_sb, *p_blkno);
@@ -754,217 +755,187 @@ next_bh:
return ret;
 }
 
+#if (PAGE_CACHE_SIZE >= OCFS2_MAX_CLUSTERSIZE)
+#define OCFS2_MAX_CTXT_PAGES   1
+#else
+#define OCFS2_MAX_CTXT_PAGES   (OCFS2_MAX_CLUSTERSIZE / PAGE_CACHE_SIZE)
+#endif
+
+#define OCFS2_MAX_CLUSTERS_PER_PAGE (PAGE_CACHE_SIZE / OCFS2_MIN_CLUSTERSIZE)
+
 /*
- * This will copy user data from the buffer page in the splice
- * context.
- *
- * For now, we ignore SPLICE_F_MOVE as that would require some extra
- * communication out all the way to ocfs2_write().
+ * Describe the state of a single cluster to be written to.
  */
-int ocfs2_map_and_write_splice_data(struct inode *inode,
- struct ocfs2_write_ctxt *wc, u64 *p_blkno,
- unsigned int *ret_from, unsigned int *ret_to)
-{
-   int ret;
-   unsigned int to, from, cluster_start, cluster_end;
-   char *src, *dst;
-   struct ocfs2_splice_write_priv *sp = wc->w_private;
-   struct pipe_buffer *buf = sp->s_buf;
-   unsigned long bytes, src_from;
-   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+struct ocfs2_write_cluster_desc {
+   u32 c_cpos;
+   u32 c_phys;
+   /*
+* Give this a unique field because c_phys eventually gets
+* filled.
+*/
+   unsigned c_new;
+};
 
-   ocfs2_figure_cluster_boundaries(osb, wc->w_cpos, &cluster_start,
-   &cluster_end);
+struct ocfs2_write_ctxt {
+   /* Logical cluster position / len of write */
+   u32 w_cpos;
+   u32 w_clen;
 
-   from = sp->s_offset;
-   src_from = sp->s_buf_offset;
-   bytes = wc->w_count;
+   struct ocfs2_write_cluster_desc w_desc[OCFS2_MAX_CLUSTERS_PER_PAGE];
 
-   if (wc->w_large_pages) {
-   /*
-* For cluster size < page size, we have to
-* the rightmost boundary.
-*/
-   bytes = min(bytes, (unsigned

[patch 21/44] fs: new cont helpers

2007-04-23 Thread Nick Piggin

Rework the generic block cont routines to handle the new aops.
Keeping cont_prepare_write working would take quite a lot of code, so remove
it instead (all of its users are converted to the new helpers later in the
series).

write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
generic_cont_expand, so filesystems can avoid the old hacks they used.
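
The size adjustment that generic_cont_expand keeps in the patch below can be
checked in isolation. This is a userspace sketch of just that arithmetic,
assuming a 4096-byte page and a power-of-two block size;
toy_cont_expand_size and TOY_PAGE_SIZE are hypothetical names, not kernel
symbols.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u	/* assumed stand-in for PAGE_CACHE_SIZE */

/*
 * If the new size lands exactly on a block boundary, bump it by one byte,
 * as generic_cont_expand does, so the offset handed to write_begin is never
 * the start of a block. The caller has to account for the extra byte.
 */
static long long toy_cont_expand_size(long long size, unsigned blocksize)
{
	/* Offset of "size" within its page. */
	unsigned offset = (unsigned)(size & (TOY_PAGE_SIZE - 1));

	/* The mask trick below only works for power-of-two block sizes. */
	assert(blocksize && (blocksize & (blocksize - 1)) == 0);

	if ((offset & (blocksize - 1)) == 0)
		size++;
	return size;
}
```

For example, with 1024-byte blocks a size of 2048 becomes 2049, while a size
of 1000 is passed through unchanged.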

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/buffer.c |  204 +---
 include/linux/buffer_head.h |5 -
 include/linux/fs.h  |1 
 mm/filemap.c|5 +
 4 files changed, 110 insertions(+), 105 deletions(-)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2027,6 +2027,7 @@ int generic_write_end(struct file *file,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
 {
+   struct inode *inode = mapping->host;
copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
unlock_page(page);
@@ -2041,6 +2042,8 @@ int generic_write_end(struct file *file,
i_size_write(inode, pos+copied);
mark_inode_dirty(inode);
}
+
+   return copied;
 }
 EXPORT_SYMBOL(generic_write_end);
 
@@ -2142,14 +2145,14 @@ int block_read_full_page(struct page *pa
 }
 
 /* utility function for filesystems that need to do work on expanding
- * truncates.  Uses prepare/commit_write to allow the filesystem to
+ * truncates.  Uses filesystem pagecache writes to allow the filesystem to
  * deal with the hole.  
  */
-static int __generic_cont_expand(struct inode *inode, loff_t size,
-pgoff_t index, unsigned int offset)
+int generic_cont_expand_simple(struct inode *inode, loff_t size)
 {
struct address_space *mapping = inode->i_mapping;
struct page *page;
+   void *fsdata;
unsigned long limit;
int err;
 
@@ -2162,146 +2165,141 @@ static int __generic_cont_expand(struct 
if (size > inode->i_sb->s_maxbytes)
goto out;
 
-   err = -ENOMEM;
-   page = grab_cache_page(mapping, index);
-   if (!page)
-   goto out;
-   err = mapping->a_ops->prepare_write(NULL, page, offset, offset);
-   if (err) {
-   /*
-* ->prepare_write() may have instantiated a few blocks
-* outside i_size.  Trim these off again.
-*/
-   unlock_page(page);
-   page_cache_release(page);
-   vmtruncate(inode, inode->i_size);
+   err = pagecache_write_begin(NULL, mapping, size, 0,
+   AOP_FLAG_UNINTERRUPTIBLE|AOP_FLAG_CONT_EXPAND,
+   &page, &fsdata);
+   if (err)
goto out;
-   }
 
-   err = mapping->a_ops->commit_write(NULL, page, offset, offset);
+   err = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
+   BUG_ON(err > 0);
 
-   unlock_page(page);
-   page_cache_release(page);
-   if (err > 0)
-   err = 0;
 out:
return err;
 }
 
 int generic_cont_expand(struct inode *inode, loff_t size)
 {
-   pgoff_t index;
unsigned int offset;
 
offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */
 
/* ugh.  in prepare/commit_write, if from==to==start of block, we
-   ** skip the prepare.  make sure we never send an offset for the start
-   ** of a block
-   */
+* skip the prepare.  make sure we never send an offset for the start
+* of a block.
+* XXX: actually, this should be handled in those filesystems by
+* checking for the AOP_FLAG_CONT_EXPAND flag.
+*/
if ((offset & (inode->i_sb->s_blocksize - 1)) == 0) {
/* caller must handle this extra byte. */
-   offset++;
+   size++;
}
-   index = size >> PAGE_CACHE_SHIFT;
-
-   return __generic_cont_expand(inode, size, index, offset);
+   return generic_cont_expand_simple(inode, size);
 }
 
-int generic_cont_expand_simple(struct inode *inode, loff_t size)
+int cont_expand_zero(struct file *file, struct address_space *mapping,
+   loff_t pos, loff_t *bytes)
 {
-   loff_t pos = size - 1;
-   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-   unsigned int offset = (pos & (PAGE_CACHE_SIZE - 1)) + 1;
-
-   /* prepare/commit_write can handle even if from==to==start of block. */
-   return __generic_cont_expand(inode, size, index, offset);
-}
-
-/*
- * For moronic filesystems that do not allow holes in file.
- * We may have to extend the file.
- */
-
-int cont_prepare_write(struct page *page, unsigned offset,
-   unsigned to, get_block_t *get_block, loff_t *bytes)
-{
-   struct

[patch 40/44] ufs convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/ufs/dir.c   |   50 +++---
 fs/ufs/inode.c |   23 +++
 2 files changed, 50 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/ufs/inode.c
===
--- linux-2.6.orig/fs/ufs/inode.c
+++ linux-2.6/fs/ufs/inode.c
@@ -558,24 +558,39 @@ static int ufs_writepage(struct page *pa
 {
return block_write_full_page(page,ufs_getfrag_block,wbc);
 }
+
 static int ufs_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,ufs_getfrag_block);
 }
-static int ufs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+int __ufs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page,from,to,ufs_getfrag_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   ufs_getfrag_block);
 }
+
+static int ufs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return __ufs_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t ufs_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,ufs_getfrag_block);
 }
+
 const struct address_space_operations ufs_aops = {
.readpage = ufs_readpage,
.writepage = ufs_writepage,
.sync_page = block_sync_page,
-   .prepare_write = ufs_prepare_write,
-   .commit_write = generic_commit_write,
+   .write_begin = ufs_write_begin,
+   .write_end = generic_write_end,
.bmap = ufs_bmap
 };
 
Index: linux-2.6/fs/ufs/dir.c
===
--- linux-2.6.orig/fs/ufs/dir.c
+++ linux-2.6/fs/ufs/dir.c
@@ -38,12 +38,14 @@ static inline int ufs_match(struct super
return !memcmp(name, de->d_name, len);
 }
 
-static int ufs_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-   struct inode *dir = page->mapping->host;
+   struct address_space *mapping = page->mapping;
+   struct inode *dir = mapping->host;
int err = 0;
+
dir->i_version++;
-   page->mapping->a_ops->commit_write(NULL, page, from, to);
+   block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
@@ -81,16 +83,20 @@ ino_t ufs_inode_by_name(struct inode *di
 void ufs_set_link(struct inode *dir, struct ufs_dir_entry *de,
  struct page *page, struct inode *inode)
 {
-   unsigned from = (char *) de - (char *) page_address(page);
-   unsigned to = from + fs16_to_cpu(dir->i_sb, de->d_reclen);
+   loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+   (char *) de - (char *) page_address(page);
+   unsigned len = fs16_to_cpu(dir->i_sb, de->d_reclen);
int err;
 
lock_page(page);
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __ufs_write_begin(NULL, page->mapping, pos, len,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
BUG_ON(err);
+
de->d_ino = cpu_to_fs32(dir->i_sb, inode->i_ino);
ufs_set_de_type(dir->i_sb, de, inode->i_mode);
-   err = ufs_commit_chunk(page, from, to);
+
+   err = ufs_commit_chunk(page, pos, len);
ufs_put_page(page);
dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(dir);
@@ -312,7 +318,7 @@ int ufs_add_link(struct dentry *dentry, 
unsigned long npages = ufs_dir_pages(dir);
unsigned long n;
char *kaddr;
-   unsigned from, to;
+   loff_t pos;
int err;
 
UFSD("ENTER, name %s, namelen %u\n", name, namelen);
@@ -367,9 +373,10 @@ int ufs_add_link(struct dentry *dentry, 
return -EINVAL;
 
 got_it:
-   from = (char*)de - (char*)page_address(page);
-   to = from + rec_len;
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   pos = (page->index << PAGE_CACHE_SHIFT) +
+   (char*)de - (char*)page_address(page);
+   err = __ufs_write_begin(NULL, page->mapping, pos, rec_len,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
if (err)
goto out_unlock;
if (de->d_ino) {
@@ -386,7 +393,7 @@ got_it:
de->d_ino = cpu_to_fs32(sb, inode->i_ino);
ufs_set_de_type(sb, de, inode->i_mode);
 
-   err = ufs_commit_chunk(page, from,

[patch 42/44] sysv convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/sysv/dir.c   |   45 +
 fs/sysv/itree.c |   23 +++
 2 files changed, 44 insertions(+), 24 deletions(-)

Index: linux-2.6/fs/sysv/itree.c
===
--- linux-2.6.orig/fs/sysv/itree.c
+++ linux-2.6/fs/sysv/itree.c
@@ -453,23 +453,38 @@ static int sysv_writepage(struct page *p
 {
return block_write_full_page(page,get_block,wbc);
 }
+
 static int sysv_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,get_block);
 }
-static int sysv_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+int __sysv_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   return block_prepare_write(page,from,to,get_block);
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   get_block);
 }
+
+static int sysv_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return __sysv_write_begin(file, mapping, pos, len, flags, pagep, fsdata);
+}
+
 static sector_t sysv_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,get_block);
 }
+
 const struct address_space_operations sysv_aops = {
.readpage = sysv_readpage,
.writepage = sysv_writepage,
.sync_page = block_sync_page,
-   .prepare_write = sysv_prepare_write,
-   .commit_write = generic_commit_write,
+   .write_begin = sysv_write_begin,
+   .write_end = generic_write_end,
.bmap = sysv_bmap
 };
Index: linux-2.6/fs/sysv/dir.c
===
--- linux-2.6.orig/fs/sysv/dir.c
+++ linux-2.6/fs/sysv/dir.c
@@ -37,12 +37,13 @@ static inline unsigned long dir_pages(st
return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
 }
 
-static int dir_commit_chunk(struct page *page, unsigned from, unsigned to)
+static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len)
 {
-   struct inode *dir = (struct inode *)page->mapping->host;
+   struct address_space *mapping = page->mapping;
+   struct inode *dir = mapping->host;
int err = 0;
 
-   page->mapping->a_ops->commit_write(NULL, page, from, to);
+   block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
@@ -186,7 +187,7 @@ int sysv_add_link(struct dentry *dentry,
unsigned long npages = dir_pages(dir);
unsigned long n;
char *kaddr;
-   unsigned from, to;
+   loff_t pos;
int err;
 
/* We take care of directory expansion in the same loop */
@@ -212,16 +213,17 @@ int sysv_add_link(struct dentry *dentry,
return -EINVAL;
 
 got_it:
-   from = (char*)de - (char*)page_address(page);
-   to = from + SYSV_DIRSIZE;
+   pos = (page->index << PAGE_CACHE_SHIFT) +
+   (char*)de - (char*)page_address(page);
lock_page(page);
-   err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __sysv_write_begin(NULL, page->mapping, pos, SYSV_DIRSIZE,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
if (err)
goto out_unlock;
memcpy (de->name, name, namelen);
memset (de->name + namelen, 0, SYSV_DIRSIZE - namelen - 2);
de->inode = cpu_to_fs16(SYSV_SB(inode->i_sb), inode->i_ino);
-   err = dir_commit_chunk(page, from, to);
+   err = dir_commit_chunk(page, pos, SYSV_DIRSIZE);
dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(dir);
 out_page:
@@ -238,15 +240,15 @@ int sysv_delete_entry(struct sysv_dir_en
struct address_space *mapping = page->mapping;
struct inode *inode = (struct inode*)mapping->host;
char *kaddr = (char*)page_address(page);
-   unsigned from = (char*)de - kaddr;
-   unsigned to = from + SYSV_DIRSIZE;
+   loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char *)de - kaddr;
int err;
 
lock_page(page);
-   err = mapping->a_ops->prepare_write(NULL, page, from, to);
+   err = __sysv_write_begin(NULL, mapping, pos, SYSV_DIRSIZE,
+   AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
BUG_ON(err);
de->inode = 0;
-   err = dir_commit_chunk(page, from, to);
+   err = dir_commit_chunk(page, pos, SYSV_DIRSIZE);
dir_put_page(page);
inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;

[patch 34/44] fs: no AOP_TRUNCATED_PAGE for writes

2007-04-23 Thread Nick Piggin

prepare_write and commit_write no longer return AOP_TRUNCATED_PAGE now that
OCFS2 and GFS2 have been converted to the new aops, so the retry handling for
that return value can be removed.

Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 Documentation/filesystems/vfs.txt |6 -
 fs/ecryptfs/mmap.c|   39 +-
 include/linux/fs.h|2 -
 mm/filemap.c  |   21 +++-
 4 files changed, 20 insertions(+), 48 deletions(-)

Index: linux-2.6/Documentation/filesystems/vfs.txt
===
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -619,11 +619,7 @@ struct address_space_operations {
any basic-blocks on storage, then those blocks should be
pre-read (if they haven't been read already) so that the
updated blocks can be written out properly.
-   The page will be locked.  If prepare_write wants to unlock the
-   page it, like readpage, may do so and return
-   AOP_TRUNCATED_PAGE.
-   In this case the prepare_write will be retried one the lock is
-   regained.
+   The page will be locked.
 
Note: the page _must not_ be marked uptodate in this function
(or anywhere else) unless it actually is uptodate right now. As
Index: linux-2.6/fs/ecryptfs/mmap.c
===
--- linux-2.6.orig/fs/ecryptfs/mmap.c
+++ linux-2.6/fs/ecryptfs/mmap.c
@@ -412,11 +412,9 @@ out:
return rc;
 }
 
-static
-void ecryptfs_release_lower_page(struct page *lower_page, int page_locked)
+static void ecryptfs_release_lower_page(struct page *lower_page)
 {
-   if (page_locked)
-   unlock_page(lower_page);
+   unlock_page(lower_page);
page_cache_release(lower_page);
 }
 
@@ -437,7 +435,6 @@ static int ecryptfs_write_inode_size_to_
const struct address_space_operations *lower_a_ops;
u64 file_size;
 
-retry:
header_page = grab_cache_page(lower_inode->i_mapping, 0);
if (!header_page) {
ecryptfs_printk(KERN_ERR, "grab_cache_page for "
@@ -448,11 +445,7 @@ retry:
lower_a_ops = lower_inode->i_mapping->a_ops;
rc = lower_a_ops->prepare_write(lower_file, header_page, 0, 8);
if (rc) {
-   if (rc == AOP_TRUNCATED_PAGE) {
-   ecryptfs_release_lower_page(header_page, 0);
-   goto retry;
-   } else
-   ecryptfs_release_lower_page(header_page, 1);
+   ecryptfs_release_lower_page(header_page);
goto out;
}
file_size = (u64)i_size_read(inode);
@@ -466,11 +459,7 @@ retry:
if (rc < 0)
ecryptfs_printk(KERN_ERR, "Error commiting header page "
"write\n");
-   if (rc == AOP_TRUNCATED_PAGE) {
-   ecryptfs_release_lower_page(header_page, 0);
-   goto retry;
-   } else
-   ecryptfs_release_lower_page(header_page, 1);
+   ecryptfs_release_lower_page(header_page);
lower_inode->i_mtime = lower_inode->i_ctime = CURRENT_TIME;
mark_inode_dirty_sync(inode);
 out:
@@ -573,16 +562,11 @@ retry:
  byte_offset,
  region_bytes);
if (rc) {
-   if (rc == AOP_TRUNCATED_PAGE) {
-   ecryptfs_release_lower_page(*lower_page, 0);
-   goto retry;
-   } else {
-   ecryptfs_printk(KERN_ERR, "prepare_write for "
-   "lower_page_index = [0x%.16x] failed; rc = "
-   "[%d]\n", lower_page_index, rc);
-   ecryptfs_release_lower_page(*lower_page, 1);
-   (*lower_page) = NULL;
-   }
+   ecryptfs_printk(KERN_ERR, "prepare_write for "
+   "lower_page_index = [0x%.16x] failed; rc = "
+   "[%d]\n", lower_page_index, rc);
+   ecryptfs_release_lower_page(*lower_page);
+   (*lower_page) = NULL;
}
 out:
return rc;
@@ -598,19 +582,16 @@ ecryptfs_commit_lower_page(struct page *
   struct file *lower_file, int byte_offset,
   int region_size)
 {
-   int page_locked = 1;
int rc = 0;
 
rc = lower_inode->i_mapping->a_ops->commit_write(
lower_file, lower_page, byte_offset, region_size);
-   if (rc == AOP_TRUNCATED_PAGE)
-   page_locked = 0;
if (rc < 0) {
ecryptfs_printk(KERN_ERR,
"Error committing write; rc = [%d]\n", rc);
} else
rc = 0;
-   ecryptfs_release_lower_page(lower_page,

[patch 33/44] gfs2 convert to new aops

2007-04-23 Thread Nick Piggin

From: Steven Whitehouse [EMAIL PROTECTED]

(needs a SOB)

Cc: Linux Filesystems linux-fsdevel@vger.kernel.org

 fs/gfs2/ops_address.c |  209 +-
 1 file changed, 125 insertions(+), 84 deletions(-)

Index: linux-2.6/fs/gfs2/ops_address.c
===
--- linux-2.6.orig/fs/gfs2/ops_address.c
+++ linux-2.6/fs/gfs2/ops_address.c
@@ -17,6 +17,7 @@
 #include <linux/mpage.h>
 #include <linux/fs.h>
 #include <linux/writeback.h>
+#include <linux/swap.h>
 #include <linux/gfs2_ondisk.h>
 #include <linux/lm_interface.h>
 
@@ -337,45 +338,49 @@ out_unlock:
 }
 
 /**
- * gfs2_prepare_write - Prepare to write a page to a file
+ * gfs2_write_begin - Begin to write to a file
  * @file: The file to write to
- * @page: The page which is to be prepared for writing
- * @from: From (byte range within page)
- * @to: To (byte range within page)
+ * @mapping: The mapping in which to write
+ * @pos: The file offset at which to start writing
+ * @len: Length of the write
+ * @flags: Various flags
+ * @pagep: Pointer to return the page
+ * @fsdata: Pointer to return fs data (unused by GFS2)
  *
  * Returns: errno
  */
 
-static int gfs2_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int gfs2_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   struct gfs2_inode *ip = GFS2_I(page->mapping->host);
-   struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
+   struct gfs2_inode *ip = GFS2_I(mapping->host);
+   struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
unsigned int data_blocks, ind_blocks, rblocks;
int alloc_required;
int error = 0;
-   loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + from;
-   loff_t end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
struct gfs2_alloc *al;
-   unsigned int write_len = to - from;
-
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+   unsigned to = from + len;
+   struct page *page;
 
-   gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_ATIME|LM_FLAG_TRY_1CB, &ip->i_gh);
+   gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_ATIME, &ip->i_gh);
error = gfs2_glock_nq_atime(&ip->i_gh);
-   if (unlikely(error)) {
-   if (error == GLR_TRYFAILED) {
-   unlock_page(page);
-   error = AOP_TRUNCATED_PAGE;
-   yield();
-   }
+   if (unlikely(error))
goto out_uninit;
-   }
 
-   gfs2_write_calc_reserv(ip, write_len, &data_blocks, &ind_blocks);
+   error = -ENOMEM;
+   page = __grab_cache_page(mapping, index);
+   *pagep = page;
+   if (!page)
+   goto out_unlock;
+
+   gfs2_write_calc_reserv(ip, len, &data_blocks, &ind_blocks);
 
-   error = gfs2_write_alloc_required(ip, pos, write_len, &alloc_required);
+   error = gfs2_write_alloc_required(ip, pos, len, &alloc_required);
if (error)
-   goto out_unlock;
+   goto out_putpage;
 
 
ip->i_alloc.al_requested = 0;
@@ -407,7 +412,7 @@ static int gfs2_prepare_write(struct fil
goto out;
 
if (gfs2_is_stuffed(ip)) {
-   if (end > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
+   if (pos + len > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
error = gfs2_unstuff_dinode(ip, page);
if (error == 0)
goto prepare_write;
@@ -429,6 +434,10 @@ out_qunlock:
 out_alloc_put:
gfs2_alloc_put(ip);
}
+out_putpage:
+   page_cache_release(page);
+   if (pos + len > ip->i_inode.i_size)
+   vmtruncate(&ip->i_inode, ip->i_inode.i_size);
 out_unlock:
gfs2_glock_dq_m(1, &ip->i_gh);
 out_uninit:
@@ -439,96 +448,128 @@ out_uninit:
 }
 
 /**
- * gfs2_commit_write - Commit write to a file
+ * gfs2_stuffed_write_end - Write end for stuffed files
+ * @inode: The inode
+ * @dibh: The buffer_head containing the on-disk inode
+ * @pos: The file position
+ * @len: The length of the write
+ * @copied: How much was actually copied by the VFS
+ * @page: The page
+ *
+ * This copies the data from the page into the inode block after
+ * the inode data structure itself.
+ *
+ * Returns: errno
+ */
+static int gfs2_stuffed_write_end(struct inode *inode, struct buffer_head *dibh,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page)
+{
+   struct gfs2_inode *ip = GFS2_I(inode);
+   struct gfs2_sbd *sdp = GFS2_SB(inode);
+   u64 to = pos + copied;
+   void *kaddr;
+   unsigned char

[patch 36/44] fuse convert to new aops

2007-04-23 Thread Nick Piggin

[mszeredi]
 - don't send zero length write requests
 - it is not legal for the filesystem to return with zero written bytes
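
The two rules above can be illustrated with a small userspace model. This is only a sketch: write_result() and write_end_model() are hypothetical names, and the real logic lives in fuse_buffered_write() and fuse_write_end() in the diff below.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model of the result handling: err is the error reported
 * for the request (0 on success), nres is the number of bytes the
 * filesystem says it wrote.
 */
static int write_result(int err, size_t nres)
{
    /* A successful request that wrote nothing is treated as an I/O error. */
    if (!err && nres == 0)
        err = -EIO;
    /* Short writes are allowed: report how much was actually written. */
    return err ? err : (int)nres;
}

/* Hypothetical model of write_end: copied is what the VFS put in the page. */
static int write_end_model(unsigned copied, int err, size_t nres)
{
    /* Nothing was copied, so no zero-length write request is sent. */
    if (copied == 0)
        return 0;
    assert(nres <= copied);
    return write_result(err, nres);
}
```

In this model a short write (nres smaller than copied) is reported to the caller as is, which matches the change from "nres != count" to "!nres" in the error check.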

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

 fs/fuse/file.c |   48 +---
 1 file changed, 33 insertions(+), 15 deletions(-)

Index: linux-2.6/fs/fuse/file.c
===
--- linux-2.6.orig/fs/fuse/file.c
+++ linux-2.6/fs/fuse/file.c
@@ -443,22 +443,25 @@ static size_t fuse_send_write(struct fus
return outarg.size;
 }
 
-static int fuse_prepare_write(struct file *file, struct page *page,
- unsigned offset, unsigned to)
-{
-   /* No op */
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+   *pagep = __grab_cache_page(mapping, index);
+   if (!*pagep)
+   return -ENOMEM;
return 0;
 }
 
-static int fuse_commit_write(struct file *file, struct page *page,
-unsigned offset, unsigned to)
+static int fuse_buffered_write(struct file *file, struct inode *inode,
+  loff_t pos, unsigned count, struct page *page)
 {
int err;
size_t nres;
-   unsigned count = to - offset;
-   struct inode *inode = page->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
-   loff_t pos = page_offset(page) + offset;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
struct fuse_req *req;
 
if (is_bad_inode(inode))
@@ -474,20 +477,35 @@ static int fuse_commit_write(struct file
nres = fuse_send_write(req, file, inode, pos, count);
err = req->out.h.error;
fuse_put_request(fc, req);
-   if (!err && nres != count)
+   if (!err && !nres)
err = -EIO;
if (!err) {
-   pos += count;
+   pos += nres;
spin_lock(&fc->lock);
if (pos > inode->i_size)
i_size_write(inode, pos);
spin_unlock(&fc->lock);
 
-   if (offset == 0 && to == PAGE_CACHE_SIZE)
+   if (count == PAGE_CACHE_SIZE)
SetPageUptodate(page);
}
fuse_invalidate_attr(inode);
-   return err;
+   return err ? err : nres;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
+{
+   struct inode *inode = mapping->host;
+   int res = 0;
+
+   if (copied)
+   res = fuse_buffered_write(file, inode, pos, copied, page);
+
+   unlock_page(page);
+   page_cache_release(page);
+   return res;
 }
 
 static void fuse_release_user_pages(struct fuse_req *req, int write)
@@ -816,8 +834,8 @@ static const struct file_operations fuse
 
 static const struct address_space_operations fuse_file_aops  = {
.readpage   = fuse_readpage,
-   .prepare_write  = fuse_prepare_write,
-   .commit_write   = fuse_commit_write,
+   .write_begin= fuse_write_begin,
+   .write_end  = fuse_write_end,
.readpages  = fuse_readpages,
.set_page_dirty = fuse_set_page_dirty,
.bmap   = fuse_bmap,

-- 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 31/44] smb convert to new aops

2007-04-23 Thread Nick Piggin

Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/smbfs/file.c |   34 +-
 1 file changed, 25 insertions(+), 9 deletions(-)

Index: linux-2.6/fs/smbfs/file.c
===
--- linux-2.6.orig/fs/smbfs/file.c
+++ linux-2.6/fs/smbfs/file.c
@@ -290,29 +290,45 @@ out:
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
  */
-static int smb_prepare_write(struct file *file, struct page *page, 
-unsigned offset, unsigned to)
-{
+static int smb_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   *pagep = __grab_cache_page(mapping, index);
+   if (!*pagep)
+   return -ENOMEM;
return 0;
 }
 
-static int smb_commit_write(struct file *file, struct page *page,
-   unsigned offset, unsigned to)
+static int smb_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
int status;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
 
-   status = -EFAULT;
lock_kernel();
-   status = smb_updatepage(file, page, offset, to-offset);
+   status = smb_updatepage(file, page, offset, copied);
unlock_kernel();
+
+   if (!status) {
+   if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
+   SetPageUptodate(page);
+   status = copied;
+   }
+
+   unlock_page(page);
+   page_cache_release(page);
+
return status;
 }
 
 const struct address_space_operations smb_file_aops = {
.readpage = smb_readpage,
.writepage = smb_writepage,
-   .prepare_write = smb_prepare_write,
-   .commit_write = smb_commit_write
+   .write_begin = smb_write_begin,
+   .write_end = smb_write_end,
 };
 
 /* 


[patch 27/44] hpfs convert to new aops

2007-04-23 Thread Nick Piggin

Cc: [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/hpfs/file.c |   20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

Index: linux-2.6/fs/hpfs/file.c
===
--- linux-2.6.orig/fs/hpfs/file.c
+++ linux-2.6/fs/hpfs/file.c
@@ -86,25 +86,33 @@ static int hpfs_writepage(struct page *p
 {
return block_write_full_page(page,hpfs_get_block, wbc);
 }
+
 static int hpfs_readpage(struct file *file, struct page *page)
 {
return block_read_full_page(page,hpfs_get_block);
 }
-static int hpfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-   return cont_prepare_write(page,from,to,hpfs_get_block,
-   &hpfs_i(page->mapping->host)->mmu_private);
+
+static int hpfs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   hpfs_get_block,
+   &hpfs_i(mapping->host)->mmu_private);
 }
+
 static sector_t _hpfs_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,hpfs_get_block);
 }
+
 const struct address_space_operations hpfs_aops = {
.readpage = hpfs_readpage,
.writepage = hpfs_writepage,
.sync_page = block_sync_page,
-   .prepare_write = hpfs_prepare_write,
-   .commit_write = generic_commit_write,
+   .write_begin = hpfs_write_begin,
+   .write_end = generic_write_end,
.bmap = _hpfs_bmap
 };
 


[patch 11/44] fs: fix data-loss on error

2007-04-23 Thread Nick Piggin


New buffers against uptodate pages are simply marked uptodate, while the
buffer_new bit remains set. This causes error-case code to zero out parts
of those buffers because it thinks they contain stale data: wrong, they
are actually uptodate so this is a data loss situation.

Fix this by actually clearing buffer_new and marking the buffer dirty. It
makes sense to always clear buffer_new before setting a buffer uptodate.
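
The failure mode can be shown with a toy userspace model of the buffer state bits. This is a sketch only: struct buf_model, the BH_* values and the helper names are made up for illustration and are not the kernel's buffer_head API.

```c
#include <assert.h>
#include <string.h>

/* Illustrative state bits; the real ones live in the kernel's buffer_head. */
#define BH_UPTODATE 0x1u
#define BH_NEW      0x2u
#define BH_DIRTY    0x4u

struct buf_model {
    unsigned state;
    char data[8];
};

/* Error path: anything still flagged new is assumed stale and is zeroed. */
static void zero_new_buffer(struct buf_model *bh)
{
    if (bh->state & BH_NEW) {
        memset(bh->data, 0, sizeof(bh->data));
        bh->state &= ~BH_NEW;
    }
}

/* Old behaviour for a new buffer on an uptodate page: new stays set. */
static void map_new_buffer_old(struct buf_model *bh)
{
    bh->state |= BH_NEW;        /* block was freshly allocated */
    bh->state |= BH_UPTODATE;   /* page contents are already valid */
}

/* Fixed behaviour: clear new first, then mark uptodate and dirty. */
static void map_new_buffer_fixed(struct buf_model *bh)
{
    bh->state |= BH_NEW;        /* block was freshly allocated */
    bh->state &= ~BH_NEW;
    bh->state |= BH_UPTODATE | BH_DIRTY;
}
```

With the old sequence a later error-path call to zero_new_buffer() wipes valid data; with the fixed sequence the data survives and the buffer is queued for writeback.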

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/buffer.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1800,7 +1800,9 @@ static int __block_prepare_write(struct 
unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr);
if (PageUptodate(page)) {
+   clear_buffer_new(bh);
set_buffer_uptodate(bh);
+   mark_buffer_dirty(bh);
continue;
}
if (block_end > to || block_start < from) {


[patch 09/44] mm: fix pagecache write deadlocks

2007-04-23 Thread Nick Piggin


Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8 - recursive deadlock if page is same
2,8;2,8 - ABBA deadlock if page is different
2,6;b,f - ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
atomic usercopies which do not take pagefaults and do not zero the uncopied
tail of the destination. The destination is already uptodate, so we can
commit_write the full length even if there was a partial copy: it does not
matter that the tail was not modified, because if it is dirtied and written
back to disk it will not cause any problems (uptodate *means* that the
destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
information, because atomic usercopies cannot distinguish between
non-present pages in a readable mapping, from lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
made slightly more optimal), then allocate a temporary page to copy the
source data into. Relock the destination page and continue with the copy.
However, instead of a usercopy (which might take a fault), copy the data
from the pinned temporary page via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

This increases the CPU/memory copy cost by almost 50% on the affected
workloads. That will be solved by introducing a new set of pagecache write
aops in a subsequent patch.
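
The two cases in the solution above can be sketched as a userspace model. Everything here is an assumption made for illustration: struct page_model, atomic_copy_model() and the "resident" parameter (how many source bytes can be copied without faulting) are not kernel interfaces, and locking is reduced to a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_MODEL 16

struct page_model {
    bool uptodate;
    bool locked;
    bool dirty;
    char data[PAGE_SIZE_MODEL];
};

/*
 * Stand-in for an atomic usercopy: copies at most "resident" bytes,
 * never blocks, and leaves the uncopied tail of dst untouched.
 */
static size_t atomic_copy_model(char *dst, const char *src,
                                size_t len, size_t resident)
{
    size_t n = len < resident ? len : resident;

    memcpy(dst, src, n);
    return n;
}

/* Returns the number of bytes accepted from the source. */
static size_t buffered_write_model(struct page_model *page, size_t offset,
                                   const char *src, size_t len,
                                   size_t resident)
{
    char bounce[PAGE_SIZE_MODEL];
    size_t copied;

    assert(offset + len <= PAGE_SIZE_MODEL);

    if (page->uptodate) {
        /* Case 1: copy under the page lock, but only atomically. */
        page->locked = true;
        copied = atomic_copy_model(page->data + offset, src, len, resident);
        page->dirty = true;     /* committing the full range is harmless */
        page->locked = false;
        return copied;
    }

    /* Case 2: fill a temporary page first, with no page lock held. */
    assert(!page->locked);
    memcpy(bounce, src, len);   /* this copy is allowed to fault */
    page->locked = true;
    memcpy(page->data + offset, bounce, len);
    page->dirty = true;
    page->locked = false;
    return len;
}
```

The extra memcpy in case 2 is the additional copy cost mentioned above.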

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 include/linux/pagemap.h |   11 +++-
 mm/filemap.c|  114 
 2 files changed, 104 insertions(+), 21 deletions(-)

Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1933,11 +1933,12 @@ generic_file_buffered_write(struct kiocb
filemap_set_next_iovec(cur_iov, nr_segs, iov_offset, written);
 
do {
+   struct page *src_page;
struct page *page;
pgoff_t index;  /* Pagecache index for current page */
unsigned long offset;   /* Offset into pagecache page */
-   unsigned long maxlen;   /* Bytes remaining in current iovec */
-   size_t bytes;   /* Bytes to write to page */
+   unsigned long seglen;   /* Bytes remaining in current iovec */
+   unsigned long bytes;/* Bytes to write to page */
size_t copied;  /* Bytes copied from user */
 
buf = cur_iov->iov_base + iov_offset;
@@ -1947,20 +1948,30 @@ generic_file_buffered_write(struct kiocb
if (bytes > count)
bytes = count;
 
-   maxlen = cur_iov->iov_len - iov_offset;
-   if (maxlen > bytes)
-   maxlen = bytes;
+   /*
+* a non-NULL src_page indicates that we're doing the
+* copy via get_user_pages and kmap.
+*/
+   src_page = NULL;
+
+   seglen = cur_iov->iov_len - iov_offset;
+   if (seglen > bytes)
+   seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
/*
 * Bring in the user page that we will copy from _first_.
 * Otherwise there's a nasty deadlock on copying from the
 * same page as we're writing to, without it being marked
 * up-to-date.
+*
+* Not only is this an optimisation, but it is also required
+* to check that the address is actually valid, when atomic
+* usercopies are used, below.
 */
-   fault_in_pages_readable(buf, maxlen);
-#endif
-
+   if (unlikely(fault_in_pages_readable(buf, seglen))) {
+   status = -EFAULT;
+   break;
+   }
 
page = __grab_cache_page(mapping, index);
if (!page) {
@@ -1968,32 +1979,104 @@ generic_file_buffered_write(struct kiocb
break;
}
 
+   /*
+* non-uptodate pages

[patch 14/44] implement simple fs aops

2007-04-23 Thread Nick Piggin

Implement new aops for some of the simpler filesystems.

Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/configfs/inode.c   |4 ++--
 fs/hugetlbfs/inode.c  |   16 ++--
 fs/ramfs/file-mmu.c   |4 ++--
 fs/ramfs/file-nommu.c |4 ++--
 fs/sysfs/inode.c  |4 ++--
 mm/shmem.c|   35 ---
 6 files changed, 46 insertions(+), 21 deletions(-)

Index: linux-2.6/mm/shmem.c
===
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1109,7 +1109,7 @@ static int shmem_getpage(struct inode *i
 * Normally, filepage is NULL on entry, and either found
 * uptodate immediately, or allocated and zeroed, or read
 * in under swappage, which is then assigned to filepage.
-* But shmem_prepare_write passes in a locked filepage,
+* But shmem_write_begin passes in a locked filepage,
 * which may be found not uptodate by other callers too,
 * and may need to be copied from the swappage read in.
 */
@@ -1454,14 +1454,35 @@ static const struct inode_operations shm
 static const struct inode_operations shmem_symlink_inline_operations;
 
 /*
- * Normally tmpfs makes no use of shmem_prepare_write, but it
+ * Normally tmpfs makes no use of shmem_write_begin, but it
  * lets a tmpfs file be used read-write below the loop driver.
  */
 static int
-shmem_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+shmem_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   struct inode *inode = mapping->host;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   *pagep = NULL;
+   return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+}
+
+static int
+shmem_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   struct inode *inode = page->mapping->host;
-   return shmem_getpage(inode, page->index, &page, SGP_WRITE, NULL);
+   struct inode *inode = mapping->host;
+
+   set_page_dirty(page);
+   mark_page_accessed(page);
+   page_cache_release(page);
+
+   if (pos+copied > inode->i_size)
+   i_size_write(inode, pos+copied);
+
+   return copied;
 }
 
 static ssize_t
@@ -2358,8 +2379,8 @@ static const struct address_space_operat
.writepage  = shmem_writepage,
.set_page_dirty = __set_page_dirty_no_writeback,
 #ifdef CONFIG_TMPFS
-   .prepare_write  = shmem_prepare_write,
-   .commit_write   = simple_commit_write,
+   .write_begin= shmem_write_begin,
+   .write_end  = shmem_write_end,
 #endif
.migratepage= migrate_page,
 };
Index: linux-2.6/fs/configfs/inode.c
===
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -40,8 +40,8 @@ extern struct super_block * configfs_sb;
 
 static const struct address_space_operations configfs_aops = {
.readpage   = simple_readpage,
-   .prepare_write  = simple_prepare_write,
-   .commit_write   = simple_commit_write
+   .write_begin= simple_write_begin,
+   .write_end  = simple_write_end,
 };
 
 static struct backing_dev_info configfs_backing_dev_info = {
Index: linux-2.6/fs/sysfs/inode.c
===
--- linux-2.6.orig/fs/sysfs/inode.c
+++ linux-2.6/fs/sysfs/inode.c
@@ -20,8 +20,8 @@ extern struct super_block * sysfs_sb;
 
 static const struct address_space_operations sysfs_aops = {
.readpage   = simple_readpage,
-   .prepare_write  = simple_prepare_write,
-   .commit_write   = simple_commit_write
+   .write_begin= simple_write_begin,
+   .write_end  = simple_write_end,
 };
 
 static struct backing_dev_info sysfs_backing_dev_info = {
Index: linux-2.6/fs/ramfs/file-mmu.c
===
--- linux-2.6.orig/fs/ramfs/file-mmu.c
+++ linux-2.6/fs/ramfs/file-mmu.c
@@ -29,8 +29,8 @@
 
 const struct address_space_operations ramfs_aops = {
.readpage   = simple_readpage,
-   .prepare_write  = simple_prepare_write,
-   .commit_write   = simple_commit_write,
+   .write_begin= simple_write_begin,
+   .write_end  = simple_write_end,
.set_page_dirty = __set_page_dirty_no_writeback,
 };
 
Index: linux-2.6/fs/ramfs/file-nommu.c
===
--- linux-2.6.orig/fs/ramfs/file-nommu.c
+++ linux-2.6/fs/ramfs/file-nommu.c
@@ -29,8 +29,8 @@ static int ramfs_nommu_setattr(struct de
 
 const struct address_space_operations

[patch 12/44] fs: introduce write_begin, write_end, and perform_write aops

2007-04-23 Thread Nick Piggin

These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).
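
The old hooks took a page plus a byte range inside it (from, to), while the new ones take a file position and a length. A minimal userspace sketch of the conversion used throughout the converted filesystems follows; the 4 KiB page size and the helper names range_from_pos() and pos_from_range() are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 4 KiB pagecache pages. */
#define PAGE_CACHE_SHIFT 12
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)

/* Old-style arguments: a page index plus a byte range within that page. */
struct page_range {
    uint64_t index;     /* pagecache index of the page */
    unsigned from;      /* first byte within the page */
    unsigned to;        /* one past the last byte within the page */
};

/* Split a (pos, len) pair into the page-relative form. */
static struct page_range range_from_pos(int64_t pos, unsigned len)
{
    struct page_range r;

    r.index = (uint64_t)pos >> PAGE_CACHE_SHIFT;
    r.from = (unsigned)(pos & (PAGE_CACHE_SIZE - 1));
    r.to = r.from + len;
    /* This model assumes one call never crosses a page boundary. */
    assert(r.to <= PAGE_CACHE_SIZE);
    return r;
}

/* Inverse direction, as the directory code does for each entry. */
static int64_t pos_from_range(uint64_t index, unsigned from)
{
    return ((int64_t)index << PAGE_CACHE_SHIFT) + from;
}
```

For example, pos 4196 with len 32 corresponds to page index 1, from 100, to 132.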

Cc: Linux Memory Management [EMAIL PROTECTED]
Cc: Linux Filesystems linux-fsdevel@vger.kernel.org
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

API design contributions, code review and fixes. 

Signed-off-by: Mark Fasheh [EMAIL PROTECTED]

 Documentation/filesystems/Locking |9 -
 Documentation/filesystems/vfs.txt |   48 +++
 drivers/block/loop.c  |   77 
 fs/buffer.c   |  203 +++--
 fs/libfs.c|   44 +++
 fs/namei.c|   47 +--
 fs/splice.c   |   70 +--
 include/linux/buffer_head.h   |   10 +
 include/linux/fs.h|   28 
 include/linux/pagemap.h   |2 
 mm/filemap.c  |  233 ++
 11 files changed, 561 insertions(+), 210 deletions(-)

Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -391,6 +391,8 @@ enum positive_aop_returns {
AOP_TRUNCATED_PAGE  = 0x80001,
 };
 
+#define AOP_FLAG_UNINTERRUPTIBLE   0x0001 /* will not do a short write */
+
 /*
  * oh the beauties of C type declarations.
  */
@@ -451,6 +453,14 @@ struct address_space_operations {
 */
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+
+   int (*write_begin)(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata);
+   int (*write_end)(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata);
+
/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned long);
@@ -465,6 +475,18 @@ struct address_space_operations {
int (*launder_page) (struct page *);
 };
 
+/*
+ * pagecache_write_begin/pagecache_write_end must be used by general code
+ * to write into the pagecache.
+ */
+int pagecache_write_begin(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata);
+
+int pagecache_write_end(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata);
+
 struct backing_dev_info;
 struct address_space {
struct inode*host;  /* owner: inode, block_device */
@@ -1969,6 +1991,12 @@ extern int simple_prepare_write(struct f
unsigned offset, unsigned to);
 extern int simple_commit_write(struct file *file, struct page *page,
unsigned offset, unsigned to);
+extern int simple_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata);
+extern int simple_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata);
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct nameidata *);
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1950,6 +1950,93 @@ inline int generic_write_checks(struct f
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+int pagecache_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   const struct address_space_operations *aops = mapping->a_ops;
+
+   if (aops->write_begin) {
+   return aops->write_begin(file, mapping, pos, len, flags,
+   pagep, fsdata);
+   } else {
+   int ret;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+   struct inode *inode = mapping->host;
+   struct page *page;
+again:
+   page
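
For readers following the fallback branch above: it starts by splitting the byte position into a page index and an offset within that page. The sketch below reproduces that arithmetic in userspace C. The names demo_split, struct demo_range, DEMO_PAGE_SHIFT and DEMO_PAGE_SIZE are hypothetical and not part of the patch, and the 4 KiB page size is an assumption. The "to" field (offset + len) is only a guess at how a (pos, len) pair would map onto the (from, to) arguments of the older prepare_write hook, since the quoted patch is cut off before that point.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration only: 4 KiB pages. The kernel's
 * PAGE_CACHE_SHIFT depends on the architecture. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

/* Hypothetical helper type, not a kernel structure. */
struct demo_range {
    uint64_t index;   /* which page of the file */
    unsigned offset;  /* first byte within that page */
    unsigned to;      /* one past the last byte within that page */
};

/* Split a byte range into page index and in-page offsets, using the
 * same shift and mask as the fallback path in pagecache_write_begin. */
static struct demo_range demo_split(uint64_t pos, unsigned len)
{
    struct demo_range r;

    r.index = pos >> DEMO_PAGE_SHIFT;
    r.offset = (unsigned)(pos & (DEMO_PAGE_SIZE - 1));

    /* Assumption: a single call covers at most one page. */
    assert(len <= DEMO_PAGE_SIZE - r.offset);

    r.to = r.offset + len;
    return r;
}
```

With these assumed values, demo_split(5000, 96) lands in page 1 and covers bytes 904 to 1000 of that page.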

[patch 15/44] block_dev convert to new aops

2007-04-23 Thread Nick Piggin

Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

 fs/block_dev.c |   26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/block_dev.c
===
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -378,14 +378,26 @@ static int blkdev_readpage(struct file *
return block_read_full_page(page, blkdev_get_block);
 }
 
-static int blkdev_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
-{
-   return block_prepare_write(page, from, to, blkdev_get_block);
+static int blkdev_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+   blkdev_get_block);
 }
 
-static int blkdev_commit_write(struct file *file, struct page *page, unsigned from, unsigned to)
+static int blkdev_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   return block_commit_write(page, from, to);
+   int ret;
+   ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+   unlock_page(page);
+   page_cache_release(page);
+
+   return ret;
 }
 
 /*
@@ -1333,8 +1345,8 @@ const struct address_space_operations de
.readpage   = blkdev_readpage,
.writepage  = blkdev_writepage,
.sync_page  = block_sync_page,
-   .prepare_write  = blkdev_prepare_write,
-   .commit_write   = blkdev_commit_write,
+   .write_begin= blkdev_write_begin,
+   .write_end  = blkdev_write_end,
.writepages = generic_writepages,
.direct_IO  = blkdev_direct_IO,
 };
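
The conversion above suggests a pairing: write_begin hands back a page through pagep, and blkdev_write_end unlocks and releases that page once the data is in. A rough userspace sketch of that pairing follows. Everything in it (struct mock_page, mock_write_begin, mock_write_end, mock_write, MOCK_PAGE_SIZE) is a hypothetical stand-in rather than a kernel interface. It also assumes that write_begin returns the page locked with a reference held and that write_end returns the number of bytes copied, neither of which is stated in this mail.

```c
#include <assert.h>
#include <string.h>

#define MOCK_PAGE_SIZE 4096u  /* assumed page size, illustration only */

/* Hypothetical stand-in for struct page plus its data. */
struct mock_page {
    char data[MOCK_PAGE_SIZE];
    int locked;
    int refcount;
};

/* Find the page covering pos, take a reference, lock it and hand it
 * back through pagep (assumed contract, mirroring write_begin). */
static int mock_write_begin(struct mock_page *cache, unsigned long pos,
                            unsigned len, struct mock_page **pagep)
{
    struct mock_page *page = &cache[pos / MOCK_PAGE_SIZE];

    (void)len;
    page->refcount++;
    page->locked = 1;
    *pagep = page;
    return 0;
}

/* Unlock and drop the reference, as blkdev_write_end does with
 * unlock_page() and page_cache_release(). Returning the copied byte
 * count is an assumption. */
static int mock_write_end(struct mock_page *page, unsigned copied)
{
    page->locked = 0;
    page->refcount--;
    return (int)copied;
}

/* Caller side: begin, copy into the page, end. */
static int mock_write(struct mock_page *cache, unsigned long pos,
                      const char *buf, unsigned len)
{
    struct mock_page *page = NULL;
    unsigned offset = (unsigned)(pos % MOCK_PAGE_SIZE);
    int ret;

    assert(offset + len <= MOCK_PAGE_SIZE);  /* one page per call */

    ret = mock_write_begin(cache, pos, len, &page);
    if (ret)
        return ret;

    memcpy(page->data + offset, buf, len);
    return mock_write_end(page, len);
}
```

The point of the sketch is only that the caller no longer locks or releases the page itself: both happen inside the begin/end pair.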

