Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-12 Thread Suparna Bhattacharya
On Wed, Jul 11, 2007 at 10:03:12AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
> > Well, if you see the modes proposed using above flags :
> > 
> > #define FA_ALLOCATE 0
> > #define FA_DEALLOCATE   FA_FL_DEALLOC
> > #define FA_RESV_SPACE   FA_FL_KEEP_SIZE
> > #define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
> > 
> > FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
> > for preallocation, FA_ALLOCATE and FA_RESV_SPACE, which do not use this
> > flag. Hence preallocation will never delete data.
> > This mode is required only for FA_UNRESV_SPACE, which is a deallocation
> > mode, to support any existing XFS-aware applications/usage scenarios.
> 
> Sorry, but this doesn't make any sense.  There is no need to put every
> feature of the XFS ioctls into the syscalls.  The XFS ioctls will need to
> be supported forever anyway - as I suggested before they really should
> be moved to generic code.
> 
> What needs to be supported is what makes sense as an interface.
> A punch a hole interface does make sense, but trying to hack this into
> a preallocation system call is just madness.  We're not IRIX or Windows,
> which fit things into random subcalls just because there was some space
> left to squeeze them in.
> 
> > > > > FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data change) */
> > > > > FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data change) */
> > > 
> > > NACK to these as well.  If i_size changes, c/mtime need updates; if the size
> > > doesn't change, they don't.  No need to add more flags for this.
> > 
> > This requirement was from the point of view of HSM applications. Hope
> > you saw Andreas previous post and are keeping that in mind.
> 
> HSMs need this basically for every system call, which screams for an
> open flag like O_INVISIBLE anyway.  Adding this in a generic way is
> a good idea, but hacking bits and pieces that won't fit into the global
> design is completely wrong.


Why don't we just merge the interface for preallocation (essentially
enough to satisfy posix_fallocate() and the simple XFS requirement for
space reservation without changing file size), on which there is clear
agreement (I hope :)).  After all, this was all that we set out to do when
we started.

And leave all the dealloc/punch/HSM-type features for separate future
patches/debates; those really shouldn't hold up the basic fallocate
interface. I agree with Christoph that we are diverging too much by trying
to club those decisions in here.

Dave, Andreas, Ted ?

Regards
Suparna

> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: [PATCH] dio: remove bogus refcounting BUG_ON

2007-07-04 Thread Suparna Bhattacharya
On Wed, Jul 04, 2007 at 07:25:10PM -0700, Badari Pulavarty wrote:
> On Tue, 2007-07-03 at 15:28 -0700, Zach Brown wrote:
> > Linus, Andrew, please apply the bug fix patch at the end of this reply
> > for .22.
> > 
> > > >>One of our perf. team members ran into this while doing some runs.
> > > >>I didn't see anything obvious - it looks like we converted
> > > >>async IO to synchronous IO. I didn't spend much time digging
> > > >>around.
> > 
> > OK, I think this BUG_ON() is just broken.  I wasn't able to find any
> > obvious bugs from reading the code which would cause the BUG_ON() to
> > fire.  If it's reproducible I'd love to hear what the recipe is.
> > 
> > I did notice that this BUG_ON() is evaluating dio after having dropped
> > its ref :/.  So it's not completely absurd to fear that it's a race
> > with the dio's memory being reused, but that'd be a pretty tight race.
> > 
> > Let's remove this stupid BUG_ON and see if that test box still has
> > trouble.  It might just hit the valid BUG_ON a few lines down, but this
> > unsafe BUG_ON needs to go.
> 
> I went through the code multiple times and I can't find how we can trigger
> the BUG_ON(). But unfortunately, our perf. team is able to reproduce the
> problem. Debugging indicated that ret2 == 1 :(
> 
> Not sure how that can happen. Ideas ?

Does it trigger even if you avoid referencing dio in the BUG_ON(), i.e.
with something like ...


--- direct-io.c 2007-07-02 01:24:24.0 +0530
+++ direct-io-debug.c   2007-07-05 09:18:56.0 +0530
@@ -1104,9 +1104,10 @@ direct_io_worker(int rw, struct kiocb *i
 * decide to wake the submission path atomically.
 */
spin_lock_irqsave(&dio->bio_lock, flags);
+   is_async = dio->is_async;
ret2 = --dio->refcount;
spin_unlock_irqrestore(&dio->bio_lock, flags);
-   BUG_ON(!dio->is_async && ret2 != 0);
+   BUG_ON(!is_async && ret2 != 0);
if (ret2 == 0) {
ret = dio_complete(dio, offset, ret);
kfree(dio);

> 
> Thanks,
> Badari
> 
> > 
> > ---
> > 
> > dio: remove bogus refcounting BUG_ON
> > 
> > Badari Pulavarty reported a case of this BUG_ON is triggering during
> > testing.  It's completely bogus and should be removed.
> > 
> > It's trying to notice if we left references to the dio hanging around in
> > the sync case.  They should have been dropped as IO completed while this
> > path was in dio_await_completion().  This condition will also be
> > checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
> > lines lower.  So to start this BUG_ON() is redundant.
> > 
> > More fatally, it's dereferencing dio->is_async after having dropped its
> > reference.  It's only safe to dereference the dio after releasing the
> > lock if the final reference was just dropped.  Another CPU might free
> > the dio in bio completion and reuse the memory after this path drops the
> > dio lock but before the BUG_ON() is evaluated.
> > 
> > This patch passed aio+dio regression unit tests and aio-stress on ext3.
> > 
> > Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
> > Cc: Badari Pulavarty <[EMAIL PROTECTED]>
> > 
> > diff -r 509ce354ae1b fs/direct-io.c
> > --- a/fs/direct-io.cSun Jul 01 22:00:49 2007 +
> > +++ b/fs/direct-io.cTue Jul 03 14:56:41 2007 -0700
> > @@ -1106,7 +1106,7 @@ direct_io_worker(int rw, struct kiocb *i
> > spin_lock_irqsave(&dio->bio_lock, flags);
> > ret2 = --dio->refcount;
> > spin_unlock_irqrestore(&dio->bio_lock, flags);
> > -   BUG_ON(!dio->is_async && ret2 != 0);
> > +
> > if (ret2 == 0) {
> > ret = dio_complete(dio, offset, ret);
> > kfree(dio);
> 


Re: [PATCH 0/2] file capabilities: Introduction

2007-05-16 Thread Suparna Bhattacharya
On Mon, May 14, 2007 at 08:00:11PM +, Pavel Machek wrote:
> Hi!
> 
> > "Serge E. Hallyn" <[EMAIL PROTECTED]> wrote:
> > 
> > > Following are two patches which have been sitting for some time in -mm.
> > 
> > Where "some time" == "nearly six months".
> > 
> > We need help considering, reviewing and testing this code, please.
> 
> I did a quick scan, and it looks OK. Plus, it means we can finally start
> using that old capabilities subsystem... so I think we should do it.

FWIW, I looked through it recently as well, and it looked reasonable enough
to me, though I'm not a security expert. I did have a question about
testing corner cases etc., which Serge has tried to address.

Serge, are you planning to post an update without STRICTXATTR ? That should
simplify the second patch.

Regards
Suparna

> 
>   Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-11 Thread Suparna Bhattacharya
e proposed, so
that we don't have to make all the decisions right now. 
So it would be good to start with a *minimal* definition, even just one mode.
The rest could follow as subsequent patches, each being reviewed and debated
separately. Otherwise this discussion can drag on for a long time.

Regards
Suparna

> 
> Cheers,
> 
> Dave,
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group


Re: [PATCH 2/2] file capabilities: accomodate >32 bit capabilities

2007-05-10 Thread Suparna Bhattacharya
On Thu, May 10, 2007 at 01:01:27PM -0700, Andreas Dilger wrote:
> On May 08, 2007  16:49 -0500, Serge E. Hallyn wrote:
> > Quoting Andreas Dilger ([EMAIL PROTECTED]):
> > > One of the important use cases I can see today is the ability to
> > > split the heavily-overloaded e.g. CAP_SYS_ADMIN into much more fine
> > > grained attributes.
> > 
> > Sounds plausible, though it both makes capabilities far more cumbersome
> > (i.e. finding the right capability for what you wanted to do) and raises
> > backward compatibility concerns.  Perhaps at that point we should
> > introduce security.capabilityv2 xattrs.  A binary can then carry
> > security.capability=CAP_SYS_ADMIN=p, and
> > security.capabilityv2=cap_may_clone_mntns=p.
> 
> Well, the overhead of each EA is non-trivial (16 bytes/EA) for storing
> 12 bytes worth of data, so it is probably just better to keep extending
> the original capability fields as was in the proposal.
> 
> > > What we definitely do NOT want to happen is an application that needs
> > > privileged access (e.g. e2fsck, mount) to stop running because the
> > > new capabilities _would_ have been granted by the new kernel and are
> > > not by the old kernel and STRICTXATTR is used.
> > > 
> > > To me it would seem that having extra capabilities on an old kernel
> > > is relatively harmless if the old kernel doesn't know what they are.
> > > It's like having a key to a door that you don't know where it is.
> > 
> > If we ditch the STRICTXATTR option do the semantics seem sane to you?
> 
> Seems reasonable.

It would simplify the code as well, which is good.

This does mean no sanity checking of fcaps; I'm not sure if that matters.
I'm guessing it should be similar to the case for other security attributes.

Regards
Suparna

> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Suparna Bhattacharya
On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
> 
> > > This looks like it will have the same problem on s390 as
> > > sys_sync_file_range.  Maybe the prototype should be:
> > > 
> > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
> > 
> > Yes, but the trouble is that there was a contrary viewpoint preferring that fd
> > first be maintained as a convention like other syscalls (see the following
> > posts)
> 
> Of course the interface used by an application program would have the
> fd first.  Glibc can do the translation.

I think that was understood.

Regards
Suparna

> 
> Paul.



Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Suparna Bhattacharya
On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote:
> Andrew Morton writes:
> 
> > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> > 
> > > This patch implements the fallocate() system call and adds support for
> > > i386, x86_64 and powerpc.
> > > 
> > > ...
> > >
> > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> > 
> > Please add a comment over this function which specifies its behaviour. 
> > Really it should be enough material from which a full manpage can be
> > written.
> 
> This looks like it will have the same problem on s390 as
> sys_sync_file_range.  Maybe the prototype should be:
> 
> asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Yes, but the trouble is that there was a contrary viewpoint preferring that fd
first be maintained as a convention like other syscalls (see the following
posts)

http://marc.info/?l=linux-fsdevel&m=117585330016809&w=2 (Andreas)
http://marc.info/?l=linux-fsdevel&m=117690157917378&w=2  (Andreas)

http://marc.info/?l=linux-fsdevel&m=117578821827323&w=2 (Randy)

So we are kind of deadlocked, aren't we ?

The debates on the proposed solution for s390

http://marc.info/?l=linux-fsdevel&m=117760995610639&w=2  
http://marc.info/?l=linux-fsdevel&m=117708124913098&w=2 
http://marc.info/?l=linux-fsdevel&m=117767607229807&w=2

Are there any better ideas ?

Regards
Suparna

> 
> Paul.


Re: ChunkFS - measuring cross-chunk references

2007-04-25 Thread Suparna Bhattacharya
On Wed, Apr 25, 2007 at 05:50:55AM +0530, Karuna sagar K wrote:
> On 4/24/07, Theodore Tso <[EMAIL PROTECTED]> wrote:
> >On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:
> .
> >It would also be good to distinguish between directories referencing
> >files in another chunk, and directories referencing subdirectories in
> >another chunk (which would be simpler to handle, given the topological
> >restrictions on directories, as compared to files and hard links).
> >
> 
> Modified the tool to distinguish between
> 1. cross references between directories and files
> 2. cross references between directories and sub directories
> 3. cross references within a file (due to huge file size)

One more set of numbers to calculate would be an estimate of cross-references
across chunks of multiple block groups -- 1 (=128MB), 2 (=256MB), 4 (=512MB),
8 (=1GB) -- as suggested by Kalpak.

Once we have that, it would be nice if we can get data on results with
the tool from other people, especially with larger filesystem sizes.

Regards
Suparna

> 
> Below is the result from / partition of ext3 file system:
> 
> Number of files = 221794
> Number of directories = 24457
> Total size = 8193116 KB
> Total data stored = 7187392 KB
> Size of block groups = 131072 KB
> Number of inodes per block group = 16288
> No. of cross references between directories and sub-directories = 7791
> No. of cross references between directories and file = 657
> Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570)
> 
> Thanks for the suggestions.
> 
> >There may also be special things we will need to do to handle
> >scenarios such as BackupPC, where if it looks like a directory
> >contains a huge number of hard links to a particular chunk, we'll need
> >to make sure that directory is either created in the right chunk
> >(possibly with hints from the application) or migrated to the right
> >chunk (but this might cause the inode number of the directory to
> >change --- maybe we allow this as long as the directory has never been
> >stat'ed, so that the inode number has never been observed).
> >
> >The other thing which we should consider is that chunkfs really
> >requires a 64-bit inode number space, which means either we only allow
> >it on 64-bit systems, or we need to consider a migration so that even
> >on 32-bit platforms, stat() functions like stat64(), insofar that it
> >uses a stat structure which returns a 64-bit ino_t.
> >
> >   - Ted
> >
> 
> 
> Thanks,
> Karuna





Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Suparna Bhattacharya
On Mon, Apr 23, 2007 at 09:58:49PM +0530, Suparna Bhattacharya wrote:
> On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote:
> > 
> > This is an initial implementation of ChunkFS technique, briefly discussed
> > at: http://lwn.net/Articles/190222 and 
> > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> > 
> > This implementation is done within the ext2 driver. Every chunk is an 
> > independent ext2 file system. The knowledge about chunks is kept within 
> > ext2, and 'continuation inodes', which are used to allow files and 
> > directories to span multiple chunks, are managed within ext2.
> > 
> > At mount time, super blocks for all the chunks are created and linked with 
> > the global super_blocks list maintained by the VFS. This allows independent 
> > behavior of individual chunks and also helps writebacks to happen 
> > seamlessly.
> > 
> > Apart from this, chunkfs code in ext2 effectively only provides knowledge 
> > of:
> > 
> > - which block of which inode to look up, for a given file's logical 
> > block number
> > - in which chunk to allocate the next inode / block
> > - the number of inodes to scan when a directory is being read
> > 
> > To maintain ext2's inode number uniqueness property, the 8 msb bits of the 
> > inode number are used to indicate the chunk number in which it resides.
> > 
> > As said, this is a preliminary implementation and lots of changes are 
> > expected before this code is even sanely usable. Some known issues and 
> > obvious optimizations are listed in the TODO file in the chunkfs patch.
> > 
> > http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
> > - one big patch
> > - applies to 2.6.18
> 
> 
> Could you send this out as a patch against the ext2 codebase, so we can just
> look at the changes for chunkfs ? That might also make it small enough
> to inline your patch in email for review.

Sorry, I missed the part about ext2-chunkfs-diff below.

Regards
suparna

> 
> What kind of results are you planning to gather to evaluate/optimize this ?
> 
> Regards
> Suparna
> 
> > 
> > Attached - ext2-chunkfs-diff.patch.gz
> > - since the code is a spin-off of ext2, this patch explains better what
> >   has changed from the ext2.
> > 
> > git://cislinux.cis.ksu.edu/chunkfs-tools
> > - mkfs, and fsck for chunkfs.
> > 
> > http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
> > - config file used; tested mostly on UML with loopback file systems.
> > 
> > NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP 
> > should be "no" for clean compile.
> > 
> > 
> > Please comment, suggest, criticize. Patches most welcome.
> > 
> > 
> > Best,
> > AG
> > --
> > May the source be with you.
> > http://www.cis.ksu.edu/~gud
> 
> 
> 


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Suparna Bhattacharya
On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote:
> 
> This is an initial implementation of ChunkFS technique, briefly discussed
> at: http://lwn.net/Articles/190222 and 
> http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> 
> This implementation is done within the ext2 driver. Every chunk is an 
> independent ext2 file system. The knowledge about chunks is kept within 
> ext2, and 'continuation inodes', which are used to allow files and 
> directories to span multiple chunks, are managed within ext2.
> 
> At mount time, super blocks for all the chunks are created and linked with 
> the global super_blocks list maintained by the VFS. This allows independent 
> behavior of individual chunks and also helps writebacks to happen 
> seamlessly.
> 
> Apart from this, chunkfs code in ext2 effectively only provides knowledge 
> of:
> 
> - which block of which inode to look up, for a given file's logical 
> block number
> - in which chunk to allocate the next inode / block
> - the number of inodes to scan when a directory is being read
> 
> To maintain ext2's inode number uniqueness property, the 8 msb bits of the 
> inode number are used to indicate the chunk number in which it resides.
> 
> As said, this is a preliminary implementation and lots of changes are 
> expected before this code is even sanely usable. Some known issues and 
> obvious optimizations are listed in the TODO file in the chunkfs patch.
> 
> http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
> - one big patch
> - applies to 2.6.18


Could you send this out as a patch against the ext2 codebase, so we can just
look at the changes for chunkfs ? That might also make it small enough
to inline your patch in email for review.

What kind of results are you planning to gather to evaluate/optimize this ?

Regards
Suparna

> 
> Attached - ext2-chunkfs-diff.patch.gz
> - since the code is a spin-off of ext2, this patch explains better what
>   has changed from the ext2.
> 
> git://cislinux.cis.ksu.edu/chunkfs-tools
> - mkfs, and fsck for chunkfs.
> 
> http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
> - config file used; tested mostly on UML with loopback file systems.
> 
> NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP 
> should be "no" for clean compile.
> 
> 
> Please comment, suggest, criticize. Patches most welcome.
> 
> 
> Best,
> AG
> --
> May the source be with you.
> http://www.cis.ksu.edu/~gud





Re: [PATCH 4 of 8] Add flags to control direct IO helpers

2007-02-07 Thread Suparna Bhattacharya
On Wed, Feb 07, 2007 at 01:05:44PM -0500, Chris Mason wrote:
> On Wed, Feb 07, 2007 at 10:38:45PM +0530, Suparna Bhattacharya wrote:
> > > + * The flags parameter is a bitmask of:
> > > + *
> > > + * DIO_PLACEHOLDERS (use placeholder pages for locking)
> > > + * DIO_CREATE (pass create=1 to get_block for filling holes or extending)
> > 
> > A little more explanation about why these options are needed, and examples
> > of when one would specify each of these options would be good.
> 
> I'll extend the comments in the patch, but for discussion here:
> 
> DIO_PLACEHOLDERS:  placeholders are inserted into the page cache to
> synchronize the DIO with buffered writes.  From a locking point of view,
> this is similar to inserting and locking pages in the address space
> corresponding to the DIO.
> 
> placeholders guard against concurrent allocations and truncates during the 
> DIO.
> You don't need placeholders if truncates and allocations are are
> impossible (for example, on a block device).

Likewise, placeholders may not be needed if the underlying filesystem
already takes care of locking to synchronize DIO vs buffered I/O.

> 
> DIO_CREATE: placeholders make it possible for filesystems to safely fill
> holes and extend the file via get_block during the DIO.  If DIO_CREATE
> is turned on, get_block will be called with create=1, allowing the FS to
> allocate blocks during the DIO.

When would one NOT specify DIO_CREATE, and what are the implications ?
The purpose of having an option to NOT allow the FS to allocate blocks
during DIO is not very intuitive from the standpoint of the caller.
(The block device case could be an example, but there create=1 could not do
any harm or add extra overhead, so why bother ?)

Is there still a valid case where we fall back to buffered IO to fill holes ?
To me that seems to be the only situation where create=0 must be enforced.

> 
> DIO_DROP_I_MUTEX: If the write is inside of i_size, i_mutex is dropped
> during the DIO and taken again before returning.

Again an example of when one would not specify this (block device and
XFS ?) would be useful.

Regards
Suparna

> 
> -chris



Re: [PATCH 4 of 8] Add flags to control direct IO helpers

2007-02-07 Thread Suparna Bhattacharya
> + err = filemap_write_and_wait_range(mapping, offset, end - 1);
> + if (!err)
> + err = invalidate_inode_pages2_range(mapping,
> + offset >> PAGE_CACHE_SHIFT,
> + (end - 1) >> PAGE_CACHE_SHIFT);
> + if (!retval && err)
> + retval = err;
> + }
>   return retval;
>  }
> -EXPORT_SYMBOL(__blockdev_direct_IO);
> +EXPORT_SYMBOL(blockdev_direct_IO_flags);
> diff -r 1a7105ab9c19 -r 04dd7ddd593e include/linux/fs.h
> --- a/include/linux/fs.h  Tue Feb 06 20:02:55 2007 -0500
> +++ b/include/linux/fs.h  Tue Feb 06 20:02:56 2007 -0500
> @@ -1776,24 +1776,28 @@ static inline void do_generic_file_read(
>  }
> 
>  #ifdef CONFIG_BLOCK
> -ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
> +int check_dio_alignment(struct inode *inode, struct block_device *bdev,
> +	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
> +	unsigned *blkbits_ret, loff_t *end_ret);
> +
> +ssize_t blockdev_direct_IO_flags(int rw, struct kiocb *iocb, struct inode *inode,
>   struct block_device *bdev, const struct iovec *iov, loff_t offset,
>   unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
> - int lock_type);
> -
> -enum {
> - DIO_LOCKING = 1, /* need locking between buffered and direct access */
> - DIO_NO_LOCKING,  /* bdev; no locking at all between buffered/direct */
> - DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */
> -};
> + unsigned int dio_flags);
> +
> +#define DIO_PLACEHOLDERS (1 << 0)  /* insert placeholder pages */
> +#define DIO_CREATE   (1 << 1)  /* pass create=1 to get_block when writing */
> +#define DIO_DROP_I_MUTEX (1 << 2) /* drop i_mutex during writes */
> 
>  static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
>   struct inode *inode, struct block_device *bdev, const struct iovec *iov,
>   loff_t offset, unsigned long nr_segs, get_block_t get_block,
>   dio_iodone_t end_io)
>  {
> - return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> - nr_segs, get_block, end_io, DIO_LOCKING);
> + /* locking is on, FS wants to fill holes w/get_block */
> + return blockdev_direct_IO_flags(rw, iocb, inode, bdev, iov, offset,
> + nr_segs, get_block, end_io, DIO_PLACEHOLDERS |
> + DIO_CREATE | DIO_DROP_I_MUTEX);
>  }
> 
>  static inline ssize_t blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb,
> @@ -1801,17 +1805,9 @@ static inline ssize_t blockdev_direct_IO
>   loff_t offset, unsigned long nr_segs, get_block_t get_block,
>   dio_iodone_t end_io)
>  {
> - return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> - nr_segs, get_block, end_io, DIO_NO_LOCKING);
> -}
> -
> -static inline ssize_t blockdev_direct_IO_own_locking(int rw, struct kiocb *iocb,
> - struct inode *inode, struct block_device *bdev, const struct iovec *iov,
> - loff_t offset, unsigned long nr_segs, get_block_t get_block,
> - dio_iodone_t end_io)
> -{
> - return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> - nr_segs, get_block, end_io, DIO_OWN_LOCKING);
> + /* locking is off, create is off */
> + return blockdev_direct_IO_flags(rw, iocb, inode, bdev, iov, offset,
> + nr_segs, get_block, end_io, 0);
>  }
>  #endif
> 
> diff -r 1a7105ab9c19 -r 04dd7ddd593e mm/filemap.c
> --- a/mm/filemap.cTue Feb 06 20:02:55 2007 -0500
> +++ b/mm/filemap.cTue Feb 06 20:02:56 2007 -0500
> @@ -40,7 +40,7 @@
> 
>  #include 
> 
> -static ssize_t
> +static inline ssize_t
>  generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
>   loff_t offset, unsigned long nr_segs);
> 
> @@ -2817,46 +2817,12 @@ EXPORT_SYMBOL(generic_file_aio_write);
>   * Called under i_mutex for writes to S_ISREG files.  Returns -EIO if
>   * something went wrong during pagecache shootdown.
>   */
> -static ssize_t
> +static inline ssize_t
>  generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
>   loff_t offset, unsigned long nr_segs)
>  {
> - struct file *file = iocb->ki_filp;
> - struct address_space *mapping = file->f_mapping;
> - ssize_t retval;
> - size_t write_len = 0;
> -
> - /*
> -  * If it's a write, unmap all mmappings of the file

Re: Testing ext4 persistent preallocation patches for 64 bit features

2007-02-07 Thread Suparna Bhattacharya
 when we detect linearly ascending requests.
>*/
>   __u32   last_alloc_physical_block;
> + __u32   goal_block_group;
>  };
> 
>  #define rsv_start rsv_window._rsv_start
> diff -puN fs/ext3/balloc.c~ext3_set_alloc_blk_group_hack fs/ext3/balloc.c
> --- linux-2.6.16/fs/ext3/balloc.c~ext3_set_alloc_blk_group_hack	2006-03-28 15:45:30.0 -0800
> +++ linux-2.6.16-ming/fs/ext3/balloc.c	2006-03-28 16:03:55.770850040 -0800
> @@ -285,6 +285,7 @@ void ext3_init_block_alloc_info(struct i
>   rsv->rsv_alloc_hit = 0;
>   block_i->last_alloc_logical_block = 0;
>   block_i->last_alloc_physical_block = 0;
> + block_i->goal_block_group = 0;
>   }
>   ei->i_block_alloc_info = block_i;
>  }
> @@ -1263,15 +1264,20 @@ unsigned long ext3_new_blocks(handle_t *
>   *errp = -ENOSPC;
>   goto out;
>   }
> -
> - /*
> -  * First, test whether the goal block is free.
> -  */
> - if (goal < le32_to_cpu(es->s_first_data_block) ||
> - goal >= le32_to_cpu(es->s_blocks_count))
> - goal = le32_to_cpu(es->s_first_data_block);
> - group_no = (goal - le32_to_cpu(es->s_first_data_block)) /
> - EXT3_BLOCKS_PER_GROUP(sb);
> + if (block_i->goal_block_group) {
> + group_no = block_i->goal_block_group;
> + goal = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
> + group_no * EXT3_BLOCKS_PER_GROUP(sb);
> + block_i->goal_block_group = 0;
> + } else {
> + /*
> +  * First, test whether the goal block is free.
> +  */
> + if (goal < le32_to_cpu(es->s_first_data_block) ||
> + goal >= le32_to_cpu(es->s_blocks_count))
> + goal = le32_to_cpu(es->s_first_data_block);
> + group_no = (goal - le32_to_cpu(es->s_first_data_block)) /
> + EXT3_BLOCKS_PER_GROUP(sb);
> + }
>   gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
>   if (!gdp)
>   goto io_error;
> 
> _
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/9] buffered write deadlock fix

2007-02-02 Thread Suparna Bhattacharya
On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > The following set of patches attempt to fix the buffered write
> > locking problems (and there are a couple of peripheral patches
> > and cleanups there too).
> > 
> > Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> > be an easier diff with the fsaio patches gone, but the readahead
> > rewrite clashes badly :(
> 
> Well fsaio is restored, but there's now considerable doubt over it due to
> the recent febril febrility.

I think Ingo made a point earlier about letting the old co-exist with the
new. Fibrils + kevents have great potential for a next generation
solution but we need to give the whole story some time to play out and prove
it in practice, debate and benchmark the alternative combinations, optimize it
for various workloads, etc.  It will also take more work before we
can get the whole POSIX AIO implementation supported on top of this. I'll be
very happy when that happens ... it is just that it is still too early to
be sure.

Since this is going to be a new interface, not the existing linux AIO
interface, I do not see any conflict between the two. Samba4 already uses
fsaio, and we now have the ability to do POSIX AIO over kernel AIO (which
depends on fsaio). The more we delay real world usage the longer we take
to learn about the application patterns that matter. And it is those
patterns that are key.

> 
> How bad is the clash with the readahead patches?
> 
> Clashes with git-block are likely, too.
> 
> Bugfixes come first, so I will drop readahead and fsaio and git-block to get
> this work completed if needed - please work against mainline.

If you need help with fixing the clashes, please let me know.

Regards
Suparna



Re: [take33 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).

2007-01-18 Thread Suparna Bhattacharya
On Wed, Jan 17, 2007 at 05:39:51PM +0300, Evgeniy Polyakov wrote:
> On Wed, Jan 17, 2007 at 07:21:42PM +0530, Suparna Bhattacharya ([EMAIL PROTECTED]) wrote:
> >
> > Since you are implementing new APIs here, have you considered doing an
> > aio_sendfilev to be able to send a header with the data ?
> 
> It is doable, but why people do not like corking?
> With Linux less than microsecond syscall overhead it is better and more
> flexible solution, doesn't it?

That is what I used to think as well. However ...

The problem as I understand it now is not about bunching data together, but
about ensuring some sort of atomicity between the header and the data when
there can be multiple outstanding aio requests on the same socket - i.e.
ensuring strict ordering, with no other data slipping in between, when the
data to be sent is not already in cache and another sendfile or aio write
request comes in for the same socket in the meantime - and all this without
having to lock the socket while reading data from disk.

There are alternate ways to address this, aio_sendfilev is one of the options
I have heard people requesting.

Regards
Suparna

> 
> I'm not saying - 'no, there will not be any *v variants', just getting
> more info.
> 
> > Regards
> > Suparna
> 
> --
>   Evgeniy Polyakov



Re: [take33 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).

2007-01-17 Thread Suparna Bhattacharya
> +static int kaio_add_kevent(int fd, struct kaio_req *req)
> +{
> + struct ukevent uk;
> + struct file *file;
> + struct kevent_user *u;
> + int err, need_fput = 0;
> + u32 *cnt;
> +
> + file = fget_light(fd, &need_fput);
> + if (!file) {
> + err = -EBADF;
> + goto err_out;
> + }
> +
> + if (file->f_op != &kevent_user_fops) {
> + err = -EINVAL;
> + goto err_out_fput;
> + }
> +
> + u = file->private_data;
> +
> + memset(&uk, 0, sizeof(struct ukevent));
> +
> + uk.event = KEVENT_MASK_ALL;
> + uk.type = KEVENT_AIO;
> +
> + preempt_disable();
> + uk.id.raw[0] = smp_processor_id();
> + cnt = &__get_cpu_var(kaio_req_counter);
> + uk.id.raw[1] = *cnt;
> + *cnt = *cnt + 1;
> + preempt_enable();
> +
> + uk.req_flags = KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE;
> + uk.ptr = req;
> +
> + err = kevent_user_add_ukevent(&uk, u);
> + if (err)
> + goto err_out_fput;
> +
> + kevent_user_get(u);
> +
> + fput_light(file, need_fput);
> +
> + return 0;
> +
> +err_out_fput:
> + fput_light(file, need_fput);
> +err_out:
> + return err;
> +}
> +
> +static void kaio_destructor(struct kaio_req *req)
> +{
> + struct kaio_private *priv = req->priv;
> + struct socket *sock = priv->dptr;
> + struct file *file = priv->sptr;
> +
> + fput(file);
> + sockfd_put(sock);
> +
> + kevent_storage_ready(&priv->kevent_user->st, NULL, KEVENT_MASK_ALL);
> + kevent_user_put(priv->kevent_user);
> +
> + kmem_cache_free(kaio_priv_cache, req->priv);
> +}
> +
> +static struct kaio_req *kaio_sendfile(int kevent_fd, int sock_fd, struct 
> file *file, off_t offset, size_t count)
> +{
> + struct kaio_req *req;
> + struct socket *sock;
> + struct kaio_private *priv;
> + int err;
> +
> + sock = sockfd_lookup(sock_fd, &err);
> + if (!sock)
> + goto err_out_exit;
> +
> + priv = kmem_cache_alloc(kaio_priv_cache, GFP_KERNEL);
> + if (!priv)
> + goto err_out_sput;
> +
> + priv->sptr = file;
> + priv->dptr = sock;
> + priv->offset = offset;
> + priv->count = min_t(u64, i_size_read(file->f_dentry->d_inode), count);
> + priv->processed = offset;
> + priv->limit = 128;
> +
> + req = kaio_add_call(NULL, &kaio_read_send_pages, -1, GFP_KERNEL);
> + if (!req)
> + goto err_out_free;
> +
> + req->destructor = kaio_destructor;
> + req->priv = priv;
> +
> + err = kaio_add_kevent(kevent_fd, req);
> +
> + dprintk("%s: req: %p, priv: %p, call: %p [%p], offset: %Lu, processed: 
> %Lu, count: %Lu, err: %d.\n",
> + __func__, req, priv, &kaio_read_send_pages,
> + kaio_read_send_pages, priv->offset, priv->processed, 
> priv->count, err);
> +
> + if (err)
> + goto err_out_remove;
> +
> + kaio_schedule_req(req);
> +
> + return req;
> +
> +err_out_remove:
> + /* It is safe to just free the object since it is guaranteed that it 
> was not
> +  * queued for processing.
> +  */
> + kmem_cache_free(kaio_req_cache, req);
> +err_out_free:
> + kmem_cache_free(kaio_priv_cache, priv);
> +err_out_sput:
> + sockfd_put(sock);
> +err_out_exit:
> + return NULL;
> +
> +}
> +
> +asmlinkage long sys_aio_sendfile(int kevent_fd, int sock_fd, int in_fd, 
> off_t offset, size_t count)
> +{
> + struct kaio_req *req;
> + struct file *file;
> + int err;
> +
> + file = fget(in_fd);
> + if (!file) {
> + err = -EBADF;
> + goto err_out_exit;
> + }
> +
> + req = kaio_sendfile(kevent_fd, sock_fd, file, offset, count);
> + if (!req) {
> + err = -EINVAL;
> + goto err_out_fput;
> + }
> +
> + return (long)req;
> +
> +err_out_fput:
> + fput(file);
> +err_out_exit:
> + return err;
> +}
> +
> +asmlinkage long sys_aio_sendfile_path(int kevent_fd, int sock_fd, char 
> __user *filename, off_t offset, size_t count)
> +{
> + char *tmp = getname(filename);
> + int fd = PTR_ERR(tmp);
> + int flags = O_RDONLY, err;
> + struct nameidata nd;
> + struct file *file;
> + struct kaio_req *req;
> +
> + if (force_o_largefile())
> +     flags = O_LARGEFILE;
> +
> + if (IS_ERR(tmp)) {
> +   

Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

2007-01-10 Thread Suparna Bhattacharya
On Wed, Jan 10, 2007 at 05:08:29PM -0800, Andrew Morton wrote:
> On Wed, 10 Jan 2007 11:14:19 +0530
> Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> > > On Thu, 4 Jan 2007 10:26:21 +0530
> > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > >
> > > Patches against next -mm would be appreciated, please.  Sorry about that.
> >
> > I have updated the patchset against 2620-rc3-mm1, incorporated various
> > cleanups suggested during last review. Please let me know if I have missed
> > anything:
> 
> The s/lock_page_slow/lock_page_blocking/ got lost.  I redid it.

I thought the lock_page_blocking was an alternative you had suggested
to the __lock_page vs lock_page_async discussion which got resolved later.
That is why I didn't make the change in this patchset.
The call does not block in the async case, hence the choice of
the _slow suffix (like in fs/buffer.c). But if lock_page_blocking()
sounds more intuitive to you, that's OK. 

> 
> For the record, patches-via-http are very painful.  Please always always
> email them.
> 
> As a result, these patches ended up with titles which are derived from their
> filenames, which are cryptic.

Sorry about that - I wanted to ask if you'd prefer my resending them to the
list, but missed doing so. Some people have found it easier to download the
series as a whole when they intend to apply it, so I ended up maintaining it
that way all this while.

Regards
Suparna



Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

2007-01-09 Thread Suparna Bhattacharya
On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> On Thu, 4 Jan 2007 10:26:21 +0530
> Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> 
> > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> 
> Patches against next -mm would be appreciated, please.  Sorry about that.

I have updated the patchset against 2620-rc3-mm1, incorporated various
cleanups suggested during last review. Please let me know if I have missed
anything:

It should show up at
www.kernel.org:/pub/linux/kernel/people/suparna/aio/2620-rc3-mm1

Brief changelog:
- Reworked against the block layer unplug changes 
- Switched from defines to inlines for init_wait_bit* etc (per akpm)
- Better naming:  __lock_page to lock_page_async (per hch, npiggin)
- Kill lock_page_slow wrapper and rename __lock_page_slow to lock_page_slow
  (per hch)
- Use a helper function aio_restarted() (per hch)
- Replace combined if/assignment (per hch)
- fix resetting of current->io_wait after ->retry in aio_run_iocb (per zab)

I have run my usual aio-stress variations script
(www.kernel.org:/pub/linux/kernel/people/suparna/aio/aio-results.sh)

Regards
Suparna




Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

2007-01-05 Thread Suparna Bhattacharya
On Fri, Jan 05, 2007 at 08:02:33AM +0100, Jens Axboe wrote:
> On Fri, Jan 05 2007, Suparna Bhattacharya wrote:
> > On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> > > On Thu, 4 Jan 2007 10:26:21 +0530
> > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > > > > On Thu, 28 Dec 2006 13:53:08 +0530
> > > > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > This patchset implements changes to make filesystem AIO read
> > > > > > and write asynchronous for the non O_DIRECT case.
> > > > >
> > > > > Unfortunately the unplugging changes in Jen's block tree have trashed 
> > > > > these
> > > > > patches to a degree that I'm not confident in my repair attempts.  So 
> > > > > I'll
> > > > > drop the fasio patches from -mm.
> > > >
> > > > I took a quick look and the conflicts seem pretty minor to me, the 
> > > > unplugging
> > > > changes mostly touch nearby code.
> > >
> > > Well...  the conflicts (both mechanical and conceptual) are such that a
> > > round of retesting is needed.
> > >
> > > > Please let know how you want this fixed up.
> > > >
> > > > From what I can tell the comments in the unplug patches seem to say that
> > > > it needs more work and testing, so perhaps a separate fixup patch may be
> > > > a better idea rather than make the fsaio patchset dependent on this.
> > >
> > > Patches against next -mm would be appreciated, please.  Sorry about that.
> > >
> I _assume_ Jens is targeting 2.6.21?
> >
> > When is the next -mm likely to be out ?
> >
> > I was considering regenerating the blk unplug patches against the
> > fsaio changes instead of the other way around, if Jens were willing to
> > accept that. But if the next -mm is just around the corner then it's
> > not an issue.
> 
> I don't really care much, but I work against mainline and anything but
> occasional one-off generations of a patch against a different base is
> not very likely.
> 
> The -mm order should just reflect the merge order of the patches, what
> is the fsaio target?

2.6.21 was what I had in mind, to enable the glibc folks to proceed with
conversion to native AIO.

Regenerating my patches against the unplug stuff is not a problem, I only
worry about being queued up behind something that may take longer to
stabilize and is likely to change ... If that is not the case, I don't
mind.

Regards
Suparna

> 
> --
> Jens Axboe
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [EMAIL PROTECTED]  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: [EMAIL PROTECTED]



Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

2007-01-04 Thread Suparna Bhattacharya
On Thu, Jan 04, 2007 at 09:02:42AM -0800, Andrew Morton wrote:
> On Thu, 4 Jan 2007 10:26:21 +0530
> Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> 
> > On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> > > On Thu, 28 Dec 2006 13:53:08 +0530
> > > Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> > >
> > > > This patchset implements changes to make filesystem AIO read
> > > > and write asynchronous for the non O_DIRECT case.
> > >
> > > Unfortunately the unplugging changes in Jen's block tree have trashed 
> > > these
> > > patches to a degree that I'm not confident in my repair attempts.  So I'll
> > > drop the fasio patches from -mm.
> >
> > I took a quick look and the conflicts seem pretty minor to me, the 
> > unplugging
> > changes mostly touch nearby code.
> 
> Well...  the conflicts (both mechanical and conceptual) are such that a
> round of retesting is needed.
> 
> > Please let know how you want this fixed up.
> >
> > From what I can tell the comments in the unplug patches seem to say that
> > it needs more work and testing, so perhaps a separate fixup patch may be
> > a better idea rather than make the fsaio patchset dependent on this.
> 
> Patches against next -mm would be appreciated, please.  Sorry about that.
> 
> I _assume_ Jens is targeting 2.6.21?

When is the next -mm likely to be out ? 

I was considering regenerating the blk unplug patches against the
fsaio changes instead of the other way around, if Jens were willing to accept
that. But if the next -mm is just around the corner then it's not an issue.

Regards
Suparna




[PATCHSET 4][PATCH 1/1] AIO fallback for pipes, sockets and pollable fds

2007-01-04 Thread Suparna Bhattacharya
   msg->msg_iov = (struct iovec *)iov;
msg->msg_iovlen = nr_segs;
-   msg->msg_flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
+   msg->msg_flags = ((file->f_flags & O_NONBLOCK) || !is_sync_kiocb(iocb))
+   ? MSG_DONTWAIT : 0;
 
return __sock_recvmsg(iocb, sock, msg, size, msg->msg_flags);
 }
@@ -741,7 +742,8 @@ static ssize_t do_sock_write(struct msgh
msg->msg_controllen = 0;
msg->msg_iov = (struct iovec *)iov;
msg->msg_iovlen = nr_segs;
-   msg->msg_flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
+   msg->msg_flags = ((file->f_flags & O_NONBLOCK) || !is_sync_kiocb(iocb))
+   ? MSG_DONTWAIT : 0;
if (sock->type == SOCK_SEQPACKET)
msg->msg_flags |= MSG_EOR;
 
diff -puN fs/pipe.c~aio-fallback-nonblock fs/pipe.c
--- linux-2.6.20-rc1/fs/pipe.c~aio-fallback-nonblock2007-01-03 
19:16:36.0 +0530
+++ linux-2.6.20-rc1-root/fs/pipe.c 2007-01-03 19:16:36.0 +0530
@@ -226,14 +226,16 @@ pipe_read(struct kiocb *iocb, const stru
struct pipe_inode_info *pipe;
int do_wakeup;
ssize_t ret;
-   struct iovec *iov = (struct iovec *)_iov;
+   struct iovec iov_array[nr_segs];
+   struct iovec *iov = iov_array;
size_t total_len;
 
-   total_len = iov_length(iov, nr_segs);
+   total_len = iov_length(_iov, nr_segs);
/* Null read succeeds. */
if (unlikely(total_len == 0))
return 0;
 
+   memcpy(iov, _iov, nr_segs * sizeof(struct iovec));
do_wakeup = 0;
ret = 0;
mutex_lock(&inode->i_mutex);
@@ -302,7 +304,8 @@ redo:
 */
if (ret)
break;
-   if (filp->f_flags & O_NONBLOCK) {
+   if (filp->f_flags & O_NONBLOCK ||
+   !is_sync_kiocb(iocb)) {
ret = -EAGAIN;
break;
}
@@ -339,15 +342,17 @@ pipe_write(struct kiocb *iocb, const str
struct pipe_inode_info *pipe;
ssize_t ret;
int do_wakeup;
-   struct iovec *iov = (struct iovec *)_iov;
+   struct iovec iov_array[nr_segs];
+   struct iovec *iov = iov_array;
size_t total_len;
ssize_t chars;
 
-   total_len = iov_length(iov, nr_segs);
+   total_len = iov_length(_iov, nr_segs);
/* Null write succeeds. */
if (unlikely(total_len == 0))
return 0;
 
+   memcpy(iov, _iov, nr_segs * sizeof(struct iovec));
do_wakeup = 0;
ret = 0;
mutex_lock(&inode->i_mutex);
@@ -473,7 +478,7 @@ redo2:
}
if (bufs < PIPE_BUFFERS)
continue;
-   if (filp->f_flags & O_NONBLOCK) {
+   if (filp->f_flags & O_NONBLOCK || !is_sync_kiocb(iocb)) {
if (!ret)
ret = -EAGAIN;
break;
_



Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

2007-01-04 Thread Suparna Bhattacharya
On Thu, Jan 04, 2007 at 05:50:11PM +1100, Nick Piggin wrote:
> Suparna Bhattacharya wrote:
> >On Thu, Jan 04, 2007 at 04:51:58PM +1100, Nick Piggin wrote:
> 
> >>So long as AIO threads do the same, there would be no problem (plugging
> >>is optional, of course).
> >
> >
> >Yup, the AIO threads run the same code as for regular IO, i.e in the rare
> >situations where they actually end up submitting IO, so there should
> >be no problem. And you have already added plug/unplug at the appropriate
> >places in those paths, so things should just work.
> 
> Yes I think it should.
> 
> >>This (is supposed to) give a number of improvements over the traditional
> >>plugging (although some downsides too). Most notably for me, the VM gets
> >>cleaner ;)
> >>
> >>However AIO could be an interesting case to test for explicit plugging
> >>because of the way they interact. What kind of improvements do you see
> >>with samba and do you have any benchmark setups?
> >
> >
> >I think aio-stress would be a good way to test/benchmark this sort of
> >stuff, at least for a start.
> >Samba (if I understand this correctly based on my discussions with Tridge)
> >is less likely to generate the kind of io patterns that could benefit from
> >explicit plugging (because the file server has no way to tell what the next
> >request is going to be, it ends up submitting each independently instead of
> >batching iocbs).
> 
> OK, but I think that after IO submission, you do not run sync_page to
> unplug the block device, like the normal IO path would (via lock_page,
> before the explicit plug patches).

In the buffered AIO case, we do run sync_page like normal IO ... just
that we don't block in io_schedule(), everything else is pretty much 
similar.

In the case of AIO-DIO, the path is like the just like non-AIO DIO, there
is a call to blk_run_address_space() after submission.

> 
> However, with explicit plugging, AIO requests will be started immediately.
> Maybe this won't be noticable if the device is always busy, but I would
> like to know there isn't a regression.
> 
> >In future there may be optimization possibilities to consider when
> >submitting batches of iocbs, i.e. on the io submission path. Maybe
> >AIO - O_DIRECT would be interesting to play with first in this regard?
> 
> Well I've got some simple per-process batching in there now, each process
> has a list of pending requests. Request merging is done locklessly against
> the last request added; and submission at unplug time is batched under a
> single block device lock.
> 
> I'm sure more merging or batching could be done, but also consider that
> most programs will not ever make use of any added complexity.

I guess I didn't express myself well - by batching I meant being able to
surround submission of a batch of iocbs with explicit plug/unplug instead
of explicit plug/unplug for each iocb separately. Of course there is no
easy way to do that, since at the io_submit() level we do not know about
the block device (each iocb could be directed to a different fd and not
just block devices). So it may not be worth thinking about.

> 
> Regarding your patches, I've just had a quick look and have a question --
> what do you do about blocking in page reclaim and dirty balancing? Aren't
> those major points of blocking with buffered IO? Did your test cases
> dirty enough to start writeout or cause a lot of reclaim? (admittedly,
> blocking in reclaim will now be much less common since the dirty mapping
> accounting).

In my earlier versions of patches I actually had converted these waits to
be async retriable, but then I came to the conclusion that the additional
complexity wasn't worth it. For one it didn't seem to make a difference 
compared to the other bigger cases, and I was looking primarily at handling
the gross blocking points (say to enable an application to keep device queues
busy) and not making everything asynchronous; for another we had a long
discussion thread way back about not making AIO submitters exempt from
throttling or memory availability waits.

Regards
Suparna

> 
> --
> SUSE Labs, Novell Inc.
> Send instant messages to your online friends http://au.messenger.yahoo.com



Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

2007-01-03 Thread Suparna Bhattacharya
On Thu, Jan 04, 2007 at 04:51:58PM +1100, Nick Piggin wrote:
> Suparna Bhattacharya wrote:
> >On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> 
> >>Plus Jens's unplugging changes add more reliance upon context inside
> >>*current, for the plugging and unplugging operations.  I expect that the
> >>fsaio patches will need to be aware of the protocol which those proposed
> >>changes add.
> >
> >
> >Whatever logic applies to background writeout etc should also just apply
> >as is to aio worker threads, shouldn't it ? At least at a quick glance I
> >don't see anything special that needs to be done for fsaio, but its good
> >to be aware of this anyway, thanks !
> 
> The submitting process plugs itself, submits all its IO, then unplugs
> itself (ie. so the plug is now on the process, rather than the block
> device).
> 
> So long as AIO threads do the same, there would be no problem (plugging
> is optional, of course).

Yup, the AIO threads run the same code as for regular IO, i.e in the rare
situations where they actually end up submitting IO, so there should
be no problem. And you have already added plug/unplug at the appropriate
places in those paths, so things should just work. 

> 
> This (is supposed to) give a number of improvements over the traditional
> plugging (although some downsides too). Most notably for me, the VM gets
> cleaner ;)
> 
> However AIO could be an interesting case to test for explicit plugging
> because of the way they interact. What kind of improvements do you see
> with samba and do you have any benchmark setups?

I think aio-stress would be a good way to test/benchmark this sort of
stuff, at least for a start. 
Samba (if I understand this correctly based on my discussions with Tridge)
is less likely to generate the kind of io patterns that could benefit from
explicit plugging (because the file server has no way to tell what the next
request is going to be, it ends up submitting each independently instead of
batching iocbs).

In future there may be optimization possibilities to consider when
submitting batches of iocbs, i.e. on the io submission path. Maybe
AIO - O_DIRECT would be interesting to play with first in this regard? 

Regards
Suparna

> 
> Thanks,
> Nick
> 
> --
> SUSE Labs, Novell Inc.



Re: [PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

2007-01-03 Thread Suparna Bhattacharya
On Wed, Jan 03, 2007 at 02:15:56PM -0800, Andrew Morton wrote:
> On Thu, 28 Dec 2006 13:53:08 +0530
> Suparna Bhattacharya <[EMAIL PROTECTED]> wrote:
> 
> > This patchset implements changes to make filesystem AIO read
> > and write asynchronous for the non O_DIRECT case.
> 
> Unfortunately the unplugging changes in Jen's block tree have trashed these
> patches to a degree that I'm not confident in my repair attempts.  So I'll
> drop the fasio patches from -mm.

I took a quick look and the conflicts seem pretty minor to me, the unplugging
changes mostly touch nearby code. Please let know how you want this fixed
up.

From what I can tell the comments in the unplug patches seem to say that
it needs more work and testing, so perhaps a separate fixup patch may be
a better idea rather than make the fsaio patchset dependent on this.

> 
> Zach's observations regarding this code's reliance upon things at *current
> sounded pretty serious, so I expect we'd be seeing changes for that anyway?

Not really, at least nothing that I can see needing a change.
As I mentioned there is no reliance on *current in the code that
runs in the aio threads that we need to worry about. 

The generic_write_checks etc that Zach was referring to all happens in the
context of submitting process, not in retry context. The model is to perform
all validation at the time of io submission. And of course things like
copy_to_user() are already taken care of by use_mm().

Lets look at it this way - the kernel already has the ability to do 
background writeout on behalf of a task from a kernel thread and likewise
read(ahead) pages that may be consumed by another task. There is also the
ability to operate on another task's address space (as used by ptrace).

So there is nothing groundbreaking here.

In fact on most occasions, all the IO is initiated in the context of the
submitting task, so the aio threads mainly deal with checking for completion
and transferring completed data to user space.

> 
> Plus Jens's unplugging changes add more reliance upon context inside
> *current, for the plugging and unplugging operations.  I expect that the
> fsaio patches will need to be aware of the protocol which those proposed
> changes add.

Whatever logic applies to background writeout etc should also just apply
as is to aio worker threads, shouldn't it ? At least at a quick glance I
don't see anything special that needs to be done for fsaio, but its good
to be aware of this anyway, thanks !

Regards
Suparna



[PATCHSET 2][PATCH 1/1] Combining epoll and disk file AIO

2007-01-02 Thread Suparna Bhattacharya
On Wed, Dec 27, 2006 at 09:08:56PM +0530, Suparna Bhattacharya wrote:
> (2) Most of these other applications need the ability to process both
> network events (epoll) and disk file AIO in the same loop. With POSIX AIO
> they could at least sort of do this using signals (yeah, and all 
> associated
> issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> modifications from Jeff Moyer and me) addresses this problem for native
> linux aio in a simple manner. Tridge has written a test harness to
> try out the Samba4 event library modifications to use this. Jeff Moyer
> has a modified version of pipetest for comparison.
> 


Enable epoll wait to be unified with io_getevents

From: Zach Brown, Jeff Moyer, Suparna Bhattacharya

Previously there have been (complicated and scary) attempts to funnel
individual aio events down epoll or vice versa.  This instead lets one
issue an entire sys_epoll_wait() as an aio op.  You'd set up epoll as
usual and then issue epoll_wait aio ops which would complete once epoll
events had been copied. This will enable a single io_getevents() event
loop to process both disk AIO and epoll notifications.

From an application standpoint, a typical flow works like this:
- Use epoll_ctl as usual to add/remove epoll registrations
- Instead of issuing sys_epoll_wait, set up an iocb using
  io_prep_epoll_wait (see examples below) specifying the epoll
  events buffer to fill up with epoll notifications. Submit the iocb
  using io_submit
- Now io_getevents can be used to wait for both epoll waits and
  disk aio completion. If the returned AIO event is of type
  IO_CMD_EPOLL_WAIT, then the corresponding result value indicates the
  number of epoll notifications in the iocb's event buffer, which
  can now be processed just as one would process results from a
  sys_epoll_wait()

There are a couple of sample applications:
- Andrew Tridgell has implemented a little test harness using an aio events
  library implementation intended for samba4 
  http://samba.org/~tridge/etest
  (The -e aio option uses aio epoll wait and can issue disk aio as well)
- An updated version of pipetest from Jeff Moyer has a --aio-epoll option
  http://people.redhat.com/jmoyer/aio/epoll/pipetest.c

There is obviously a little overhead compared to using sys_epoll_wait(), due
to the extra step of submitting the epoll wait iocb, most noticeable when
there are very few events processed per loop. However, the goal here is not
to build an epoll alternative but merely to allow network and disk I/O to
be processed in the same event loop, which is where the efficiencies really
come from. Picking up more epoll events in each loop can amortize the
overhead across many operations to mitigate the impact.
  
Thanks to Arjan van de Ven for helping figure out how to resolve the
lockdep complaints. Both ctx->lock and ep->lock can be held in certain
wait queue callback routines, thus being nested inside q->lock. However, this
excludes ctx->wait or ep->wq wait queues, which can safely be nested
inside ctx->lock or ep->lock respectively. So we teach lockdep to recognize
these as distinct classes.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
Signed-off-by: Jeff Moyer <[EMAIL PROTECTED]>
Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>

---

 linux-2.6.20-rc1-root/fs/aio.c  |   54 +
 linux-2.6.20-rc1-root/fs/eventpoll.c|   95 +---
 linux-2.6.20-rc1-root/include/linux/aio.h   |2 
 linux-2.6.20-rc1-root/include/linux/aio_abi.h   |1 
 linux-2.6.20-rc1-root/include/linux/eventpoll.h |   31 +++
 linux-2.6.20-rc1-root/include/linux/sched.h |2 
 linux-2.6.20-rc1-root/kernel/timer.c|   21 +
 7 files changed, 196 insertions(+), 10 deletions(-)

diff -puN fs/aio.c~aio-epoll-wait fs/aio.c
--- linux-2.6.20-rc1/fs/aio.c~aio-epoll-wait2006-12-28 14:22:52.0 
+0530
+++ linux-2.6.20-rc1-root/fs/aio.c  2007-01-03 11:45:40.0 +0530
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -193,6 +194,8 @@ static int aio_setup_ring(struct kioctx 
kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
 } while(0)
 
+static struct lock_class_key ioctx_wait_queue_head_lock_key;
+
 /* ioctx_alloc
  * Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
@@ -224,6 +227,8 @@ static struct kioctx *ioctx_alloc(unsign
spin_lock_init(&ctx->ctx_lock);
spin_lock_init(&ctx->ring_info.ring_lock);
init_waitqueue_head(&ctx->wait);
+   /* Teach lockdep to recognize this lock as a different class */
+   lockdep_set_class(&ctx->wait.lock, &ioctx_wait_queue_head_lock_key);
 
INIT_LIST_HEAD(&ctx->active_reqs);
INIT_LIST_HEAD(&ctx->run_list);
@@ -1401,6 +1406,42 @@

Re: [RFC] Heads up on a series of AIO patchsets

2007-01-02 Thread Suparna Bhattacharya
On Tue, Jan 02, 2007 at 03:56:09PM -0800, Zach Brown wrote:
> Sorry for the delay, I'm finally back from the holiday break :)

Welcome back !

> 
> >(1) The filesystem AIO patchset, attempts to address one part of
> >the problem
> >which is to make regular file IO, (without O_DIRECT)
> >asynchronous (mainly
> >the case of reads of uncached or partially cached files, and
> >O_SYNC writes).
> 
> One of the properties of the currently implemented EIOCBRETRY aio
> path is that ->mm is the only field in current which matches the
> submitting task_struct while inside the retry path.

Yes, and that, as I guess you know, is to enable the aio worker thread to
operate on the caller's address space for copy_from/to_user.

The actual io setup and associated checks are expected to have been
handled at submission time.

> 
> It looks like a retry-based aio write path would be broken because of
> this.  generic_write_checks() could run in the aio thread and get its
> task_struct instead of that of the submitter.  The wrong rlimit will
> be tested and SIGXFSZ won't be raised.  remove_suid() could check the
> capabilities of the aio thread instead of those of the submitter.

The generic_write_checks() are done in the submission path, not repeated during
retries, so such checks are not intended to run in the aio thread.

Did I miss something here ?

> 
> I don't think EIOCBRETRY is the way to go because of this increased
> (and subtle!) complexity.  What are the chances that we would have
> ever found those bugs outside code review?  How do we make sure that
> current references don't sneak back in after having initially audited
> the paths?

The EIOCBRETRY route is not something that is intended to be used blindly.
It is just one alternative to implement an aio operation by splitting up
responsibility between the submitter and aio threads, where aio threads
can run in the caller's address space.

> 
> Take the io_cmd_epoll_wait patch..
> 
> >issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach
> >Brown with
> >modifications from Jeff Moyer and me) addresses this problem
> >for native
> >linux aio in a simple manner.
> 
> It's simple looking, sure.  This current-flipping didn't even occur
> to me while throwing the patch together!
> 
> But that patch ends up calling ->poll (and poll_table->qproc) and
> writing to userspace (so potentially calling ->nopage) from the aio

Yes of course, but why is that a problem ?
The copy_from/to_user/put_user constructs are designed to handle soft failures,
and we are already using the caller's ->mm. Do you see a need for any
additional asserts() ?

If there is something that is needed by ->nopage etc which is not abstracted
out within the ->mm, then we would need to fix that instead, for correctness
anyway, isn't that so ?

Now it is possible that there are minor blocking points in the code and the
effect of these would be to hold up / delay subsequent queued aio operations;
which is an efficiency issue, but not a correctness concern.

> threads.  Are we sure that none of them will behave surprisingly
> because current changed under them?

My take is that we should fix the problems that we see. It is likely that
what manifests relatively more easily with AIO is also a subtle problem
in other cases.

> 
> It might be safe now, but that isn't really the point.  I'd rather we
> didn't have yet one more subtle invariant to audit and maintain.
> 
> At the risk of making myself vulnerable to the charge of mentioning
> vapourware, I will admit that I've been working on a (slightly mad)
> implementation of async syscalls.  I've been quiet about it because I
> don't want to whip up complicated discussion without being able to
> show code that works, even if barely.  I mention it now only to make
> it clear that I want to be constructive, not just critical :).

That is great and I look forward to it :) I am, however, assuming that
whatever implementation you come up with will have a different interface
from the current linux aio -- i.e. a next-generation aio model that will be
easily integratable with kevents etc.

Which takes me back to Ingo's point - let's have the new evolve in parallel
with the old, if we can, and not hold up the patches for POSIX AIO to
start using kernel AIO, or for epoll to integrate with AIO.

OK, I just took a quick look at your blog and I see that you
are basically implementing Linus' microthreads scheduling approach -
one year since we had that discussion. Glad to see that you found a way
to make it workable ... (I'm guessing that you are copying over the part
of the stack in use at the time of every switch, is that correct ? At what
point do you do the allo

Re: [FSAIO][PATCH 7/8] Filesystem AIO read

2006-12-28 Thread Suparna Bhattacharya
On Thu, Dec 28, 2006 at 11:57:47AM +, Christoph Hellwig wrote:
> > +   if (in_aio()) {
> > +   /* Avoid repeat readahead */
> > +   if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
> > +   next_index = last_index;
> > +   }
> 
> Every place we use kiocbTryRestart in this and the next patch it's in
> this form, so we should add a little helper for it:
> 
> int aio_try_restart(void)
> {
>   wait_queue_t *wq = current->io_wait;
> 
>   if (!is_sync_wait(wq) && kiocbTryRestart(io_wait_to_kiocb(wq)))
>   return 1;
>   return 0;
> }

Yes, we can do that -- how about aio_restarted() as an alternate name ?

> 
> with a big kerneldoc comment explaining this idiom (and possibly a better
> name for the function ;-))
> 
> > +
> > +   if ((error = __lock_page(page, current->io_wait))) {
> > +   goto readpage_error;
> > +   }
> 
> This should  be
> 
>   error = __lock_page(page, current->io_wait);
>   if (error)
>   goto readpage_error;
> 
> Plus possible naming updates discussed in the last mail.  Also do we
> really need to pass current->io_wait here?  Isn't the waitqueue in
> the kiocb always guaranteed to be the same?  Now that all pagecache

We don't have the kiocb available to this routine. Using current->io_wait
avoids the need to pass the iocb down to deeper levels just for the sync vs
async checks, also allowing such routines to be shared by other code which
does not use iocbs (e.g. generic_file_sendfile->do_generic_file_read
->do_generic_mapping_read) without having to set up dummy iocbs.

Does that clarify ? We could abstract this away within a lock page wrapper,
but I don't know if that makes a difference.

> I/O goes through the ->aio_read/->aio_write routines I'd prefer to
> get rid of the task_struct field cludges and pass all this around in
> the kiocb.

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: [FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page

2006-12-28 Thread Suparna Bhattacharya
On Thu, Dec 28, 2006 at 11:55:10AM +, Christoph Hellwig wrote:
> On Thu, Dec 28, 2006 at 02:11:49PM +0530, Suparna Bhattacharya wrote:
> > -extern void FASTCALL(lock_page_slow(struct page *page));
> > +extern int FASTCALL(__lock_page_slow(struct page *page, wait_queue_t 
> > *wait));
> >  extern void FASTCALL(__lock_page_nosync(struct page *page));
> >  extern void FASTCALL(unlock_page(struct page *page));
> >
> >  /*
> >   * lock_page may only be called if we have the page's inode pinned.
> >   */
> > -static inline void lock_page(struct page *page)
> > +static inline int __lock_page(struct page *page, wait_queue_t *wait)
> >  {
> > might_sleep();
> > if (TestSetPageLocked(page))
> > -   lock_page_slow(page);
> > +   return __lock_page_slow(page, wait);
> > +   return 0;
> >  }
> >
> > +#define lock_page(page)__lock_page(page, &current->__wait.wait)
> > +#define lock_page_slow(page)   __lock_page_slow(page, &current->__wait.wait)
> 
> Can we please simply kill your lock_page_slow wrapper and rename the
> arguments taking __lock_page_slow to lock_page_slow?  All too many
> variants of the locking functions aren't all that useful and there's
> very few users.

OK.

> 
> Similarly I don't really think __lock_page is an all that useful name here.
> What about lock_page_wq?  or aio_lock_page to denote it has special

I am really bad with names :(  I tried using the _wq suffixes earlier and
that seemed confusing to some, but if no one else objects I'm happy to use
that. I thought aio_lock_page() might be misleading because it is
synchronous if a regular wait queue entry is passed in, but again it may not
be too bad.

What's your preference ? Does anything more intuitive come to mind ?

> meaning in aio context?  Then again because of these special semantics
> we need a bunch of really verbose kerneldoc comments for this function
> family.

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: [FSAIO][PATCH 1/6] Add a wait queue parameter to the wait_bit action routine

2006-12-28 Thread Suparna Bhattacharya
Sorry this should have read [PATCH 1/8] instead of [PATCH 1/6]

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



[FSAIO][PATCH 8/8] AIO O_SYNC filesystem write

2006-12-28 Thread Suparna Bhattacharya

AIO support for O_SYNC buffered writes, built over O_SYNC-speedup.
It uses the tagged radix tree lookups to write out just the pages
pertaining to this request, and retries instead of blocking
for writeback to complete on the same range. All the writeout is
issued at the time of io submission, and there is a check to make
sure that retries skip straight to the wait_on_page_writeback_range.

Limitations: Extending file writes or hole overwrites with O_SYNC may
still block because we have yet to convert generic_osync_inode to be
asynchronous. For non O_SYNC writes, writeout happens in the background
and so typically appears async to the caller except for memory throttling
and non-block aligned writes involving read-modify-write.

Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---

 include/linux/aio.h  |0 
 linux-2.6.20-rc1-root/include/linux/fs.h |   13 +-
 linux-2.6.20-rc1-root/mm/filemap.c   |   61 +--
 3 files changed, 54 insertions(+), 20 deletions(-)

diff -puN include/linux/aio.h~aio-fs-write include/linux/aio.h
diff -puN mm/filemap.c~aio-fs-write mm/filemap.c
--- linux-2.6.20-rc1/mm/filemap.c~aio-fs-write  2006-12-21 08:46:21.0 
+0530
+++ linux-2.6.20-rc1-root/mm/filemap.c  2006-12-21 08:46:21.0 +0530
@@ -239,10 +239,11 @@ EXPORT_SYMBOL(filemap_flush);
  * @end:   ending page index
  *
  * Wait for writeback to complete against pages indexed by start->end
- * inclusive
+ * inclusive. In AIO context, this may queue an async notification
+ * and retry callback and return, instead of blocking the caller.
  */
-int wait_on_page_writeback_range(struct address_space *mapping,
-   pgoff_t start, pgoff_t end)
+int __wait_on_page_writeback_range(struct address_space *mapping,
+   pgoff_t start, pgoff_t end, wait_queue_t *wait)
 {
struct pagevec pvec;
int nr_pages;
@@ -254,20 +255,20 @@ int wait_on_page_writeback_range(struct 
 
pagevec_init(&pvec, 0);
index = start;
-   while ((index <= end) &&
+   while (!ret && (index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_WRITEBACK,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
unsigned i;
 
-   for (i = 0; i < nr_pages; i++) {
+   for (i = 0; !ret && (i < nr_pages); i++) {
struct page *page = pvec.pages[i];
 
/* until radix tree lookup accepts end_index */
if (page->index > end)
continue;
 
-   wait_on_page_writeback(page);
+   ret = __wait_on_page_writeback(page, wait);
if (PageError(page))
ret = -EIO;
}
@@ -303,18 +304,27 @@ int sync_page_range(struct inode *inode,
 {
pgoff_t start = pos >> PAGE_CACHE_SHIFT;
pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-   int ret;
+   int ret = 0;
 
if (!mapping_cap_writeback_dirty(mapping) || !count)
return 0;
+   if (in_aio()) {
+   /* Already issued writeouts for this iocb ? */
+   if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
+   goto do_wait; /* just need to check if done */
+   }
ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
-   if (ret == 0) {
+
+   if (ret >= 0) {
mutex_lock(&inode->i_mutex);
ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
mutex_unlock(&inode->i_mutex);
}
-   if (ret == 0)
-   ret = wait_on_page_writeback_range(mapping, start, end);
+do_wait:
+   if (ret >= 0) {
+   ret = __wait_on_page_writeback_range(mapping, start, end,
+   current->io_wait);
+   }
return ret;
 }
 EXPORT_SYMBOL(sync_page_range);
@@ -335,15 +345,23 @@ int sync_page_range_nolock(struct inode 
 {
pgoff_t start = pos >> PAGE_CACHE_SHIFT;
pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-   int ret;
+   int ret = 0;
 
if (!mapping_cap_writeback_dirty(mapping) || !count)
return 0;
+   if (in_aio()) {
+   /* Already issued writeouts for this iocb ? */
+   if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
+   goto do_wait; /* just need to check if done */
+   }
ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
-   if (ret == 0)
+   if (ret >= 0)
ret = generic_osync_inode(inode, mapping,

[FSAIO][PATCH 7/8] Filesystem AIO read

2006-12-28 Thread Suparna Bhattacharya

Converts the wait for a page to become uptodate (lock page)
after readahead/readpage (in do_generic_mapping_read) to a retry
exit, to make buffered filesystem AIO reads actually asynchronous.

The patch avoids exclusive wakeups with AIO, a problem originally
spotted by Chris Mason, though the reasoning for why it is an
issue is now much clearer (see explanation in the comment below
in aio.c), and the solution is perhaps slightly simpler.

Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---

 linux-2.6.20-rc1-root/fs/aio.c|   11 ++-
 linux-2.6.20-rc1-root/include/linux/aio.h |5 +
 linux-2.6.20-rc1-root/mm/filemap.c|   19 ---
 3 files changed, 31 insertions(+), 4 deletions(-)

diff -puN fs/aio.c~aio-fs-read fs/aio.c
--- linux-2.6.20-rc1/fs/aio.c~aio-fs-read   2006-12-21 08:46:13.0 
+0530
+++ linux-2.6.20-rc1-root/fs/aio.c  2006-12-28 09:26:30.0 +0530
@@ -1529,7 +1529,16 @@ static int aio_wake_function(wait_queue_
 
list_del_init(&wait->task_list);
kick_iocb(iocb);
-   return 1;
+   /*
+* Avoid exclusive wakeups with retries since an exclusive wakeup
+* may involve implicit expectations of waking up the next waiter
+* and there is no guarantee that the retry will take a path that
+* would do so. For example if a page has become up-to-date, then
+* a retried read may end up straightaway performing a copyout
+* and not go through a lock_page - unlock_page that would have
+* passed the baton to the next waiter.
+*/
+   return 0;
 }
 
 int fastcall io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
diff -puN mm/filemap.c~aio-fs-read mm/filemap.c
--- linux-2.6.20-rc1/mm/filemap.c~aio-fs-read   2006-12-21 08:46:13.0 
+0530
+++ linux-2.6.20-rc1-root/mm/filemap.c  2006-12-28 09:31:48.0 +0530
@@ -909,6 +909,11 @@ void do_generic_mapping_read(struct addr
if (!isize)
goto out;
 
+   if (in_aio()) {
+   /* Avoid repeat readahead */
+   if (kiocbTryRestart(io_wait_to_kiocb(current->io_wait)))
+   next_index = last_index;
+   }
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
for (;;) {
struct page *page;
@@ -978,7 +983,10 @@ page_ok:
 
 page_not_up_to_date:
/* Get exclusive access to the page ... */
-   lock_page(page);
+
+   if ((error = __lock_page(page, current->io_wait))) {
+   goto readpage_error;
+   }
 
/* Did it get truncated before we got the lock? */
if (!page->mapping) {
@@ -1006,7 +1014,8 @@ readpage:
}
 
if (!PageUptodate(page)) {
-   lock_page(page);
+   if ((error = __lock_page(page, current->io_wait)))
+   goto readpage_error;
if (!PageUptodate(page)) {
if (page->mapping == NULL) {
/*
@@ -1052,7 +1061,11 @@ readpage:
goto page_ok;
 
 readpage_error:
-   /* UHHUH! A synchronous read error occurred. Report it */
+   /* We don't have uptodate data in the page yet */
+   /* Could be due to an error or because we need to
+* retry when we get an async i/o notification.
+* Report the reason.
+*/
desc->error = error;
page_cache_release(page);
goto out;
diff -puN include/linux/aio.h~aio-fs-read include/linux/aio.h
--- linux-2.6.20-rc1/include/linux/aio.h~aio-fs-read2006-12-21 
08:46:13.0 +0530
+++ linux-2.6.20-rc1-root/include/linux/aio.h   2006-12-28 09:26:27.0 
+0530
@@ -34,21 +34,26 @@ struct kioctx;
 /* #define KIF_LOCKED  0 */
 #define KIF_KICKED 1
 #define KIF_CANCELLED  2
+#define KIF_RESTARTED  3
 
 #define kiocbTryLock(iocb) test_and_set_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbTryKick(iocb) test_and_set_bit(KIF_KICKED, &(iocb)->ki_flags)
+#define kiocbTryRestart(iocb)  test_and_set_bit(KIF_RESTARTED, 
&(iocb)->ki_flags)
 
 #define kiocbSetLocked(iocb)   set_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbSetKicked(iocb)   set_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbSetCancelled(iocb)set_bit(KIF_CANCELLED, 
&(iocb)->ki_flags)
+#define kiocbSetRestarted(iocb)set_bit(KIF_RESTARTED, 
&(iocb)->ki_flags)
 
 #define kiocbClearLocked(iocb) clear_bit(KIF_LOCKED, &(iocb)->ki_flags)
 #define kiocbClearKicked(iocb) clear_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbClearCancelled(iocb)  clear_bit(KIF_CANCELLED, 
&am

[FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page

2006-12-28 Thread Suparna Bhattacharya

Define low-level page wait and lock page routines which take a
wait queue entry pointer as an additional parameter and
return status (which may be non-zero when the wait queue
parameter signifies an asynchronous wait, typically during
AIO).

Synchronous IO waits become a special case where the wait
queue parameter is the running task's default io wait context.
Asynchronous IO waits happen when the wait queue parameter
is the io wait context of a kiocb. Code paths which choose to
execute synchronous or asynchronous behaviour depending on the
calling context specify the current io wait context (which points
to the sync or async context as the case may be) as the wait
parameter.

Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---

 linux-2.6.20-rc1-root/include/linux/pagemap.h |   30 ++---
 linux-2.6.20-rc1-root/include/linux/sched.h   |1 
 linux-2.6.20-rc1-root/kernel/sched.c  |   14 +++
 linux-2.6.20-rc1-root/mm/filemap.c|   31 +++---
 4 files changed, 55 insertions(+), 21 deletions(-)

diff -puN include/linux/pagemap.h~aio-wait-page include/linux/pagemap.h
--- linux-2.6.20-rc1/include/linux/pagemap.h~aio-wait-page  2006-12-21 
08:46:02.0 +0530
+++ linux-2.6.20-rc1-root/include/linux/pagemap.h   2006-12-21 
08:46:02.0 +0530
@@ -133,20 +133,24 @@ static inline pgoff_t linear_page_index(
return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 }
 
-extern void FASTCALL(lock_page_slow(struct page *page));
+extern int FASTCALL(__lock_page_slow(struct page *page, wait_queue_t *wait));
 extern void FASTCALL(__lock_page_nosync(struct page *page));
 extern void FASTCALL(unlock_page(struct page *page));
 
 /*
  * lock_page may only be called if we have the page's inode pinned.
  */
-static inline void lock_page(struct page *page)
+static inline int __lock_page(struct page *page, wait_queue_t *wait)
 {
might_sleep();
if (TestSetPageLocked(page))
-   lock_page_slow(page);
+   return __lock_page_slow(page, wait);
+   return 0;
 }
 
+#define lock_page(page)__lock_page(page, &current->__wait.wait)
+#define lock_page_slow(page)   __lock_page_slow(page, &current->__wait.wait)
+
 /*
  * lock_page_nosync should only be used if we can't pin the page's inode.
  * Doesn't play quite so well with block device plugging.
@@ -162,7 +166,8 @@ static inline void lock_page_nosync(stru
  * This is exported only for wait_on_page_locked/wait_on_page_writeback.
  * Never use this directly!
  */
-extern void FASTCALL(wait_on_page_bit(struct page *page, int bit_nr));
+extern int FASTCALL(wait_on_page_bit(struct page *page, int bit_nr,
+   wait_queue_t *wait));
 
 /* 
  * Wait for a page to be unlocked.
@@ -171,21 +176,30 @@ extern void FASTCALL(wait_on_page_bit(st
  * ie with increased "page->count" so that the page won't
  * go away during the wait..
  */
-static inline void wait_on_page_locked(struct page *page)
+static inline int __wait_on_page_locked(struct page *page, wait_queue_t *wait)
 {
if (PageLocked(page))
-   wait_on_page_bit(page, PG_locked);
+   return wait_on_page_bit(page, PG_locked, wait);
+   return 0;
 }
 
+#define wait_on_page_locked(page) \
+   __wait_on_page_locked(page, &current->__wait.wait)
+
 /* 
  * Wait for a page to complete writeback
  */
-static inline void wait_on_page_writeback(struct page *page)
+static inline int __wait_on_page_writeback(struct page *page,
+   wait_queue_t *wait)
 {
if (PageWriteback(page))
-   wait_on_page_bit(page, PG_writeback);
+   return wait_on_page_bit(page, PG_writeback, wait);
+   return 0;
 }
 
+#define wait_on_page_writeback(page) \
+   __wait_on_page_writeback(page, &current->__wait.wait)
+
 extern void end_page_writeback(struct page *page);
 
 /*
diff -puN include/linux/sched.h~aio-wait-page include/linux/sched.h
--- linux-2.6.20-rc1/include/linux/sched.h~aio-wait-page2006-12-21 
08:46:02.0 +0530
+++ linux-2.6.20-rc1-root/include/linux/sched.h 2006-12-28 09:26:27.0 
+0530
@@ -216,6 +216,7 @@ extern void show_stack(struct task_struc
 
 void io_schedule(void);
 long io_schedule_timeout(long timeout);
+int io_wait_schedule(wait_queue_t *wait);
 
 extern void cpu_init (void);
 extern void trap_init(void);
diff -puN kernel/sched.c~aio-wait-page kernel/sched.c
--- linux-2.6.20-rc1/kernel/sched.c~aio-wait-page   2006-12-21 
08:46:02.0 +0530
+++ linux-2.6.20-rc1-root/kernel/sched.c2006-12-21 08:46:02.0 
+0530
@@ -4743,6 +4743,20 @@ long __sched io_schedule_timeout(long ti
return ret;
 }
 
+/*
+ * Sleep only if the wait context passed is not async,
+ * otherwise return so that a retry can be issued later.
+ */
+int __sch

[FSAIO][PATCH 5/8] Enable wait bit based filtered wakeups to work for AIO

2006-12-28 Thread Suparna Bhattacharya

Enable wait bit based filtered wakeups to work for AIO.
Replaces the wait queue entry in the kiocb with a wait bit
structure, to allow enough space for the wait bit key.
This adds an extra level of indirection in references to the
wait queue entry in the iocb. Also, an extra check had to be
added in aio_wake_function to allow for other kinds of waiters
which do not require wait bit filtering, based on the assumption that
the key passed in would be NULL in such cases.

Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

---

 linux-2.6.20-rc1-root/fs/aio.c|   21 ++---
 linux-2.6.20-rc1-root/include/linux/aio.h |7 ---
 linux-2.6.20-rc1-root/kernel/wait.c   |   17 ++---
 3 files changed, 32 insertions(+), 13 deletions(-)

diff -puN fs/aio.c~aio-wait-bit fs/aio.c
--- linux-2.6.20-rc1/fs/aio.c~aio-wait-bit  2006-12-21 08:45:57.0 
+0530
+++ linux-2.6.20-rc1-root/fs/aio.c  2006-12-28 09:32:27.0 +0530
@@ -719,13 +719,13 @@ static ssize_t aio_run_iocb(struct kiocb
 * cause the iocb to be kicked for continuation (through
 * the aio_wake_function callback).
 */
-   BUG_ON(current->io_wait != NULL);
-   current->io_wait = &iocb->ki_wait;
+   BUG_ON(!is_sync_wait(current->io_wait));
+   current->io_wait = &iocb->ki_wait.wait;
ret = retry(iocb);
current->io_wait = NULL;
 
if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
-   BUG_ON(!list_empty(&iocb->ki_wait.task_list));
+   BUG_ON(!list_empty(&iocb->ki_wait.wait.task_list));
aio_complete(iocb, ret, 0);
}
 out:
@@ -883,7 +883,7 @@ static void try_queue_kicked_iocb(struct
 * than retry has happened before we could queue the iocb.  This also
 * means that the retry could have completed and freed our iocb, no
 * good. */
-   BUG_ON((!list_empty(&iocb->ki_wait.task_list)));
+   BUG_ON((!list_empty(&iocb->ki_wait.wait.task_list)));
 
spin_lock_irqsave(&ctx->ctx_lock, flags);
/* set this inside the lock so that we can't race with aio_run_iocb()
@@ -1519,7 +1519,13 @@ static ssize_t aio_setup_iocb(struct kio
 static int aio_wake_function(wait_queue_t *wait, unsigned mode,
 int sync, void *key)
 {
-   struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);
+   struct wait_bit_queue *wait_bit
+   = container_of(wait, struct wait_bit_queue, wait);
+   struct kiocb *iocb = container_of(wait_bit, struct kiocb, ki_wait);
+
+   /* Assumes that a non-NULL key implies wait bit filtering */
+   if (key && !test_wait_bit_key(wait, key))
+   return 0;
 
list_del_init(&wait->task_list);
kick_iocb(iocb);
@@ -1574,8 +1580,9 @@ int fastcall io_submit_one(struct kioctx
req->ki_buf = (char __user *)(unsigned long)iocb->aio_buf;
req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
req->ki_opcode = iocb->aio_lio_opcode;
-   init_waitqueue_func_entry(&req->ki_wait, aio_wake_function);
-   INIT_LIST_HEAD(&req->ki_wait.task_list);
+   init_waitqueue_func_entry(&req->ki_wait.wait, aio_wake_function);
+   INIT_LIST_HEAD(&req->ki_wait.wait.task_list);
+   req->ki_run_list.next = req->ki_run_list.prev = NULL;
 
ret = aio_setup_iocb(req);
 
diff -puN include/linux/aio.h~aio-wait-bit include/linux/aio.h
--- linux-2.6.20-rc1/include/linux/aio.h~aio-wait-bit   2006-12-21 
08:45:57.0 +0530
+++ linux-2.6.20-rc1-root/include/linux/aio.h   2006-12-28 09:32:27.0 
+0530
@@ -102,7 +102,7 @@ struct kiocb {
} ki_obj;
 
__u64   ki_user_data;   /* user's data for completion */
-   wait_queue_tki_wait;
+   struct wait_bit_queue   ki_wait;
loff_t  ki_pos;
 
atomic_tki_bio_count;   /* num bio used for this iocb */
@@ -135,7 +135,7 @@ struct kiocb {
(x)->ki_dtor = NULL;\
(x)->ki_obj.tsk = tsk;  \
(x)->ki_user_data = 0;  \
-   init_wait((&(x)->ki_wait)); \
+   init_wait_bit_task((&(x)->ki_wait), current);\
} while (0)
 
 #define AIO_RING_MAGIC 0xa10a10a1
@@ -237,7 +237,8 @@ do {
\
}   \
 } while (0)
 
-#define io_wait_to_kiocb(wait) container_of(wait, struct kiocb, ki_wait)
+#define io_wait_to_kiocb(io_wait) container_of(container_of(io_wait,   \
+   struct wait_bit_queue, wait), struct kiocb, ki_wait)
 
 #include

[FSAIO][PATCH 4/8] Add a default io wait bit field in task struct

2006-12-28 Thread Suparna Bhattacharya

Allocates space for the default io wait queue entry (actually a wait
bit entry) in the task struct. Doing so simplifies the AIO wait page
patches, allowing a cleaner and more efficient implementation, at the
cost of 28 additional bytes in the task struct vs on-demand on-stack
allocation.

Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---

 linux-2.6.20-rc1-root/include/linux/sched.h |   11 +++
 linux-2.6.20-rc1-root/kernel/fork.c |3 ++-
 2 files changed, 9 insertions(+), 5 deletions(-)

diff -puN include/linux/sched.h~tsk-default-io-wait include/linux/sched.h
--- linux-2.6.20-rc1/include/linux/sched.h~tsk-default-io-wait  2006-12-21 
08:45:51.0 +0530
+++ linux-2.6.20-rc1-root/include/linux/sched.h 2006-12-28 09:32:39.0 
+0530
@@ -1006,11 +1006,14 @@ struct task_struct {
 
unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use.  */
+
+/* Space for default IO wait bit entry used for synchronous IO waits */
+   struct wait_bit_queue __wait;
 /*
- * current io wait handle: wait queue entry to use for io waits
- * If this thread is processing aio, this points at the waitqueue
- * inside the currently handled kiocb. It may be NULL (i.e. default
- * to a stack based synchronous wait) if its doing sync IO.
+ * Current IO wait handle: wait queue entry to use for IO waits
+ * If this thread is processing AIO, this points at the waitqueue
+ * inside the currently handled kiocb. Otherwise, points to the
+ * default IO wait field (i.e &__wait.wait above).
  */
wait_queue_t *io_wait;
 /* i/o counters(bytes read/written, #syscalls */
diff -puN kernel/fork.c~tsk-default-io-wait kernel/fork.c
--- linux-2.6.20-rc1/kernel/fork.c~tsk-default-io-wait  2006-12-21 08:45:51.0 +0530
+++ linux-2.6.20-rc1-root/kernel/fork.c 2006-12-21 08:45:51.0 +0530
@@ -1056,7 +1056,8 @@ static struct task_struct *copy_process(
do_posix_clock_monotonic_gettime(&p->start_time);
p->security = NULL;
p->io_context = NULL;
-   p->io_wait = NULL;
+   init_wait_bit_task(&p->__wait, p);
+   p->io_wait = &p->__wait.wait;
p->audit_context = NULL;
    cpuset_fork(p);
 #ifdef CONFIG_NUMA
_
-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[FSAIO][PATCH 3/8] Routines to initialize and test a wait bit key

2006-12-28 Thread Suparna Bhattacharya

init_wait_bit_key() initializes the key field in an already 
allocated wait bit structure, useful for async wait bit support.
Also separate out the wait bit test to a common routine which
can be used by different kinds of wakeup callbacks.

Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---

 linux-2.6.20-rc1-root/include/linux/wait.h |   24 
 linux-2.6.20-rc1-root/kernel/wait.c|   11 ++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff -puN include/linux/wait.h~init-wait-bit-key include/linux/wait.h
--- linux-2.6.20-rc1/include/linux/wait.h~init-wait-bit-key 2006-12-21 08:45:46.0 +0530
+++ linux-2.6.20-rc1-root/include/linux/wait.h  2006-12-21 08:45:46.0 +0530
@@ -108,6 +108,17 @@ static inline int waitqueue_active(wait_
return !list_empty(&q->task_list);
 }
 
+static inline int test_wait_bit_key(wait_queue_t *wait,
+   struct wait_bit_key *key)
+{
+   struct wait_bit_queue *wait_bit
+   = container_of(wait, struct wait_bit_queue, wait);
+
+   return (wait_bit->key.flags == key->flags &&
+   wait_bit->key.bit_nr == key->bit_nr &&
+   !test_bit(key->bit_nr, key->flags));
+}
+
 /*
  * Used to distinguish between sync and async io wait context:
  * sync i/o typically specifies a NULL wait queue entry or a wait
@@ -416,6 +427,19 @@ int wake_bit_function(wait_queue_t *wait
INIT_LIST_HEAD(&(wait)->task_list); \
} while (0)
 
+#define init_wait_bit_key(waitbit, word, bit)  \
+   do {\
+   (waitbit)->key.flags = word;\
+   (waitbit)->key.bit_nr = bit;\
+   } while (0)
+
+#define init_wait_bit_task(waitbit, tsk)   \
+   do {\
+   (waitbit)->wait.private = tsk;  \
+   (waitbit)->wait.func = wake_bit_function;   \
+   INIT_LIST_HEAD(&(waitbit)->wait.task_list); \
+   } while (0)
+
 /**
  * wait_on_bit - wait for a bit to be cleared
  * @word: the word being waited on, a kernel virtual address
diff -puN kernel/wait.c~init-wait-bit-key kernel/wait.c
--- linux-2.6.20-rc1/kernel/wait.c~init-wait-bit-key 2006-12-21 08:45:46.0 +0530
+++ linux-2.6.20-rc1-root/kernel/wait.c 2006-12-28 09:32:44.0 +0530
@@ -139,16 +139,9 @@ EXPORT_SYMBOL(autoremove_wake_function);
 
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
 {
-   struct wait_bit_key *key = arg;
-   struct wait_bit_queue *wait_bit
-   = container_of(wait, struct wait_bit_queue, wait);
-
-   if (wait_bit->key.flags != key->flags ||
-   wait_bit->key.bit_nr != key->bit_nr ||
-   test_bit(key->bit_nr, key->flags))
+   if (!test_wait_bit_key(wait, arg))
return 0;
-   else
-   return autoremove_wake_function(wait, mode, sync, key);
+   return autoremove_wake_function(wait, mode, sync, arg);
 }
 EXPORT_SYMBOL(wake_bit_function);
 
_
-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



[FSAIO][PATCH 2/8] Rename __lock_page to lock_page_slow

2006-12-28 Thread Suparna Bhattacharya

In order to allow for interruptible and asynchronous versions of
lock_page in conjunction with the wait_on_bit changes, we need to
define low-level lock page routines which take an additional
argument, i.e a wait queue entry and may return non-zero status,
e.g -EINTR, -EIOCBRETRY, -EWOULDBLOCK etc. This patch renames 
__lock_page to lock_page_slow, so that __lock_page and 
__lock_page_slow can denote the versions which take a wait queue 
parameter.

Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---

 linux-2.6.20-rc1-root/include/linux/pagemap.h |4 ++--
 linux-2.6.20-rc1-root/mm/filemap.c|8 
 2 files changed, 6 insertions(+), 6 deletions(-)

diff -puN include/linux/pagemap.h~lock_page_slow include/linux/pagemap.h
--- linux-2.6.20-rc1/include/linux/pagemap.h~lock_page_slow 2006-12-21 08:45:40.0 +0530
+++ linux-2.6.20-rc1-root/include/linux/pagemap.h   2006-12-28 09:32:39.0 +0530
@@ -133,7 +133,7 @@ static inline pgoff_t linear_page_index(
return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 }
 
-extern void FASTCALL(__lock_page(struct page *page));
+extern void FASTCALL(lock_page_slow(struct page *page));
 extern void FASTCALL(__lock_page_nosync(struct page *page));
 extern void FASTCALL(unlock_page(struct page *page));
 
@@ -144,7 +144,7 @@ static inline void lock_page(struct page
 {
might_sleep();
if (TestSetPageLocked(page))
-   __lock_page(page);
+   lock_page_slow(page);
 }
 
 /*
diff -puN mm/filemap.c~lock_page_slow mm/filemap.c
--- linux-2.6.20-rc1/mm/filemap.c~lock_page_slow 2006-12-21 08:45:40.0 +0530
+++ linux-2.6.20-rc1-root/mm/filemap.c  2006-12-28 09:32:39.0 +0530
@@ -556,7 +556,7 @@ void end_page_writeback(struct page *pag
 EXPORT_SYMBOL(end_page_writeback);
 
 /**
- * __lock_page - get a lock on the page, assuming we need to sleep to get it
+ * lock_page_slow - get a lock on the page, assuming we need to sleep to get it
  * @page: the page to lock
  *
  * Ugly. Running sync_page() in state TASK_UNINTERRUPTIBLE is scary.  If some
@@ -564,14 +564,14 @@ EXPORT_SYMBOL(end_page_writeback);
  * chances are that on the second loop, the block layer's plug list is empty,
  * so sync_page() will then return in state TASK_UNINTERRUPTIBLE.
  */
-void fastcall __lock_page(struct page *page)
+void fastcall lock_page_slow(struct page *page)
 {
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
TASK_UNINTERRUPTIBLE);
 }
-EXPORT_SYMBOL(__lock_page);
+EXPORT_SYMBOL(lock_page_slow);
 
 /*
  * Variant of lock_page that does not require the caller to hold a reference
@@ -647,7 +647,7 @@ repeat:
page_cache_get(page);
if (TestSetPageLocked(page)) {
read_unlock_irq(&mapping->tree_lock);
-   __lock_page(page);
+   lock_page_slow(page);
read_lock_irq(&mapping->tree_lock);
 
/* Has the page been truncated while we slept? */
_
-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



[FSAIO][PATCH 1/6] Add a wait queue parameter to the wait_bit action routine

2006-12-28 Thread Suparna Bhattacharya

Add a wait queue parameter to the action routine called by 
__wait_on_bit to allow it to determine whether to block or
not.

Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---

 linux-2.6.20-rc1-root/fs/buffer.c  |2 +-
 linux-2.6.20-rc1-root/fs/inode.c   |2 +-
 linux-2.6.20-rc1-root/fs/nfs/inode.c   |2 +-
 linux-2.6.20-rc1-root/fs/nfs/nfs4proc.c|2 +-
 linux-2.6.20-rc1-root/fs/nfs/pagelist.c|2 +-
 linux-2.6.20-rc1-root/include/linux/sunrpc/sched.h |3 ++-
 linux-2.6.20-rc1-root/include/linux/wait.h |   18 --
 linux-2.6.20-rc1-root/include/linux/writeback.h|2 +-
 linux-2.6.20-rc1-root/kernel/wait.c|   14 --
 linux-2.6.20-rc1-root/mm/filemap.c |2 +-
 linux-2.6.20-rc1-root/net/sunrpc/sched.c   |5 +++--
 11 files changed, 32 insertions(+), 22 deletions(-)

diff -puN fs/buffer.c~modify-wait-bit-action-args fs/buffer.c
--- linux-2.6.20-rc1/fs/buffer.c~modify-wait-bit-action-args 2006-12-21 08:45:34.0 +0530
+++ linux-2.6.20-rc1-root/fs/buffer.c   2006-12-21 08:45:34.0 +0530
@@ -55,7 +55,7 @@ init_buffer(struct buffer_head *bh, bh_e
bh->b_private = private;
 }
 
-static int sync_buffer(void *word)
+static int sync_buffer(void *word, wait_queue_t *wait)
 {
struct block_device *bd;
struct buffer_head *bh
diff -puN fs/inode.c~modify-wait-bit-action-args fs/inode.c
--- linux-2.6.20-rc1/fs/inode.c~modify-wait-bit-action-args 2006-12-21 08:45:34.0 +0530
+++ linux-2.6.20-rc1-root/fs/inode.c 2006-12-21 08:45:34.0 +0530
@@ -1279,7 +1279,7 @@ void remove_dquot_ref(struct super_block
 
 #endif
 
-int inode_wait(void *word)
+int inode_wait(void *word, wait_queue_t *wait)
 {
schedule();
return 0;
diff -puN include/linux/wait.h~modify-wait-bit-action-args include/linux/wait.h
--- linux-2.6.20-rc1/include/linux/wait.h~modify-wait-bit-action-args 2006-12-21 08:45:34.0 +0530
+++ linux-2.6.20-rc1-root/include/linux/wait.h  2006-12-28 09:32:57.0 +0530
@@ -145,11 +145,15 @@ void FASTCALL(__wake_up(wait_queue_head_
extern void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned int mode));
extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
 void FASTCALL(__wake_up_bit(wait_queue_head_t *, void *, int));
-int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned));
-int FASTCALL(__wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned));
+int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *,
+   int (*)(void *, wait_queue_t *), unsigned));
+int FASTCALL(__wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *,
+   int (*)(void *, wait_queue_t *), unsigned));
 void FASTCALL(wake_up_bit(void *, int));
-int FASTCALL(out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned));
-int FASTCALL(out_of_line_wait_on_bit_lock(void *, int, int (*)(void *), unsigned));
+int FASTCALL(out_of_line_wait_on_bit(void *, int, int (*)(void *,
+   wait_queue_t *), unsigned));
+int FASTCALL(out_of_line_wait_on_bit_lock(void *, int, int (*)(void *,
+   wait_queue_t *), unsigned));
 wait_queue_head_t *FASTCALL(bit_waitqueue(void *, int));
 
#define wake_up(x) __wake_up(x, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, NULL)
@@ -427,7 +431,8 @@ int wake_bit_function(wait_queue_t *wait
  * but has no intention of setting it.
  */
 static inline int wait_on_bit(void *word, int bit,
-   int (*action)(void *), unsigned mode)
+   int (*action)(void *, wait_queue_t *),
+   unsigned mode)
 {
if (!test_bit(bit, word))
return 0;
@@ -451,7 +456,8 @@ static inline int wait_on_bit(void *word
  * clear with the intention of setting it, and when done, clearing it.
  */
 static inline int wait_on_bit_lock(void *word, int bit,
-   int (*action)(void *), unsigned mode)
+   int (*action)(void *, wait_queue_t *),
+   unsigned mode)
 {
if (!test_and_set_bit(bit, word))
return 0;
diff -puN include/linux/writeback.h~modify-wait-bit-action-args include/linux/writeback.h
--- linux-2.6.20-rc1/include/linux/writeback.h~modify-wait-bit-action-args 2006-12-21 08:45:34.0 +0530
+++ linux-2.6.20-rc1-root/include/linux/writeback.h 2006-12-21 08:45:34.0 +0530
@@ -66,7 +66,7 @@ struct writeback_control {
  */
 void writeback_inodes(struct writeback_control *wbc);
 void wake_up_inode(struct inode *inode);
-int inode_wait(void *);
+int inode_wait(void *, wait_queue_t *);
 void sync_inodes_sb(struct super_block *, i

[PATCHSET 1][PATCH 0/6] Filesystem AIO read/write

2006-12-28 Thread Suparna Bhattacharya

Currently native linux AIO is properly supported (in the sense of
actually being asynchronous) only for files opened with O_DIRECT.
While this suffices for a major (and most visible) user of AIO, i.e. databases,
other types of users like Samba require AIO support for regular file IO.
Also, for glibc POSIX AIO to be able to switch to using native AIO instead
of the current simulation using threads, it needs/expects asynchronous
behaviour for both O_DIRECT and buffered file AIO.

This patchset implements changes to make filesystem AIO read
and write asynchronous for the non O_DIRECT case. This is mainly
relevant in the case of reads of uncached or partially cached files, and
O_SYNC writes. 

Instead of translating regular IO to [AIO + wait], it translates AIO
to [regular IO - blocking + retries]. The intent of implementing it
this way is to avoid modifying or slowing down normal usage, by keeping
it pretty much the way it is without AIO, while avoiding code duplication.
Instead we make AIO vs regular IO checks inside io_schedule(), i.e. at
the blocking points. The low-level unit of distinction is a wait queue
entry, which in the AIO case is contained in an iocb and in the
synchronous IO case is associated with the calling task.

The core idea is that we complete as much IO as we can in a non-blocking
fashion, and then continue the remaining part of the transfer when
woken up asynchronously via a wait queue callback when pages are ready ...
thus each iteration progresses through more of the request until it is
completed. The interesting part here is that, owing largely to the
idempotence of the way radix-tree page cache traversal happens, every
iteration is simply a smaller read/write. Almost all of the iocb
manipulation and advancement in the AIO case happens in the high level
AIO code, rather than in regular VFS/filesystem paths.

The following is a sampling of comparative aio-stress results with the
patches (each run starts with uncached files):

-

aio-stress throughput comparisons (in MB/s):

file size 1GB, record size 64KB, depth 64, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB
4 way Pentium III SMP box, Adaptec AIC-7896/7 Ultra2 SCSI, 40 MB/s
Filesystem: ext2


                        Buffered (non O_DIRECT)         O_DIRECT
                        Vanilla         Patched         Vanilla   Patched
Random-Read             10.08           23.91           18.91     18.98
Random-O_SYNC-Write      8.86           15.84           16.51     16.53
Sequential-Read         31.49           33.00           31.86     31.79
Sequential-O_SYNC-Write  8.68           32.60           31.45     32.44
Random-Write            31.09 (19.65)   30.90 (19.65)
Sequential-Write        30.84 (28.94)   30.09 (28.39)



Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



[RFC] Heads up on a series of AIO patchsets

2006-12-27 Thread Suparna Bhattacharya

Here is a quick attempt to summarize where we are heading with a bunch of
AIO patches that I'll be posting over the next few days. Because a few of
these patches have been hanging around for a bit, and have gone through
bursts of iterations from time to time, falling dormant for other phases,
the intent of this note is to help pull things together into some coherent
picture for folks to comment on the patches and arrive at a decision of
some sort.

Native linux aio (i.e. using libaio) is properly supported (in the sense of
being asynchronous) only for files opened with O_DIRECT, which actually
suffices for a major (and most visible) user of AIO, i.e. databases.

However, for other types of users, e.g. Samba and other applications which
use POSIX AIO, there have been several issues outstanding for a while:

(1) The filesystem AIO patchset, attempts to address one part of the problem
which is to make regular file IO, (without O_DIRECT) asynchronous (mainly
the case of reads of uncached or partially cached files, and O_SYNC writes).

(2) Most of these other applications need the ability to process both
network events (epoll) and disk file AIO in the same loop. With POSIX AIO
they could at least sort of do this using signals (yeah, and all associated
issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
modifications from Jeff Moyer and me) addresses this problem for native
linux aio in a simple manner. Tridge has written a test harness to 
try out the Samba4 event library modifications to use this. Jeff Moyer
has a modified version of pipetest for comparison.

(3) For glibc POSIX AIO to switch to using native AIO (instead of simulation
with threads) kernel changes are needed to ensure aio sigevent notification
and efficient listio support. Sebastien Dugue's patches for aio sigevent
notifications have undergone several review iterations and seem to be
in good shape now. His patch for lio_listio is pending discussion
on whether to implement it as a separate syscall rather than an additional
iocb command. Bharata B Rao has posted a patch with the syscall variation
for review.

(4) If glibc POSIX AIO switches completely to using native AIO then it
would need basic AIO support for various file types - including sockets,
pipes etc. Since it no longer will be simulating asynchronous behaviour
with threads, it expects the underlying implementation to be asynchronous.
Which is still an issue with native linux AIO, but I now think the problem
to be tractable without a lot of additional work. While (1) helps the case
for regular files, (2) now provides us an alternative infrastructure to
simulate this in kernel using async epoll and O_NONBLOCK for all pollable
fds, i.e. sockets, pipes etc. This should be good enough for working
POSIX AIO.

(5) That leaves just one more todo - implementing aio_fsync() in kernel.

Please note that all of this work is not in conflict with kevent development.
In fact it is my hope that progress made in getting these pieces of the
puzzle in place would also help us along the long term goal of eventual
convergence.

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: GFS, what's remaining

2005-09-06 Thread Suparna Bhattacharya
On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> > 
> > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > >   possibly gain (or vice versa)
> > > > 
> > > > - Relative merits of the two offerings
> > > 
> > > You missed the important one - people actively use it and have been for
> > > some years. Same reason with have NTFS, HPFS, and all the others. On
> > > that alone it makes sense to include.
> >  
> > Again, that's not a technical reason.  It's _a_ reason, sure.  But what are
> > the technical reasons for merging gfs[2], ocfs2, both or neither?
> 
> There seems to be clearly a need for a shared-storage fs of some sort
> for HA clusters and virtualized usage (multiple guests sharing a
> partition).  Shared storage can be more efficient than network file
> systems like NFS because the storage access is often more efficient
> than network access  and it is more reliable because it doesn't have a
> single point of failure in form of the NFS server.
> 
> It's also a logical extension of the "failover on failure" clusters
> many people run now - instead of only failing over the shared fs at
> failure and keeping one machine idle the load can be balanced between
> multiple machines at any time.
> 
> One argument to merge both might be that nobody really knows yet which
> shared-storage file system (GFS or OCFS2) is better. The only way to
> find out would be to let the user base try out both, and that's most
> practical when they're merged.
> 
> Personally I think ocfs2 has nicer&cleaner code than GFS.
> It seems to be more or less a 64bit ext3 with cluster support, while

The "more or less" is what bothers me here - the first time I heard this,
it sounded a little misleading, as I expected to find some kind of a
patch to ext3 to make it 64 bit with extents and cluster support.
Now I understand it a little better (thanks to Joel and Mark)

And herein lies the issue where I tend to agree with Andrew on
-- it's really nice to have multiple filesystems innovating freely in
their niches and eventually proving themselves in practice, without
being bogged down by legacy etc. But at the same time, is there enough
thought and discussion about where the fragmentation/diversification is really
warranted, vs improving what is already there, or say incorporating
the best of one into another, maybe over a period of time ?

The number of filesystems seems to just keep growing, and supporting
all of them isn't easy -- for users it isn't really easy to switch from
one to another, and the justification for choosing between them is
sometimes confusing and burdensome from an administrator standpoint
- one filesystem is good in certain conditions, another in others,
stability levels may vary etc, and it's not always possible to predict
which aspect to prioritize.

Now, with filesystems that have been around in production for a long
time, the on-disk format becomes a major constraining factor, and the
reason for having various legacy support around. Likewise, for some
special purpose filesystems there really is a niche usage. But for new
and sufficiently general purpose filesystems, with new on-disk structure,
isn't it worth thinking this through and trying to get it right ? 

Yeah, it is a lot of work upfront ... but with double the people working
on something, it just might get much better than what they individually
can. Sometimes.

BTW, I don't know if it is worth it in this particular case, but just
something that worries me in general.

> GFS seems to reinvent a lot more things and has somewhat uglier code.
> On the other hand GFS' cluster support seems to be more aimed
> at being a universal cluster service open for other usages too,
> which might be a good thing. OCFS2s cluster seems to be more 
> aimed at only serving the file system.
> 
> But which one works better in practice is really an open question.

True, but what usually ends up happening is that this question can
never quite be answered in black and white. So both just continue
to exist and apps need to support both ... convergence becomes impossible
and long term duplication inevitable.

So at least having a clear demarcation/guideline of what situations
each is suitable for upfront would be a good thing. That might also
get some cross ocfs-gfs and ocfs-ext3 reviews in the process :)

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: aio-stress regressions in 2.6.12 narrowed down to AIC7xxx

2005-07-06 Thread Suparna Bhattacharya

OK, I think I have narrowed it down to aic7xxx driver, because when
I specify CONFIG_SCSI_AIC7XXX_OLD, it does not regress. 

The regression occurs from 2.6.11 to 2.6.12  (from 17MB/s it goes down to
11MB/s)

The regression is still there in 2.6.13-rc1 + the "speed fix" patch (discussed
in the recent aic7xxx regression thread on linux-scsi).

Recreate by running: aio-stress -O -o3 <1 GB testfile>

Config options:
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y


Comparing dmesg outputs
On 2.6.11
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36

aic7896/97: Ultra2 Wide Channel B, SCSI Id=7, 32/253 SCBs

(scsi0:A:0): 80.000MB/s transfers (40.000MHz, offset 63, 16bit)
  Vendor: IBM-ESXS  Model: ST318305LC!#  Rev: B245
  Type:   Direct-Access  ANSI SCSI revision: 03
scsi0:A:0:0: Tagged Queuing enabled.  Depth 32
(scsi0:A:1): 80.000MB/s transfers (40.000MHz, offset 63, 16bit)
  Vendor: IBM-ESXS  Model: ST318305LC!#  Rev: B245
...

On 2.6.12
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36

aic7896/97: Ultra2 Wide Channel B, SCSI Id=7, 32/253 SCBs

  Vendor: IBM-ESXS  Model: ST318305LC!#  Rev: B245
  Type:   Direct-Access  ANSI SCSI revision: 03
scsi0:A:0:0: Tagged Queuing enabled.  Depth 32
 target0:0:0: Beginning Domain Validation
WIDTH IS 1
(scsi0:A:0): 6.600MB/s transfers (16bit)
(scsi0:A:0): 80.000MB/s transfers (40.000MHz, offset 63, 16bit)
 target0:0:0: Ending Domain Validation
  Vendor: IBM-ESXS  Model: ST318305LC!#  Rev: B245
...


On 2.6.13-rc1 + "speed fix"
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36

aic7896/97: Ultra2 Wide Channel B, SCSI Id=7, 32/253 SCBs

 target0:0:0: asynchronous.
  Vendor: IBM-ESXS  Model: ST318305LC!#  Rev: B245
  Type:   Direct-Access  ANSI SCSI revision: 03
scsi0:A:0:0: Tagged Queuing enabled.  Depth 32
 target0:0:0: Beginning Domain Validation
 target0:0:0: wide asynchronous.
 target0:0:0: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 63)
 target0:0:0: Ending Domain Validation
target0:0:1: asynchronous.
  Vendor: IBM-ESXS  Model: ST318305LC!#  Rev: B245
...

Regards
Suparna


On Wed, Jul 06, 2005 at 04:07:29PM +0530, Suparna Bhattacharya wrote:
> On Tue, Jul 05, 2005 at 10:00:24AM -0400, Chris Mason wrote:
> > On Friday 01 July 2005 03:56, Suparna Bhattacharya wrote:
> > > Has anyone else noticed major throughput regressions for random
> > > reads/writes with aio-stress in 2.6.12 ?
> > > Or have there been any other FS/IO regressions lately ?
> > >
> > > On one test system I see a degradation from around 17+ MB/s to 11MB/s
> > > for random O_DIRECT AIO (aio-stress -o3 testext3/rwfile5) from 2.6.11
> > > to 2.6.12. It doesn't seem filesystem specific. Not good :(
> > >
> > > BTW, Chris/Ben, it doesn't look like the changes to aio.c have had an
> > > impact (I copied those back to my 2.6.11 tree and tried the runs with no
> > > effect) So it is something else ...
> > >
> > > Ideas/thoughts/observations ?
> > 
> > Lets try to narrow it down a bit:
> > 
> > aio-stress -o 3 -d 1 will set the depth to 1, (io_submit then wait one 
> > request 
> This doesn't regress - the problem really happens when we don't wait one
> at a time.
> 
> > at a time).  This doesn't take the aio subsystem out of the picture, but it 
> > does make the block layer interaction more or less the same as non-aio 
> > benchmarks.  If this is slow, I would suspect something in the block layer, 
> > and iozone -I -i 2 -w -f testext3/rwfile5 should also show the regression.
> > 
> > If it doesn't regress, I would suspect something in the aio core.  My first 
> > attempts at the context switch reduction patches caused this kind of 
> > regression.  There was too much latency in sending the events up to 
> > userland.
> > 
> > Other options:
> > 
> > Try different elevators
> 
> Still regresses (I tried noop instead of as)
> 
> > Try O_SYNC aio random writes
> > Try aio random reads
> > Try buffers random reads
> 
> Again all these end up waiting one at a time with mainline because it
> forces buffered AIO to be synchronous, so the regression doesn't
> show up. But, when I apply my patches to make buffered fsAIO async,
> so we aren't waiting one at a time -- there is a similar regression.
> In fact it was this regression that made me go back and check if it
> was happening with AIO-DIO as well.
> 
> Regards
> Suparna
> 
> > 

Re: aio-stress throughput regressions from 2.6.11 to 2.6.12

2005-07-06 Thread Suparna Bhattacharya
On Tue, Jul 05, 2005 at 10:00:24AM -0400, Chris Mason wrote:
> On Friday 01 July 2005 03:56, Suparna Bhattacharya wrote:
> > Has anyone else noticed major throughput regressions for random
> > reads/writes with aio-stress in 2.6.12 ?
> > Or have there been any other FS/IO regressions lately ?
> >
> > On one test system I see a degradation from around 17+ MB/s to 11MB/s
> > for random O_DIRECT AIO (aio-stress -o3 testext3/rwfile5) from 2.6.11
> > to 2.6.12. It doesn't seem filesystem specific. Not good :(
> >
> > BTW, Chris/Ben, it doesn't look like the changes to aio.c have had an
> > impact (I copied those back to my 2.6.11 tree and tried the runs with no
> > effect) So it is something else ...
> >
> > Ideas/thoughts/observations ?
> 
> Lets try to narrow it down a bit:
> 
> aio-stress -o 3 -d 1 will set the depth to 1, (io_submit then wait one 
> request 
This doesn't regress - the problem really happens when we don't wait one
at a time.

> at a time).  This doesn't take the aio subsystem out of the picture, but it 
> does make the block layer interaction more or less the same as non-aio 
> benchmarks.  If this is slow, I would suspect something in the block layer, 
> and iozone -I -i 2 -w -f testext3/rwfile5 should also show the regression.
> 
> If it doesn't regress, I would suspect something in the aio core.  My first 
> attempts at the context switch reduction patches caused this kind of 
> regression.  There was too much latency in sending the events up to userland.
> 
> Other options:
> 
> Try different elevators

Still regresses (I tried noop instead of as)

> Try O_SYNC aio random writes
> Try aio random reads
> Try buffers random reads

Again all these end up waiting one at a time with mainline because it
forces buffered AIO to be synchronous, so the regression doesn't
show up. But, when I apply my patches to make buffered fsAIO async,
so we aren't waiting one at a time -- there is a similar regression.
In fact it was this regression that made me go back and check if it
was happening with AIO-DIO as well.

Regards
Suparna

> 
> -chris

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: aio-stress throughput regressions from 2.6.11 to 2.6.12

2005-07-05 Thread Suparna Bhattacharya
On Fri, Jul 01, 2005 at 10:25:55AM -0400, Benjamin LaHaise wrote:
> On Fri, Jul 01, 2005 at 01:26:00PM +0530, Suparna Bhattacharya wrote:
> > On one test system I see a degradation from around 17+ MB/s to 11MB/s
> > for random O_DIRECT AIO (aio-stress -o3 testext3/rwfile5) from 2.6.11
> > to 2.6.12. It doesn't seem filesystem specific. Not good :(
> 
> What sort of io subsystem does it have?  Also, does changing the 

aic7xxx.
> elevator make any difference?  I'm away for the long weekend, but will 

I tried switching to the noop elevator - the regression was still there.

> help look into this next week.

Do you see the regression as well, or is it just me ?

Regards
Suparna

> 
>   -ben

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: [RFC] Add support for semaphore-like structure with support for asynchronous I/O

2005-04-10 Thread Suparna Bhattacharya


On Fri, Apr 08, 2005 at 07:31:46PM -0400, Trond Myklebust wrote:
> On Fri, 08.04.2005 at 18:39 (-0400), Benjamin LaHaise wrote:
> 
> > On the aio side of things, I introduced the owner field in the mutex (as 
> > opposed to the flag in Trond's iosem) for the next patch in the series to 
> > enable something like the following api:
> > 
> > int aio_lock_mutex(struct mutex *lock, struct iocb *iocb);
> 
> Any chance of a more generic interface too?
> 
> iocbs are fairly high level objects, and so I do not see them helping to
> resolve low level filesystem problems such as the NFSv4 state cleanup.

My preferred approach would be to make the wait queue element the
primitive, rather than the iocb, precisely for this reason.

Guess it's time for me to repost my aio-wait-bit based patch set - it
doesn't cover the async semaphores bit, but it should indicate the general
direction of thinking.

I still need to look at Ben's patches though.

Regards
Suparna

> 
> Cheers,
>   Trond
> 
> -- 
> Trond Myklebust <[EMAIL PROTECTED]>
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [EMAIL PROTECTED]  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: [EMAIL PROTECTED]

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: [RFC] Add support for semaphore-like structure with support for asynchronous I/O

2005-04-05 Thread Suparna Bhattacharya
On Tue, Apr 05, 2005 at 11:46:41AM -0400, Benjamin LaHaise wrote:
> On Mon, Apr 04, 2005 at 01:56:35PM -0400, Trond Myklebust wrote:
> > IOW: the current semaphore implementations really all need to die, and
> > be replaced by a single generic version to which it is actually
> > practical to add new functionality.
> 
> I can see that goal, but I don't think introducing iosems is the right 
> way to achieve it.  Instead (and I'll start tackling this), how about 
> factoring out the existing semaphore implementations to use a common 
> lib/semaphore.c, much like lib/rwsem.c?  The iosems can be used as a 
> basis for the implementation, but we can avoid having to do a giant 
> s/semaphore/iosem/g over the kernel tree.

That would be really neat, if you can get to it.

Regards
Suparna

> 
> > Failing that, it is _much_ easier to convert the generic code that needs
> > to support aio to use a new locking implementation and then test that.
> > It is not as if conversion to aio won't involve changes to the code in
> > the area surrounding those locks anyway.
> 
> Quite true.  There's a lot more work to do in this area, and these common 
> primitives are needed to make progress.  Someone at NetApp sent me an 
> email yesterday asking about aio support in NFS, so there is some demand 
> out there.  Cheers,
> 
>   -ben
> -- 
> "Time is what keeps everything from happening all at once." -- John Wheeler
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [EMAIL PROTECTED]  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: [EMAIL PROTECTED]

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: AIO and vectored I/O support for cifs

2005-04-03 Thread Suparna Bhattacharya

cc'ing linux-aio for the AIO part of the discussion. You might be able
to find some of your answers in the archives.

On Fri, Mar 25, 2005 at 03:26:23PM -0600, Steve French wrote:
> Christoph,
> I had time to add the generic vectored i/o and async i/o calls to cifs 
> that you had suggested last month.  They are within the ifdef for the 
> CIFS_EXPERIMENTAL config option for the time being.   I would like to do 
> more testing of these though - are there any tests (even primitive ones) 
> for readv/writev and async i/o?
> 
> Is there an easy way of measuring the performance benefit of these (vs. the 
> fallback routines in fs/read_write.c) - since presumably async and 
> vectored i/o never kick in on a standard copy command such as cp or dd 
> and require a modified application that is vector i/o aware or async i/o 
> aware.

There are several tests for AIO - I tend to use Chris Mason's aio-stress,
which can compare throughput for streaming reads/writes across
different combinations of options.

(the following page isn't exactly up-to-date, but should still give
you some pointers: lse.sf.net/io/aio.html)

> 
> You had mentioned do_sync_read - is there a reason to change the current 
> call to generic_file_read in the cifs read entry point to do_sync_read.  
> Some filesystems which export aio routines still call generic_file_read 
> and others call do_sync_read and it was not obvious to me what that 
> would change.

I think you could keep it the way it is - generic_file_read will take care
of things. But maybe I should comment only after I see your patch. Are
you planning to post it some time ?

Regards
Suparna

> 
> This is partly to better limit reading from the pagecache when the read 
> oplock is lost (ie when we do not have the network caching token 
> allowing readahead from the server) but primarily because I would like 
> to see if this could help with getting more parallelism in the single 
> client to single server large file sequential copy case.   Currently 
> CIFS can do large operations (as large as 128K for read or write in some 
> cases) - and this is much more efficient for network transfer - but 
> without mounting with forcedirectio I had limited my cifs_readpages to 
> 16K (4 pages typically) - and because I do the SMBread synchronously I 
> am severely limiting parallelism in the case of a single threaded app.  
> So where I would like to get to is during readahead having multiple SMB 
> reads for the same inode on the wire at one time - with the SMB reads 
> each larger than a page (between 4 and 30 pages) - and was hoping that 
> the aio and readv/writev support would make that easier.  
> 
> I probably need to look more at the NFS direct i/o example to see if 
> there are easy changes I can make to enable it on a per inode basis 
> (rather than as only a mount option), and to double check what other 
> filesystem do for returning errors on mmap and sendfile on inodes that 
> are marked direct i/o.

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: [RFC] Add support for semaphore-like structure with support for asynchronous I/O

2005-04-01 Thread Suparna Bhattacharya
On Thu, Mar 31, 2005 at 08:22:17PM -0500, Trond Myklebust wrote:
> On Thursday, 31.03.2005 at 16:13 (-0800), Andrew Morton wrote:
> > Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > >
> > >  On Wednesday, 30.03.2005 at 18:17 (-0500), Trond Myklebust wrote:
> > >  > > Or have I misunderstood the intent?  Some /* comments */ would be 
> > > appropriate..
> > >  > 
> > >  > Will do.
> > > 
> > >  OK. Plenty of comments added that will hopefully clarify what is going
> > >  on and how to use the API. Also some cleanups of the code.
> > 
> > Ah, so that's what it does ;)
> > 
> > I guess once we have a caller in-tree we could merge this.  I wonder if
> > there's other existing code which should be converted to iosems.
> 
> I can put it into the NFS client stream which feeds into the -mm kernel.
> That will enable me to queue up the NFSv4 patches that depend on it
> too...
> 
> > You chose to not use the aio kernel threads?
> 
> I thought I'd do that in a separate patch since the aio workqueue is
> currently statically defined in aio.c.

I'll take a look at the patch over the weekend. I had a patch
for aio semaphores a long while back, but I need to spend some time
to understand how different this is.

Regards
Suparna

> 
> > Does iosem_lock_and_schedule_function() need locking?  It nonatomically
> > alters *lk_state.
> 
> iosem_lock_and_schedule_function() will always be called with the
> iosem->wait.lock held, since it is a waitqueue notification function.
> 
> In practice it is called by iosem_unlock(). The call to wake_up_locked()
> will trigger a call to __wake_up_common() which again tries the
> notification function of each waiter on the queue until it finds one
> that succeeds.
> 
> Cheers,
>   Trond
> 
> -- 
> Trond Myklebust <[EMAIL PROTECTED]>
> 

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-14 Thread Suparna Bhattacharya
On Mon, Mar 14, 2005 at 05:36:58AM -0300, Werner Almesberger wrote:
> Mingming Cao wrote:
> > I agree delayed allocation makes much sense with multiblock allocation.
> > But I still think it's worth the effort on its own, even without multiblock
> > allocation.
> 
> On ABISS, we're currently also experimenting with delayed allocation.
> There, the goal is less to improve overall performance, but to move
> the accesses out of the synchronous code path for write(2).
> 
> The code works quite nicely for FAT and ext2, limiting the time it
> takes to make a write call writing new data to about 4-6 ms on a
> fairly sluggish machine (plus about 2-4 ms for moving the playout
> point, which is a separate operation in ABISS), and with eight
> competing best-effort writers who each enjoy write latencies of some
> 8 seconds, worst-case, overwriting old data.
> 
> Of course, this fails horribly on ext3, because it doesn't do anything
> useful with the journal. Another problem is error handling. Since FAT
> and ext2 don't have any form of reservation, a full disk isn't detected
> until it's far too late.
> 
> So, a VFS-level reservation function would indeed be nice to have.
> 
> I looked at ext3 delalloc briefly, and while it did indeed improve
> performance quite nicely, by being tied to ext3 internals, it would
> be difficult to use in the framework of ABISS, where the code paths
> are different (e.g. the prepare/commit functions should be as close
> to no-ops as possible, and leave all the work to the prefetcher
> thread), and which tries to be relatively file system independent.

I'm looking at whether we can do most of it at the VFS level ... with
ext3 only taking care of the additional journalling bit - that seems
quite feasible. There are two requirements: (1) reservation, and (2) changing
mpage_writepages to use get_blocks(); neither seems too hard.
ext3 ordered mode will need a bit more thought.

Of course, I haven't looked at how ABISS does delayed alloc -- 
do you have a patch snippet I can look at ?

Regards
Suparna

> 
> - Werner
> 
> -- 
>   _
>  / Werner Almesberger, Buenos Aires, Argentina [EMAIL PROTECTED] /
> /_http://www.almesberger.net//
> 

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Delayed alloc for ordered-mode

2005-03-13 Thread Suparna Bhattacharya
> --- linux-2.6.11/fs/ext3/writeback.c  2005-03-04 15:10:01.0 +0300
> +++ linux-2.6.11/fs/ext3/writeback.c  2005-03-04 17:33:05.0 +0300
> @@ -145,6 +145,17 @@
>   if (bio->bi_size)
>   return 1;
>  
> + if (bio->bi_private) {
> + transaction_t *transaction = bio->bi_private;
> +
> + /* 
> +  * journal_commit_transaction() may be awaiting
> +  * the bio to complete.
> +  */
> + if (atomic_dec_and_test(&transaction->t_bios_in_flight))
> + wake_up(&transaction->t_journal->j_wait_bios);
> + }
> +
>   do {
>   struct page *page = bvec->bv_page;
>  
> @@ -162,6 +173,16 @@
>  static struct bio *ext3_wb_bio_submit(struct bio *bio, handle_t *handle)
>  {
>   bio->bi_end_io = ext3_wb_end_io;
> + if (handle) {
> + /*
> +  * In data=ordered we shouldn't commit the transaction
> +  * until all data related to the transaction get on a
> +  * platter.
> +  */
> + atomic_inc(&handle->h_transaction->t_bios_in_flight);
> + bio->bi_private = handle->h_transaction;
> + } else
> + bio->bi_private = NULL;
>   submit_bio(WRITE, bio);
>   return NULL;
>  }

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: [Ext2-devel] Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-04 Thread Suparna Bhattacharya
On Thu, Mar 03, 2005 at 02:40:21AM -0700, Andreas Dilger wrote:
> On Mar 03, 2005  14:03 +0530, Suparna Bhattacharya wrote:
> > diffstat of the 3 patches : 22 files changed, 5920 insertions(+), 
> > 47 deletions. The largest is in the extents patch (2743), mballoc 
> > is 1968, and delalloc is 1209. To use delalloc, which gives us
> > all the performance benefits, right now we need all the 3 patches
> > to be used in conjunction. Supporting extent map btrees as well 
> > as traditional indexing and associated options for compatibility etc
> > is perhaps the more invasive of changes. Given that keeping ext3 
> > stable and maintainable is a key concern (that is after all a major 
> > reason why a lot of users rely on ext3), a somewhat incremental 
> > approach is desirable. 
> > 
> > So, I'll start from the direction that has been suggested by
> > some -- (1) delayed allocation without changing the
> > on-disk format. And then later (2) go on to breaking format with 
> > all changes for scalability to larger files with full extents 
> > support (haven't thought enough about this yet - maybe in a
> > separate mail)
> 
> Well, for a starter, the extents format changes are not forced on
> users, only if they mount with "-o extents" and write files will
> it mark the superblock incompatible and start allocating files
> this way.  I believe (though I have never tested) that even if
> extents are enabled, writes to a block-mapped file will continue
> to work and that file will not be converted to an extent file.

Files that are created with extents will not be viewable by an older
kernel, though (I think) - which is where the format breakage comes
in (is that correct ?). But I don't see this as a major issue, since 
it can perhaps be taken care of through a little bit of migration 
tooling as Ted indicated. 

So, compatibility in itself wasn't the main concern bothering me 
but how we could make it easier to assure stability & maintainability
even with all the cool stuff. For example, if we have both mballoc 
and regular balloc and similarly extents and regular indexing based 
on growth patterns (a nice idea, btw), does it multiply the 
scenarios to verify on the testing front ? Or in dealing with changes
in the future ? I'm guessing that this might be one of the things (besides
agreement on the disk layout) holding up inclusion of extents, despite
the patches being around for a while now .. but then I could be wrong.
B-tree based extent maps were mentioned by sct way back in his 2000 
paper ! And of course every filesystem out there implements B-trees in
its own way.

I can see arguments flying both ways ... at what point do we decide
to break towards an ext4 ? 

BTW, has anyone tried playing with the idea of ext4 as not a 
cp -r fs/ext3 fs/ext4 and edit, but if possible using some layered
filesystem techniques to reuse much of ext3 directly, and just override
a few operations (like get_blocks for extents etc) where there 
is a layout impact ? 

Alex, have you had a chance to prototype your idea of rooting extents
in ea ?

> 
> > A few random things that come to mind for (1), going through the code:
> > 
> > - There might be possibilities for code reduction, by extending
> >   generic routines as far as possible, e.g. ext3_wb_writepages
> >   has a lot in common with generic writepages. That would
> >   also make it easier to maintain.
> 
> I'm sure some support for this could be gotten from e.g. XFS as well,
> since their filesystem (on Irix at least) was all about delayed alloc
> (not sure what it does under Linux), and I believe ReiserFS/Reiser4
> also desire the ability to have delayed allocation from the VFS (i.e.
> some sort of light-weight "reserve space" call for each page dirtied
> and then getting the actual file + offsets en masse later (if the
> VFS/VM doesn't discard the whole thing).

*nod*

Regards
Suparna

> 
> Cheers, Andreas
> --
> Andreas Dilger
> http://sourceforge.net/projects/ext2resize/
> http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/
> 



-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-03 Thread Suparna Bhattacharya
On Thu, Mar 03, 2005 at 05:46:13PM -0800, Mingming Cao wrote:
> On Thu, 2005-03-03 at 17:12 -0800, Badari Pulavarty wrote:
> > Just doing delayed allocation without multiblock allocation
> > (with the current layout) is not really a useful thing, IMHO.
> > We will benefit a few cases, but in general - we moved the
> > block allocation overhead from prepare write to writepages/writepage
> > time. There is a little benefit of not doing journaling twice etc..
> > but I don't think it would be enough to justify the effort. 
> > Isn't it ?
> > 
> 
> Hi Badari
> 
> I agree delayed allocation makes much sense with multiblock allocation.
> But I still think it's worth the effort on its own, even without multiblock
> allocation. If we have a seeky random-write application, and the
> application later tries to fill those holes, we will normally end up with
> a pretty ugly file layout. With delayed allocation, we could have a better
> chance of getting contiguous blocks on disk for that file.
> 
> I happened found Ted has mentioned this before:
> http://marc.theaimsgroup.com/?l=ext2-devel&m=107239591117758&w=2
> 
> > So, may be we should look at adding multiblock allocation +
> > delayed allocation to current ext3 layout. Then we can evaluate
> > the benefits of having "extents" etc and then break the layout ?
> > 
> 
> Current reservation code could be improved to return back how big the
> free chunk inside the window, and we could use that to help make
> ext3_new_blocks()/ext3_get_blocks() happen.

Yup this is exactly what I was thinking.

It'll probably only be a step along the way ... but I am hoping that
this will give us a direction to merge these pieces in incrementally, 
a little at a time, each piece being very well-understood and with 
demonstrated performance improvements at every step. For example, 
the next step after the following could be to plug parts of mballoc
into the above, etc ... 

Does that make sense ?

Regards
Suparna


-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Reviewing ext3 improvement patches (delalloc, mballoc, extents)

2005-03-03 Thread Suparna Bhattacharya

Since the performance improvements seen so far are quite encouraging, 
and momentum is picking up so well, I started looking through the
patches from Alex ... just a quick code walkthrough to get the hang
of it and think about what kind of simplifications might be possible
and what it might take for inclusion.

I haven't had a chance to go too deep line by line yet,
but thought I'd initiate some discussion with some first impressions
and summary of what directions I hear several people converging
towards to validate if I'm on the right track here.

diffstat of the 3 patches : 22 files changed, 5920 insertions(+), 
47 deletions. The largest is in the extents patch (2743), mballoc 
is 1968, and delalloc is 1209. To use delalloc, which gives us
all the performance benefits, right now we need all the 3 patches
to be used in conjunction. Supporting extent map btrees as well 
as traditional indexing and associated options for compatibility etc
is perhaps the more invasive of changes. Given that keeping ext3 
stable and maintainable is a key concern (that is after all a major 
reason why a lot of users rely on ext3), a somewhat incremental 
approach is desirable. 

So, I'll start from the direction that has been suggested by
some -- (1) delayed allocation without changing the
on-disk format. And then later (2) go on to breaking format with 
all changes for scalability to larger files with full extents 
support (haven't thought enough about this yet - maybe in a
separate mail)

A few random things that come to mind for (1), going through the code:

- There might be possibilities for code reduction, by extending
  generic routines as far as possible, e.g. ext3_wb_writepages
  has a lot in common with generic writepages. That would
  also make it easier to maintain.
- Similarly, how about (as Mingming I think already hinted) 
  implementing ext3_get_blocks to do multi-block lookup and 
  allocation and using it in delalloc ?

Hmm, maybe I speak too soon - have to look at the interfaces more
closely and verify if this is feasible. 

Regards
Suparna


-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
