Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-02 Thread Aneesh Kumar K.V



Mingming Cao wrote:

This patch is on top of i_version_update_vfs.
The i_version field of the inode is set on inode creation and incremented when
the inode is being modified.
ta));
ei->i_dir_start_lookup = 0;
Index: linux-2.6.22-rc4/fs/ext4/inode.c
===
--- linux-2.6.22-rc4.orig/fs/ext4/inode.c   2007-06-13 17:21:29.0 -0700
+++ linux-2.6.22-rc4/fs/ext4/inode.c        2007-06-13 17:24:45.0 -0700
@@ -3082,6 +3082,7 @@ int ext4_mark_iloc_dirty(handle_t *handl
 {
    int err = 0;

+   inode->i_version++;
    /* the do_update_inode consumes one bh->b_count */
    get_bh(iloc->bh);



If we bump i_version in ext4_mark_iloc_dirty then we should remove the
i_version updates in other places. A simple grep of the ext4 directory shows
i_version being updated and then followed by a call to ext4_mark_inode_dirty.
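
To make the concern concrete, here is a minimal userspace sketch of the double increment. All names (struct toy_inode and the toy_* functions) are hypothetical stand-ins, not the ext4 code; the only assumption taken from the patch is that the bump moves into the iloc-dirty helper, which the inode-dirty path ends up calling.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel inode; only the field under discussion. */
struct toy_inode {
	uint64_t i_version;
};

/* Models the patched ext4_mark_iloc_dirty(): the version bump now lives here. */
void toy_mark_iloc_dirty(struct toy_inode *inode)
{
	assert(inode);
	inode->i_version++;
}

/* Models ext4_mark_inode_dirty(), which ends up in the helper above. */
void toy_mark_inode_dirty(struct toy_inode *inode)
{
	toy_mark_iloc_dirty(inode);
}

/* A caller that still carries its own bump: one change, two increments. */
void toy_modify_with_stale_bump(struct toy_inode *inode)
{
	inode->i_version++;
	toy_mark_inode_dirty(inode);
}

/* The same caller after cleanup: one change, one increment. */
void toy_modify_cleaned_up(struct toy_inode *inode)
{
	toy_mark_inode_dirty(inode);
}
```

If the existing call sites are left alone, every one of them would look like the first caller and count each modification twice.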



-aneesh
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-02 Thread Aneesh Kumar K.V



Mingming Cao wrote:


Index: linux-2.6.22-rc4/fs/ext4/super.c
===
--- linux-2.6.22-rc4.orig/fs/ext4/super.c   2007-06-13 17:19:11.0 -0700
+++ linux-2.6.22-rc4/fs/ext4/super.c        2007-06-13 17:24:45.0 -0700
@@ -2846,8 +2846,8 @@ out:
        i_size_write(inode, off+len-towrite);
        EXT4_I(inode)->i_disksize = inode->i_size;
    }
-   inode->i_version++;
    inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+   inode->i_version = 1;
    ext4_mark_inode_dirty(handle, inode);
    mutex_unlock(&inode->i_mutex);
    return len - towrite;



Is this correct? Why do we set the quota file inode's version to 1 during
a write?


- aneesh


Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-02 Thread Aneesh Kumar K.V



Mingming Cao wrote:

This patch is a spinoff of the old nanosecond patches.

It includes some cleanups and addition of a creation timestamp. The
EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with
s


Should be EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE

-aneesh


Re: [PATCH] Teach do_mpage_readpage() about unwritten buffers

2007-07-02 Thread David Chinner
On Mon, Jul 02, 2007 at 08:28:27PM -0700, Andrew Morton wrote:
> On Tue, 3 Jul 2007 11:10:19 +1000 David Chinner <[EMAIL PROTECTED]> wrote:
> 
> Seems sane, although one does wonder whether it's a worthy tradeoff.  We
> add additional overhead to readpage[s]() just to avoid some IO during
> mkswap?

It removes the hidden magic from XFS that hides unwritten extents
behind holes so that they get zero filled by readpages. With other
filesystems starting to use unwritten extents, these sorts of tricksy
hacks should be avoided and we should be explicitly handling unwritten
buffers just like we explicitly handle holes.

The side effect of this change is that other things (like sys_swapon)
behave sanely when faced with unwritten extents (i.e. they are not
detected incorrectly as holes).

FWIW, using unwritten extents for swap files means you can allocate
gigabytes of swap space on demand, even under severe memory pressure
because you don't need to write those gigabytes of data to disk.
Also, the only way to read the contents of the swap file is to go
directly to the underlying device so even if you screw up the
permissions on the swap file it will still always read as zeros.
So there are some benefits here.
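
The distinction can be sketched in a few lines of userspace C. The flag names and helpers below are hypothetical; they only mirror the buffer_mapped()/buffer_unwritten() tests in the patch and are not the mpage code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags mirroring buffer_mapped() and buffer_unwritten(). */
enum toy_bh_flags {
	TOY_BH_MAPPED    = 1 << 0,
	TOY_BH_UNWRITTEN = 1 << 1,
};

/*
 * Read path: should the page be zero-filled instead of read from disk?
 * Mirrors the patched test: !buffer_mapped() || buffer_unwritten().
 */
bool toy_read_zero_fills(unsigned flags)
{
	/* Reject flag bits this sketch does not know about. */
	assert((flags & ~(unsigned)(TOY_BH_MAPPED | TOY_BH_UNWRITTEN)) == 0);
	return !(flags & TOY_BH_MAPPED) || (flags & TOY_BH_UNWRITTEN);
}

/*
 * Extent-mapping users (swapon is the example in this thread) only ask
 * whether a disk block backs the range, written or not.
 */
bool toy_has_disk_block(unsigned flags)
{
	return (flags & TOY_BH_MAPPED) != 0;
}
```

A hole and an unwritten extent both read as zeros, but only the unwritten extent reports a disk block, which is what lets a preallocated file be used for swap without being mistaken for a sparse one.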

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH] Teach do_mpage_readpage() about unwritten buffers

2007-07-02 Thread Andrew Morton
On Tue, 3 Jul 2007 11:10:19 +1000 David Chinner <[EMAIL PROTECTED]> wrote:

> Teach do_mpage_readpage() about unwritten extents so we can
> always map them in get_blocks rather than treating them as holes on
> read. Allows setup_swap_extents() to use preallocated files on XFS
> filesystems for swap files without ever needing to convert them.
> 
> Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]>
> 
> ---
>  fs/mpage.c  |5 +++--
>  fs/xfs/linux-2.6/xfs_aops.c |   13 +++--
>  2 files changed, 6 insertions(+), 12 deletions(-)
> 
> Index: 2.6.x-xfs-new/fs/mpage.c
> ===
> --- 2.6.x-xfs-new.orig/fs/mpage.c 2007-05-29 16:17:59.0 +1000
> +++ 2.6.x-xfs-new/fs/mpage.c  2007-06-27 22:39:35.568852270 +1000
> @@ -207,7 +207,8 @@ do_mpage_readpage(struct bio *bio, struc
>* Map blocks using the result from the previous get_blocks call first.
>*/
>   nblocks = map_bh->b_size >> blkbits;
> - if (buffer_mapped(map_bh) && block_in_file > *first_logical_block &&
> + if (buffer_mapped(map_bh) && !buffer_unwritten(map_bh) &&
> + block_in_file > *first_logical_block &&
>   block_in_file < (*first_logical_block + nblocks)) {
>   unsigned map_offset = block_in_file - *first_logical_block;
>   unsigned last = nblocks - map_offset;
> @@ -242,7 +243,7 @@ do_mpage_readpage(struct bio *bio, struc
>   *first_logical_block = block_in_file;
>   }
>  
> - if (!buffer_mapped(map_bh)) {
> + if (!buffer_mapped(map_bh) || buffer_unwritten(map_bh)) {
>   fully_mapped = 0;
>   if (first_hole == blocks_per_page)
>   first_hole = page_block;
> Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c
> ===
> --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c  2007-06-05 22:14:39.0 +1000
> +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c       2007-06-27 22:39:29.545636749 +1000
> @@ -1340,16 +1340,9 @@ __xfs_get_blocks(
>   return 0;
>  
>   if (iomap.iomap_bn != IOMAP_DADDR_NULL) {
> - /*
> -  * For unwritten extents do not report a disk address on
> -  * the read case (treat as if we're reading into a hole).
> -  */
> - if (create || !(iomap.iomap_flags & IOMAP_UNWRITTEN)) {
> - xfs_map_buffer(bh_result, &iomap, offset,
> -inode->i_blkbits);
> - }
> - if (create && (iomap.iomap_flags & IOMAP_UNWRITTEN)) {
> - if (direct)
> + xfs_map_buffer(bh_result, &iomap, offset, inode->i_blkbits);
> + if (iomap.iomap_flags & IOMAP_UNWRITTEN) {
> + if (create && direct)
>   bh_result->b_private = inode;
>   set_buffer_unwritten(bh_result);
>   }

Seems sane, although one does wonder whether it's a worthy tradeoff.  We
add additional overhead to readpage[s]() just to avoid some IO during
mkswap?

Also, I wonder what ext4 does in this situation?


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-02 Thread Mingming Cao
Trond or Bruce, could you please review this patch series and ack it if
you agree? Thanks.

As to the performance concerns raised before: the inode version counter
update (at least for ext4) is done inside ext4_mark_inode_dirty(), so
there is no extra I/O work to store this counter on disk.

Mingming
On Sun, 2007-07-01 at 03:37 -0400, Mingming Cao wrote:
> This patch converts the 32-bit i_version in the generic inode to a 64-bit
> i_version field.
> 
> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> Signed-off-by: Jean Noel Cordenner <[EMAIL PROTECTED]>
> Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]>
> 
> Index: linux-2.6.21/include/linux/fs.h
> ===
> --- linux-2.6.21.orig/include/linux/fs.h
> +++ linux-2.6.21/include/linux/fs.h
> @@ -549,7 +549,7 @@ struct inode {
>   uid_t   i_uid;
>   gid_t   i_gid;
>   dev_t   i_rdev;
> - unsigned long   i_version;
> + u64 i_version;
>   loff_t  i_size;
>  #ifdef __NEED_I_SIZE_ORDERED
>   seqcount_t  i_size_seqcount;
> 
> 



Re: fallocate support for bitmap-based files

2007-07-02 Thread Mingming Cao
On Sat, 2007-06-30 at 13:29 -0400, Andreas Dilger wrote:
> On Jun 30, 2007  10:13 -0400, Mingming Cao wrote:
> > Another approach we have been thinking about is using a backing
> > inode (one per inode with preallocation) to store the preallocated
> > blocks.  When the user asks for preallocation on the base inode,
> > ext2/3 creates a temporary backing inode and (pre)allocates the
> > corresponding blocks in the backing inode.
> > 
> > On a write to the base inode that needs block allocation, before
> > doing the real block allocation the fs checks whether the file has a
> > backing inode that stores preallocated blocks for the same logical
> > blocks.  If so, it transfers the preallocated blocks from the backing
> > inode to the base inode.
> > 
> > We need to link the two inodes in some way, maybe by storing the
> > backing inode number via an EA in the base inode, and flagging the
> > base inode as having a backing inode to get preallocated blocks from.
> > 
> > Since this doesn't change the block mapping of the original file
> > until writeout, it doesn't require an incompat feature to keep the
> > preallocated contents from being read by an "old" kernel.  Some work
> > needs to be done in e2fsck to understand the backing inode.
> 
> I don't know if you realize, but this is half-way to supporting
> snapshots within the filesystem.  

From your description it seems similar, but I'm not sure it's half-way
yet. Just to clarify: what's stored in the backing inode (in the
preallocation case) is just metablocks, not data blocks. The transfer
(from the backing inode to the base inode) does not involve any data
block migration.

Another comment: if we are seriously looking at supporting preallocation
in ext2 upstream, I'd like to choose a solution suitable for ext3 as
well. Taking a bit from the block number to flag preallocated blocks
means reducing the ext2/3 fs limit to 8TB, which is probably not a big
deal for ext2, but not so good for ext3.
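
For reference, the 8TB figure follows from simple arithmetic. The sketch below assumes 32-bit block numbers and a 4 KiB block size (the block size is an assumption, it is not stated above); the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Largest filesystem size, in bytes, addressable with `bits` bits of block number. */
uint64_t toy_max_fs_bytes(unsigned bits, uint64_t block_size)
{
	assert(bits < 64);
	return ((uint64_t)1 << bits) * block_size;
}
```

With these assumptions, toy_max_fs_bytes(32, 4096) is 16 TiB, and giving up one bit for the flag, toy_max_fs_bytes(31, 4096), halves that to 8 TiB.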

Mingming





Re: fallocate support for bitmap-based files

2007-07-02 Thread Badari Pulavarty
On Sat, 2007-06-30 at 10:13 -0400, Mingming Cao wrote:
> On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
> > Guys, Mike and Sreenivasa at google are looking into implementing
> > fallocate() on ext2.  Of course, any such implementation could and should
> > also be portable to ext3 and ext4 bitmapped files.
> > 
> > I believe that Sreenivasa will mainly be doing the implementation work.
> > 
> > 
> > The basic plan is as follows:
> > 
> > - Create (with tune2fs and mke2fs) a hidden file using one of the
> >   reserved inode numbers.  That file will be sized to have one bit for each
> >   block in the partition.  Let's call this the "unwritten block file".
> > 
> >   The unwritten block file will be initialised with all-zeroes
> > 
> > - at fallocate()-time, allocate the blocks to the user's file (in some
> >   yet-to-be-determined fashion) and, for each one which is uninitialised,
> >   set its bit in the unwritten block file.  The set bit means "this block
> >   is uninitialised and needs to be zeroed out on read".
> > 
> > - truncate() would need to clear out set-bits in the unwritten blocks file.
> > 
> > - When the fs comes to read a block from disk, it will need to consult
> >   the unwritten blocks file to see if that block should be zeroed by the
> >   CPU.
> > 
> > - When the unwritten-block is written to, its bit in the unwritten blocks
> >   file gets zeroed.
> > 
> > - An obvious efficiency concern: if a user file has no unwritten blocks
> >   in it, we don't need to consult the unwritten blocks file.
> > 
> >   Need to work out how to do this.  An obvious solution would be to have
> >   a number-of-unwritten-blocks counter in the inode.  But do we have space
> >   for that?
> > 
> >   (I expect google and others would prefer that the on-disk format be
> >   compatible with legacy ext2!)
> > 
> > - One concern is the following scenario:
> > 
> >   - Mount fs with "new" kernel, fallocate() some blocks to a file.
> > 
> >   - Now, mount the fs under "old" kernel (which doesn't understand the
> > unwritten blocks file).
> > 
> > - This kernel will be able to read uninitialised data from that
> >   fallocated-to file, which is a security concern.
> > 
> >   - Now, the "old" kernel writes some data to a fallocated block.  But
> > this kernel doesn't know that it needs to clear that block's flag in
> > the unwritten blocks file!
> > 
> >   - Now mount that fs under the "new" kernel and try to read that file.
> >  The flag for the block is set, so this kernel will still zero out the
> > data on a read, thus corrupting the user's data
> > 
> >   So how to fix this?  Perhaps with a per-inode flag indicating "this
> >   inode has unwritten blocks".  But to fix this problem, we'd require that
> >   the "old" kernel clear out that flag.
> > 
> >   Can anyone propose a solution to this?
> > 
> >   Ah, I can!  Use the compatibility flags in such a way as to prevent the
> >   "old" kernel from mounting this filesystem at all.  To mount this fs
> >   under an "old" kernel the user will need to run some tool which will
> > 
> >   - read the unwritten blocks file
> > 
> >   - for each set-bit in the unwritten blocks file, zero out the
> > corresponding block
> > 
> >   - zero out the unwritten blocks file
> > 
> >   - rewrite the superblock to indicate that this fs may now be mounted
> > by an "old" kernel.
> > 
> >   Sound sane?
> > 
> > - I'm assuming that there are more reserved inodes available, and that
> >   the changes to tune2fs and mke2fs will be basically a copy-n-paste job
> >   from the `tune2fs -j' code.  Correct?
> > 
> > - I haven't thought about what fsck changes would be needed.
> > 
> > >   Presumably quite a few.  For example, fsck should check that set-bits
> > >   in the unwritten blocks file do not correspond to freed blocks.  If they
> > >   do, that should be fixed up.
> > > 
> > >   And fsck can check each inode's number-of-unwritten-blocks counter
> > >   against the unwritten blocks file (if we implement the per-inode
> > >   number-of-unwritten-blocks counter)
> > 
> >   What else should fsck do?
> > 
> > - I haven't thought about the implications of porting this into ext3/4. 
> >   Probably the commit to the unwritten blocks file will need to be atomic
> >   with the commit to the user's file's metadata, so the unwritten-blocks
> >   file will effectively need to be in journalled-data mode.
> > 
> >   Or, more likely, we access the unwritten blocks file via the blockdev
> >   pagecache (ie: use bmap, like the journal file) and then we're just
> >   talking direct to the disk's blocks and it becomes just more fs metadata.
> > 
> > - I guess resize2fs will need to be taught about the unwritten blocks
> >   file: to shrink and grow it appropriately.
> > 
> > 
> > That's all I can think of for now - I probably missed something. 
> > 
> > > Suggestions and thoughts are sought, please.
> > 
> > 
> 
> Another approach we have been thinking  is using a backing
> inode
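
The "unwritten block file" plan quoted above can be sketched as a toy userspace model. Everything here is hypothetical (the tiny sizes, struct toy_fs, struct toy_file, and the toy_* functions); it only illustrates the set-on-fallocate, clear-on-write, zero-on-read rules and the per-inode counter used to skip the bitmap lookup.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NBLOCKS    64   /* hypothetical tiny "partition" */
#define TOY_BLOCK_SIZE 16   /* hypothetical tiny block size */

struct toy_fs {
	uint8_t unwritten[TOY_NBLOCKS / 8];         /* one bit per block */
	uint8_t disk[TOY_NBLOCKS][TOY_BLOCK_SIZE];  /* block contents, possibly stale */
};

struct toy_file {
	unsigned nr_unwritten;  /* per-inode count of unwritten blocks */
};

static int toy_test_bit(const struct toy_fs *fs, unsigned blk)
{
	return (fs->unwritten[blk / 8] >> (blk % 8)) & 1;
}

/* fallocate(): hand the block to the file and mark it uninitialised. */
void toy_fallocate(struct toy_fs *fs, struct toy_file *f, unsigned blk)
{
	assert(blk < TOY_NBLOCKS);
	if (!toy_test_bit(fs, blk)) {
		fs->unwritten[blk / 8] |= (uint8_t)(1u << (blk % 8));
		f->nr_unwritten++;
	}
}

/* write(): store the data and clear the block's unwritten bit. */
void toy_write(struct toy_fs *fs, struct toy_file *f, unsigned blk,
	       const uint8_t *data)
{
	assert(blk < TOY_NBLOCKS);
	memcpy(fs->disk[blk], data, TOY_BLOCK_SIZE);
	if (toy_test_bit(fs, blk)) {
		fs->unwritten[blk / 8] &= (uint8_t)~(1u << (blk % 8));
		f->nr_unwritten--;
	}
}

/* read(): consult the bitmap only if the file has unwritten blocks at all. */
void toy_read(const struct toy_fs *fs, const struct toy_file *f, unsigned blk,
	      uint8_t *out)
{
	assert(blk < TOY_NBLOCKS);
	if (f->nr_unwritten && toy_test_bit(fs, blk))
		memset(out, 0, TOY_BLOCK_SIZE);  /* hide stale disk contents */
	else
		memcpy(out, fs->disk[blk], TOY_BLOCK_SIZE);
}
```

The model also shows why the "old kernel" scenario is a problem: a writer that skips the bit-clearing step in toy_write() leaves the bit set, and the next toy_read() returns zeros instead of the data that was written.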

Re: [RFC] BIG_BG vs extended META_BG in ext4

2007-07-02 Thread Mingming Cao
On Mon, 2007-07-02 at 11:49 -0400, Theodore Tso wrote:
> On Sun, Jul 01, 2007 at 09:48:33AM -0500, Jose R. Santos wrote:
> > Is your concern due to being unable to find contiguous block in the
> > case that a bad disk area is in one of the bitmap blocks?  One thing we
> > can do is try to search for another set of contiguous blocks and if we
> > fail to find one, we can flag the block group and move to an indirect
> > block approach to allocating the bitmaps.  At this point, we do lose
> > some of the performance benefits of BIG_BG, but we would still be able
> > to use the block group.
> 
> Yes, my concern is what we might need to do if for some reason e2fsck
> needs to reallocate the bitmap blocks.  I don't think an indirect
> block scheme is the right approach, though; we're adding a lot of
> complexity for a case that probably wouldn't be used but very, very
> rarely.
> 
> My proposal (as we discussed in the call) is to implement BIG_BG as
> meaning the following:
> 
>   1) Implementations must understand and use the s_desc_size
>   superblock field to determine whether block group descriptors
>   are the old 32 bytes or the newer 64 bytes format.  
>   
>   2) Implementations must support the newer ext4_group_desc
>   format in particular to support bg_free_blocks_count_hi and
>   bg_free_inodes_count_hi
> 
>   3) Implementations will relax constraints on where the
>   superblock, bitmaps, and inode tables for a particular block
>   group will be stored.
>

I agree.

> So with that, we can experiment with what size block groups really
> make sense, versus using the extended metablockgroup idea, or possibly
> doing both.
> 

How about incorporating some of the chunkfs ideas into BIG_BG or
extended metablockgroups? The original block group size (128MB) is
probably too small and would result in many continuation inodes. By
enlarging the groups via BIG_BG or extended metablockgroups, we
could add a dirty/clean bit to allow partial/parallel fsck, and things
like that. Any thoughts on this?
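
As a reminder of where the 128MB figure comes from: with one bitmap block per group, a group covers as many blocks as the bitmap block has bits. The sketch below assumes a 4 KiB block size; the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Blocks covered by a single bitmap block: one bit per block. */
uint64_t toy_blocks_per_group(uint64_t block_size)
{
	assert(block_size > 0);
	return 8 * block_size;
}

/* Resulting group size in bytes. */
uint64_t toy_group_bytes(uint64_t block_size)
{
	return toy_blocks_per_group(block_size) * block_size;
}
```

For 4096-byte blocks this gives 32768 blocks per group, or 128 MiB.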


Mingming




Re: [RFC] BIG_BG vs extended META_BG in ext4

2007-07-02 Thread Theodore Tso
On Sun, Jul 01, 2007 at 09:48:33AM -0500, Jose R. Santos wrote:
> Is your concern due to being unable to find contiguous block in the
> case that a bad disk area is in one of the bitmap blocks?  One thing we
> can do is try to search for another set of contiguous blocks and if we
> fail to find one, we can flag the block group and move to an indirect
> block approach to allocating the bitmaps.  At this point, we do lose
> some of the performance benefits of BIG_BG, but we would still be able
> to use the block group.

Yes, my concern is what we might need to do if for some reason e2fsck
needs to reallocate the bitmap blocks.  I don't think an indirect
block scheme is the right approach, though; we're adding a lot of
complexity for a case that probably wouldn't be used but very, very
rarely.

My proposal (as we discussed in the call) is to implement BIG_BG as
meaning the following:

1) Implementations must understand and use the s_desc_size
superblock field to determine whether block group descriptors
are the old 32 bytes or the newer 64 bytes format.  

2) Implementations must support the newer ext4_group_desc
format in particular to support bg_free_blocks_count_hi and
bg_free_inodes_count_hi

3) Implementations will relax constraints on where the
superblock, bitmaps, and inode tables for a particular block
group will be stored.
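
Points 1 and 2 could look roughly like the sketch below. This is only an illustration under assumptions: the rule that an s_desc_size below 64 means the old 32-byte format, and the helper names, are made up here; only the field names s_desc_size and bg_free_blocks_count_hi come from the proposal.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MIN_DESC_SIZE 32   /* old descriptor format */
#define TOY_BIG_DESC_SIZE 64   /* newer descriptor format */

/* Pick the descriptor size from the superblock field (assumed rule). */
unsigned toy_desc_size(uint16_t s_desc_size)
{
	return s_desc_size >= TOY_BIG_DESC_SIZE ? TOY_BIG_DESC_SIZE
						: TOY_MIN_DESC_SIZE;
}

/*
 * Combine the low and high 16-bit halves of the free blocks count.
 * The high half (bg_free_blocks_count_hi) only exists in the 64-byte format.
 */
uint32_t toy_free_blocks(uint16_t lo, uint16_t hi, unsigned desc_size)
{
	uint32_t count = lo;

	assert(desc_size >= TOY_MIN_DESC_SIZE);
	if (desc_size >= TOY_BIG_DESC_SIZE)
		count |= (uint32_t)hi << 16;
	return count;
}
```

The same pattern would apply to bg_free_inodes_count_hi.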

So with that, we can experiment with what size block groups really
make sense, versus using the extended metablockgroup idea, or possibly
doing both.

- Ted


Re: [RFC] BIG_BG vs extended META_BG in ext4

2007-07-02 Thread Jose R. Santos
On Sun, 1 Jul 2007 12:31:53 -0400
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> On Jun 30, 2007  23:40 -0500, Jose R. Santos wrote:
> > Yes, I think bigger block groups will benefit extents a great deal
> > since not only can we have larger extents, but I believe that as the
> > filesystem ages the chances of getting a large number of contiguous
> > blocks are reduced with small block groups.
> 
> This turns out not to be true, and in fact we need to change the
> unwritten extents patch a tiny bit.  The reason is that we have
> limited the maximum extent size to 2^15-1 = 32767 blocks.  The
> current maximum for the number of blocks in a group is 65528, so that
> we can always fit the "free blocks" count into a __u16 if the bitmaps
> and inode table are moved out of the group.  Moving the bitmaps and
> itable will hit the max extent length.

I missed this while looking at the extent code.  I thought that the
extent limit was caused by being unable to allocate enough contiguous
blocks due to the small block groups.

Are there no plans to support very large extents?  It seems like this
would be a good reason to support either BIG_BG or xMETA_BG.  Aside
from some possible alignment issues with the structure, what else would
keep ee_len from being larger?
 
> There are still other benefits to moving the metadata together.
> 
> Now, the one minor problem with the unwritten extent patches is that
> by using the high bit of the ee_len this limits the extent length to
> 2^15-1 blocks, but it would be MUCH better if this limit was 2^15
> blocks and it fit evenly into an empty group, consecutive extents
> were aligned, etc. It also doesn't make sense to have an
> uninitialized 0-length extent, so I think the unwritten extent
> (fallocate) patch needs to special case the ee_len = 32768 to be a
> "regular" extent instead of "unwritten"
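
A small sketch of that special case, assuming a 16-bit ee_len whose high bit is the unwritten flag. The one raw value with the flag set and no length bits is 0x8000 (32768); the sketch treats it as a regular extent of 32768 blocks rather than an unwritten extent of length zero. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_UNWRITTEN_BIT 0x8000u   /* high bit of the 16-bit ee_len */

/* Raw values above 0x8000 are unwritten; 0x8000 itself stays "regular". */
bool toy_extent_is_unwritten(uint16_t ee_len)
{
	return ee_len > TOY_UNWRITTEN_BIT;
}

/* Number of blocks the extent covers, with the flag bit stripped. */
uint32_t toy_extent_blocks(uint16_t ee_len)
{
	return ee_len <= TOY_UNWRITTEN_BIT ? ee_len
					   : (uint32_t)(ee_len - TOY_UNWRITTEN_BIT);
}

/* Build the raw field; regular extents may be one block longer. */
uint16_t toy_extent_encode(uint32_t blocks, bool unwritten)
{
	assert(blocks >= 1 && blocks <= (unwritten ? 32767u : 32768u));
	return (uint16_t)(unwritten ? (blocks | TOY_UNWRITTEN_BIT) : blocks);
}
```

With this encoding a regular extent can cover 32768 blocks, which divides evenly into a block group, while an unwritten extent tops out at 32767.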
> 
> > > With less groups, we load less group descriptors in memory, we
> > > have less I/O to read bitmap and inode array (because we manage
> > > less group descriptors again, because we load bigger bitmap and
> > > array in one time)
> > 
> > Presumably, we would still need to access the same amount data but
> > latencies should be reduce since we could do larger IO's and less
> > seeks to read the bitmaps.  I also wonder if there are benefits in
> > terms of locality to having the bitmaps closer to its blocks vs
> > having them far away like in xMETA_BG.
> 
> Having the bitmaps together will fix this independent of "BIG_BG".

I was referring to the locality of the block bitmaps and the actual free
blocks.  If we move the block bitmaps out of the block group, wouldn't we
be promoting larger seeks on operations that heavily write to both the
bitmaps and the blocks?

This would not be a problem for the inode bitmaps and itables since those
would be moved together in xMETA_BG.

Thanks

-JRS


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-02 Thread Amit K. Arora
On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote:
> On Sat, Jun 30, 2007 at 11:21:11AM +0100, Christoph Hellwig wrote:
> > On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
> > > > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > > > error) is hit?  Does it keep the current fallocate() or does it free it?
> > > 
> > > > Currently it is left to the file system implementation. In ext4, we do
> > > > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > > > end up with partial (pre)allocation. This is in line with dd and
> > > > posix_fallocate, which also do not free the partially allocated space.
> > 
> > > I can't find anything in the specification of posix_fallocate
> > > (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
> > > that tells what should happen to allocated blocks on error.
> 
> Yeah, and AFAICT glibc leaves them behind ATM.

Yes, it does.
 
> > But common sense would be to not leak disk space on failure of this
> > syscall, and this definitively should not be left up to the filesystem,
> > either we always leak it or always free it, and I'd strongly favour
> > the latter variant.

I would not call it a "leak", since the blocks which got allocated as
part of the partial success of the fallocate syscall can be strictly
accounted for (i.e. they are assigned to a particular inode). And these
can be freed by the application, using a suitable @mode of fallocate.
 
> We can't simply walk the range and remove unwritten extents, as some
> of them may have been present before the fallocate() call. That
> makes it extremely difficult to undo a failed call and not remove
> more pre-existing pre-allocations.

The same is true for ext4. It is very difficult to keep track of which
uninitialized (unwritten) extents got allocated as part of the current
syscall. This is because, as David mentions, some of them might already
be present, and also because some of the older ones may have been
merged with the *new* uninitialized/unwritten extents as part of the
current syscall.
 
> Given the current behaviour for posix_fallocate() in glibc, I think
> that retaining the same error semantic and punting the cleanup to
> userspace (where the app will fail with ENOSPC anyway) is the only
> sane thing we can do here. Trying to undo this in the kernel leads
> to lots of extra rarely used code in error handling paths...

Right. This gives applications a free hand to either use the partially
preallocated space or free it, without introducing additional
complexity in the kernel.
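
From the application side, that choice might look like the sketch below. It uses posix_fallocate() and, as a stand-in for the deallocation mode still under discussion, ftruncate() back to the previous size; that only gives back space past the old end of file, so it is an illustration rather than a full undo. The toy_prealloc() name and its free_on_error parameter are made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try to preallocate [offset, offset + len). On failure, the caller chooses
 * whether to keep whatever was allocated or trim the file back to its
 * previous size. Returns 0 on success or an error number.
 */
int toy_prealloc(int fd, off_t offset, off_t len, int free_on_error)
{
	struct stat before;
	int err;

	assert(fd >= 0);
	if (fstat(fd, &before) != 0)
		return errno;

	/* posix_fallocate() returns the error number directly, not via errno. */
	err = posix_fallocate(fd, offset, len);
	if (err != 0 && free_on_error) {
		/* Only undoes growth past the old EOF; blocks inside it stay. */
		if (ftruncate(fd, before.st_size) != 0) {
			/* Keep reporting the original allocation error. */
		}
	}
	return err;
}
```

An application that can make use of a partial allocation would pass free_on_error = 0 and carry on with whatever space it got.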

--
Regards,
Amit Arora