Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-08 Thread Amit K. Arora
On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote:
 On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
  On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
   On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] 
   wrote:
 
+int ext4_fallocate(struct inode *inode, int mode, loff_t offset, 
loff_t len)
+{
+   handle_t *handle;
+   ext4_fsblk_t block, max_blocks;
+   int ret, ret2, nblocks = 0, retries = 0;
+   struct buffer_head map_bh;
+   unsigned int credits, blkbits = inode-i_blkbits;
+
+   /* Currently supporting (pre)allocate mode _only_ */
+   if (mode != FA_ALLOCATE)
+   return -EOPNOTSUPP;
+
+   if (!(EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL))
+   return -ENOTTY;
   
   So we don't implement fallocate on bitmap-based files!  Well that's huge
   news.  The changelog would be an appropriate place to communicate this,
   along with reasons why, or a description of the plan to fix it.
  
  Ok. Will add this in the function description as well.
  
   Also, posix says nothing about fallocate() returning ENOTTY.
  
  Right. I don't seem to find any suitable error from posix description.
  Can you please suggest an error code which might make more sense here ?
  Will -ENOTSUPP be ok ? Since we want to say here that we don't support
  non-extent files.
 
 Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
 here, and fall back to the current library code to do preallocation?
 This way, the caller of fallocate() will never see this return code, so
 it won't violate posix.

You are right.

But, we still need to standardize (and limit) the error codes
which we should return from kernel when we want to fall back on the
library implementation. The posix_fallocate() library function will have
to look for a set of errors from fallocate() system call, upon receiving
which it will do preallocation from user level; or else, it will return
success/error-code returned by the system call to the user.

I think we can make it fall back to library implementation of fallocate,
whenever posix_fallocate() receives any of the following errors from
fallocate() system call:

1. ENOSYS
2. EOPNOTSUPP
3. ENOTTY(?)

Now the question is - should we limit the set of errors for this purpose
to just 1  2 above ? In that case I will need to change the error being
returned here to -EOPNOTSUPP (from current -ENOTTY).

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-08 Thread Dave Kleikamp
On Tue, 2007-05-08 at 16:22 +0530, Amit K. Arora wrote:
 On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote:
  On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
   On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:

So we don't implement fallocate on bitmap-based files!  Well that's huge
news.  The changelog would be an appropriate place to communicate this,
along with reasons why, or a description of the plan to fix it.
   
   Ok. Will add this in the function description as well.
   
Also, posix says nothing about fallocate() returning ENOTTY.
   
   Right. I don't seem to find any suitable error from posix description.
   Can you please suggest an error code which might make more sense here ?
   Will -ENOTSUPP be ok ? Since we want to say here that we don't support
   non-extent files.
  
  Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
  here, and fall back to the current library code to do preallocation?
  This way, the caller of fallocate() will never see this return code, so
  it won't violate posix.
 
 You are right.
 
 But, we still need to standardize (and limit) the error codes
 which we should return from kernel when we want to fall back on the
 library implementation. The posix_fallocate() library function will have
 to look for a set of errors from fallocate() system call, upon receiving
 which it will do preallocation from user level; or else, it will return
 success/error-code returned by the system call to the user.
 
 I think we can make it fall back to library implementation of fallocate,
 whenever posix_fallocate() receives any of the following errors from
 fallocate() system call:
 
 1. ENOSYS
 2. EOPNOTSUPP
 3. ENOTTY(?)
 
 Now the question is - should we limit the set of errors for this purpose
 to just 1  2 above ? In that case I will need to change the error being
 returned here to -EOPNOTSUPP (from current -ENOTTY).

If you want my opinion, -EOPNOTSUPP is better than -ENOTTY.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-08 Thread Mingming Cao
On Mon, 2007-05-07 at 21:43 -0400, Theodore Tso wrote:
 On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
  We could check the total number of fs free blocks account before
  preallocation happens, if there isn't enough space left, there is no
  need to bother preallocating.
 
 Checking against the fs free blocks is a good idea, since it will
 prevent the obvious error case where someone tries to preallocate 10GB
 when there is only 2GB left.
Think it again, this check is useful when preallocate blocks at EOF.
It's not much useful is preallocating a range with holes. In that case
2GB space might be enough if the application tries to preallocate a
10GB.

   But it won't help if there are multiple
 processes trying to allocate blocks the same time.  On the other hand,
 that case is probably relatively rare, and in that case, the
 filesystem was probably going to be left completely full in any case.

 On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
  Userspace could presumably repair the mess in most situations by truncating
  the file back again.  The kernel cannot do that because there might be live
  data in amongst there.
 
 Actually, the kernel could do it, in that could simply release all
 unitialized extents back to the system.  The problem is distinguishing
 between the unitialized extents that had just been newly added, versus
 the ones that had there from before.

True, the new uninitialized extents can be merged to the near old
uninitialized extents, there is no way to distinguish the just added
unintialized extents from the merged one.

   (On the other hand, if the
 filesystem was completely full, releasing unitialized blocks wouldn't
 be the worse thing in the world to do, although releasing previously
 fallocated blocks probably does violate the princple of least
 surprise, even if it's what the user would have wanted.)
 
 On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
  If there is enough free space, we could make a reservation window that
  have at least N free blocks and mark it not stealable by other files. So
  later we will not run into the ENOSPC error.
 
 Could you really use a single reservation window?  When the filesystem
 is almost full, the free extents are likely going to be scattered all
 over the disk.  The general principle of grabbing all of the extents
 and keeping them in an in-memory data structure, and only adding them
 to the extent tree would work, though; I'm just not sure we could do
 it using the existing reservation window code, since it only supports
 a single reservation window per file, yes?
 
You are right.  One reservation window per file and there is limit to
the maximum window size). So yeah this way it's not going to prevent
ENOSPC for sure:(

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Amit K. Arora
On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  This patch has the ext4 implemtation of fallocate system call.
  
  ...
 
  +   /* ext4_can_extents_be_merged should have checked that either
  +* both extents are uninitialized, or both aren't. Thus we
  +* need to check only one of them here.
  +*/
 
 Please always format multiline comments like this:
 
   /*
* ext4_can_extents_be_merged should have checked that either
* both extents are uninitialized, or both aren't. Thus we
* need to check only one of them here.
*/

Ok.
 
  ...
 
  +/*
  + * ext4_fallocate:
  + * preallocate space for a file
  + * mode is for future use, e.g. for unallocating preallocated blocks etc.
  + */
 
 This description is rather thin.  What is the filesystem's actual behaviour
 here?  If the file is using extents then the implementation will do
 something.  If the file is using bitmaps then we will do something else.
 
 But what?   Here is where it should be described.

Ok. Will expand the description.
 
  +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
  len)
  +{
  +   handle_t *handle;
  +   ext4_fsblk_t block, max_blocks;
  +   int ret, ret2, nblocks = 0, retries = 0;
  +   struct buffer_head map_bh;
  +   unsigned int credits, blkbits = inode-i_blkbits;
  +
  +   /* Currently supporting (pre)allocate mode _only_ */
  +   if (mode != FA_ALLOCATE)
  +   return -EOPNOTSUPP;
  +
  +   if (!(EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL))
  +   return -ENOTTY;
 
 So we don't implement fallocate on bitmap-based files!  Well that's huge
 news.  The changelog would be an appropriate place to communicate this,
 along with reasons why, or a description of the plan to fix it.

Ok. Will add this in the function description as well.
 
 Also, posix says nothing about fallocate() returning ENOTTY.

Right. I don't seem to find any suitable error from posix description.
Can you please suggest an error code which might make more sense here ?
Will -ENOTSUPP be ok ? Since we want to say here that we don't support
non-extent files.
 
  +   block = offset  blkbits;
  +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits)  blkbits)
  +- block;
  +   mutex_lock(EXT4_I(inode)-truncate_mutex);
  +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
  +   mutex_unlock(EXT4_I(inode)-truncate_mutex);
 
 Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
 space, and that this disk space will require an arbitrary amount of
 metadata, how can we work out how much journal space we'll be needing
 without at least looking at `len'?

You are right to say that the credits can not be fixed here. But, 'len'
will not directly tell us how many extents might need to be inserted and
how many block groups (if any - think about the segment range already
being allocated case) the allocation request might touch.
One solution I have thought is to check the buffer credits after a call to
ext4_ext_get_blocks (in the while loop) and do a journal_extend, if the
credits are falling short. Incase journal_extend fails, we call
journal_restart. This will automatically take care of how much journal
space we might need for any value of len.
 
  +   handle=ext4_journal_start(inode, credits +
 
 Please always put spaces around =A
Ok.
 
  +   EXT4_DATA_TRANS_BLOCKS(inode-i_sb)+1);
 
 And around +
Ok.
 
  +   if (IS_ERR(handle))
  +   return PTR_ERR(handle);
  +retry:
  +   ret = 0;
  +   while (ret = 0  ret  max_blocks) {
  +   block = block + ret;
  +   max_blocks = max_blocks - ret;
  +   ret = ext4_ext_get_blocks(handle, inode, block,
  + max_blocks, map_bh,
  + EXT4_CREATE_UNINITIALIZED_EXT, 0);
  +   BUG_ON(!ret);
 
 BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
 ext4_error() would be safer and more useful here.

Ok. Will do that.
 
  +   if (ret  0  test_bit(BH_New, map_bh.b_state)
 
 Use buffer_new() here.   A separate patch which fixes the three existing
 instances of open-coded BH_foo usage would be appreciated.

Ok.
 
  +((block + ret)  (i_size_read(inode)  blkbits)))
 
 Check for wrap though the sign bit and through zero please.
Ok.
 
  +   nblocks = nblocks + ret;
  +   }
  +
  +   if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, retries))
  +   goto retry;
  +
  +   /* Time to update the file size.
  +* Update only when preallocation was requested beyond the file size.
  +*/
 
 Fix comment layout.
Ok.
 
  +   if ((offset + len)  i_size_read(inode)) {
 
 Both the lhs and the rhs here are signed.  Please review for possible
 overflows through 

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Dave Kleikamp
On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
 On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
  On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] 
  wrote:

   +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
   len)
   +{
   + handle_t *handle;
   + ext4_fsblk_t block, max_blocks;
   + int ret, ret2, nblocks = 0, retries = 0;
   + struct buffer_head map_bh;
   + unsigned int credits, blkbits = inode-i_blkbits;
   +
   + /* Currently supporting (pre)allocate mode _only_ */
   + if (mode != FA_ALLOCATE)
   + return -EOPNOTSUPP;
   +
   + if (!(EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL))
   + return -ENOTTY;
  
  So we don't implement fallocate on bitmap-based files!  Well that's huge
  news.  The changelog would be an appropriate place to communicate this,
  along with reasons why, or a description of the plan to fix it.
 
 Ok. Will add this in the function description as well.
 
  Also, posix says nothing about fallocate() returning ENOTTY.
 
 Right. I don't seem to find any suitable error from posix description.
 Can you please suggest an error code which might make more sense here ?
 Will -ENOTSUPP be ok ? Since we want to say here that we don't support
 non-extent files.

Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
here, and fall back to the current library code to do preallocation?
This way, the caller of fallocate() will never see this return code, so
it won't violate posix.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 03, 2007  21:31 -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
  + * ext4_fallocate:
  + * preallocate space for a file
  + * mode is for future use, e.g. for unallocating preallocated blocks etc.
  + */
 
 This description is rather thin.  What is the filesystem's actual behaviour
 here?  If the file is using extents then the implementation will do
 something.  If the file is using bitmaps then we will do something else.
 
 But what?   Here is where it should be described.

My understanding is that glibc will handle zero-filling of files for
filesystems that do not support fallocate().

  +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
  len)
  +{
  +   handle_t *handle;
  +   ext4_fsblk_t block, max_blocks;
  +   int ret, ret2, nblocks = 0, retries = 0;
  +   struct buffer_head map_bh;
  +   unsigned int credits, blkbits = inode-i_blkbits;
  +
  +   /* Currently supporting (pre)allocate mode _only_ */
  +   if (mode != FA_ALLOCATE)
  +   return -EOPNOTSUPP;
  +
  +   if (!(EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL))
  +   return -ENOTTY;
 
 So we don't implement fallocate on bitmap-based files!  Well that's huge
 news.  The changelog would be an appropriate place to communicate this,
 along with reasons why, or a description of the plan to fix it.
 
 Also, posix says nothing about fallocate() returning ENOTTY.

I _think_ this is to convince glibc to do the zero-filling in userspace,
but I'm not up on the API specifics.

  +   block = offset  blkbits;
  +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits)  blkbits)
  +- block;
  +   mutex_lock(EXT4_I(inode)-truncate_mutex);
  +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
  +   mutex_unlock(EXT4_I(inode)-truncate_mutex);
 
 Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
 space, and that this disk space will require an arbitrary amount of
 metadata, how can we work out how much journal space we'll be needing
 without at least looking at `len'?

Good question.

The uninitialized extent can cover up to 128MB with a single entry.
If @path isn't specified, then ext4_ext_calc_credits_for_insert()
function returns the maximum number of extents needed to insert a leaf,
including splitting all of the index blocks.  That would allow up to 43GB
(340 extents/block * 128MB) to be preallocated, but it still needs to take
the size of the preallocation into account (adding 3 blocks per 43GB - a
leaf block, a bitmap block and a group descriptor).

Also, since @path is not being given then truncate_mutex is not needed.

  +   ret = ext4_ext_get_blocks(handle, inode, block,
  + max_blocks, map_bh,
  + EXT4_CREATE_UNINITIALIZED_EXT, 0);
  +   BUG_ON(!ret);
 
 BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
 ext4_error() would be safer and more useful here.

Ouch, not very friendly error handling.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 05:37:54 -0600
Andreas Dilger [EMAIL PROTECTED] wrote:

   + block = offset  blkbits;
   + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits)  blkbits)
   +  - block;
   + mutex_lock(EXT4_I(inode)-truncate_mutex);
   + credits = ext4_ext_calc_credits_for_insert(inode, NULL);
   + mutex_unlock(EXT4_I(inode)-truncate_mutex);
  
  Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
  space, and that this disk space will require an arbitrary amount of
  metadata, how can we work out how much journal space we'll be needing
  without at least looking at `len'?
 
 Good question.
 
 The uninitialized extent can cover up to 128MB with a single entry.
 If @path isn't specified, then ext4_ext_calc_credits_for_insert()
 function returns the maximum number of extents needed to insert a leaf,
 including splitting all of the index blocks.  That would allow up to 43GB
 (340 extents/block * 128MB) to be preallocated, but it still needs to take
 the size of the preallocation into account (adding 3 blocks per 43GB - a
 leaf block, a bitmap block and a group descriptor).

I think the use of ext4_journal_extend() (as Amit has proposed) will help
here, but it is not sufficient.

Because under some circumstances, a journal_extend() failure could mean
that we fail to allocate all the required disk space.  If it is infrequent
enough, that is acceptable when the caller is using fallocate() for
performance reasons.

But it is very much not acceptable if the caller is using fallocate() for
space-reservation reasons.  If you used fallocate to reserve 1GB of disk
and fallocate() succeeded and you later get ENOSPC then you'd have a
right to get a bit upset.

So I think the ext3/4 fallocate() implementation will need to be
implemented as a loop: 

while (len) {
journal_start();
len -= do_fallocate(len, ...);
journal_stop();
}


Now the interesting question is: what do we do if we get halfway through
this loop and then run out of space?  We could leave the disk all filled up
and then return failure to the caller, but that's pretty poor behaviour,
IMO.



Does the proposed implementation handle quotas correctly, btw?  Has that
been tested?


Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only.  I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 15:21:04 -0700
Andreas Dilger [EMAIL PROTECTED] wrote:

 On May 07, 2007  13:58 -0700, Andrew Morton wrote:
  Final point: it's fairly disappointing that the present implementation is
  ext4-only, and extent-only.  I do think we should be aiming at an ext4
  bitmap-based implementation and an ext3 implementation.
 
 Actually, this is a non-issue.  The reason that it is handled for extent-only
 is that this is the only way to allocate space in the filesystem without
 doing the explicit zeroing.  For other filesystems (including ext3 and
 ext4 with block-mapped files) the filesystem should return an error (e.g.
 -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

hrm, spose so.

It can be a bit suboptimal from the layout POV.  The reservations code will
largely save us here, but kernel support might make it a bit better.

Totally blowing pagecache could be a problem.  Fixable in userspace by
using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
  Actually, this is a non-issue.  The reason that it is handled for 
  extent-only
  is that this is the only way to allocate space in the filesystem without
  doing the explicit zeroing.  For other filesystems (including ext3 and
  ext4 with block-mapped files) the filesystem should return an error (e.g.
  -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.
 
 It can be a bit suboptimal from the layout POV.  The reservations code will
 largely save us here, but kernel support might make it a bit better.

Actually, the reservations code won't matter, since glibc will fall
back to its current behavior, which is it will do the preallocation by
explicitly writing zeros to the file.  This wlil result in the same
layout as if we had done the persistent preallocation, but of course
it will mean the posix_fallocate() could potentially take a long time
if you're a PVR and you're reserving a gig or two for a two hour movie
at high quality.  That seems suboptimal, granted, and ideally the
application should be warned about this before it calls
posix_fallocate().  On the other hand, it's what happens today, all
the time, so applications won't be too badly surprised.  

If we think applications programmers badly need to know in advance if
posix_fallocate() will be fast or slow, probably the right thing is to
define a new fpathconf() configuration option so they can query to see
whether a particular file will support a fast posix_fallocate().  I'm
not 100% convinced such complexity is really needed, but I'm willing
to be convinced  what do folks think?

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote:
 Andreas Dilger wrote:
 On May 07, 2007  13:58 -0700, Andrew Morton wrote:
 Final point: it's fairly disappointing that the present implementation is
 ext4-only, and extent-only.  I do think we should be aiming at an ext4
 bitmap-based implementation and an ext3 implementation.
 
 Actually, this is a non-issue.  The reason that it is handled for 
 extent-only
 is that this is the only way to allocate space in the filesystem without
 doing the explicit zeroing.  For other filesystems (including ext3 and
 
 Precisely /how/ do you avoid the zeroing issue, for extents?
 
 If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
 otherwise the implementation is broken.

There is a bit in the extent structure which indicates that the extent
has not been initialized.  When reading from a block where the extent
is marked as unitialized, ext4 returns zero's, to avoid returning the
uninitalized contents of the disk, which might contain someone else's
love letters, p0rn, or other information which we shouldn't leak out.
When writing to an extent which is uninitalized, we may potentially
have to split the extent into three extents in the worst case.

My understanding is that XFS uses a similar implementation; it's a
pretty obvious and standard way to implement allocated-but-not-initialized
extents.

We thought about supporting persistent preallocation for inodes using
indirect blocks, but it would require stealing a bit from each entry
in the indirect block, reducing the maximum size of the filesystem by
two (i.e., 2**31 blocks).  It was decided it wasn't worth the
complexity, given the tradeoffs.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao
On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote:
 On Mon, 7 May 2007 05:37:54 -0600
 Andreas Dilger [EMAIL PROTECTED] wrote:
 
+   block = offset  blkbits;
+   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits)  
blkbits)
+- block;
+   mutex_lock(EXT4_I(inode)-truncate_mutex);
+   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
+   mutex_unlock(EXT4_I(inode)-truncate_mutex);
   
   Now I'm mystified.  Given that we're allocating an arbitrary amount of 
   disk
   space, and that this disk space will require an arbitrary amount of
   metadata, how can we work out how much journal space we'll be needing
   without at least looking at `len'?
  
  Good question.
  
  The uninitialized extent can cover up to 128MB with a single entry.
  If @path isn't specified, then ext4_ext_calc_credits_for_insert()
  function returns the maximum number of extents needed to insert a leaf,
  including splitting all of the index blocks.  That would allow up to 43GB
  (340 extents/block * 128MB) to be preallocated, but it still needs to take
  the size of the preallocation into account (adding 3 blocks per 43GB - a
  leaf block, a bitmap block and a group descriptor).
 
 I think the use of ext4_journal_extend() (as Amit has proposed) will help
 here, but it is not sufficient.
 
 Because under some circumstances, a journal_extend() failure could mean
 that we fail to allocate all the required disk space.  If it is infrequent
 enough, that is acceptable when the caller is using fallocate() for
 performance reasons.
 
 But it is very much not acceptable if the caller is using fallocate() for
 space-reservation reasons.  If you used fallocate to reserve 1GB of disk
 and fallocate() succeeded and you later get ENOSPC then you'd have a
 right to get a bit upset.
 
 So I think the ext3/4 fallocate() implementation will need to be
 implemented as a loop: 
 
   while (len) {
   journal_start();
   len -= do_fallocate(len, ...);
   journal_stop();
   }
 
 

I agree.  There is already a loop in Amit's current's patch to call
ext4_ext_get_blocks() thoug. Question is how much credit should ext4 to
ask for in each journal_start()?
 +/*
 + * ext4_fallocate:
 + * preallocate space for a file
 + * mode is for future use, e.g. for unallocating preallocated blocks etc.
 + */
 +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
 +{


 +   mutex_lock(EXT4_I(inode)-truncate_mutex);
 +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
 +   mutex_unlock(EXT4_I(inode)-truncate_mutex);

I think the calculation is based on the assumption that there is only a
single extent to be inserted, which is the ideal case. But in some cases
we may end up allocating several chunk of blocks(extents) for this
single preallocation request when fs is fragmented (or part of
preallocation request is already fulfilled)

I think we should move this calculation inside the loop as well,and we
really do not need to grab the lock to calculate the credit if the @path
is always NULL, all the function does is mathmatics.

I can't think of any good way to estimate the total credits needed for
this whole preallocation request. Looked at ext4_get_block(), which is
used for DIO code to deal with large amount of block allocation. The
credit reservation is quite weak there too. The DIO_CREDIT is only
(EXT4_RESERVE_TRANS_BLOCKS + 32)

 +   handle=ext4_journal_start(inode, credits +
 +   
 EXT4_DATA_TRANS_BLOCKS(inode-i_sb)+1);
 +   if (IS_ERR(handle))
 +   return PTR_ERR(handle);
 +retry:
 +   ret = 0;
 +   while (ret = 0  ret  max_blocks) {
 +   block = block + ret;
 +   max_blocks = max_blocks - ret;
 +   ret = ext4_ext_get_blocks(handle, inode, block,
 + max_blocks, map_bh,
 + EXT4_CREATE_UNINITIALIZED_EXT, 0);
 +   BUG_ON(!ret);
 +   if (ret  0  test_bit(BH_New, map_bh.b_state)
 +((block + ret)  (i_size_read(inode)  blkbits)))
 +   nblocks = nblocks + ret;
 +   }
 +
 +   if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, retries))
 +   goto retry;
 +
 Now the interesting question is: what do we do if we get halfway through
 this loop and then run out of space?  We could leave the disk all filled up
 and then return failure to the caller, but that's pretty poor behaviour,
 IMO.
 
The current code handles earlier ENOSPC by three times retries. After
that if we still run out of space, then it's propably right to notify
the caller there isn't much space left.

We could extend the block reservation window size before the while loop
so we could get a lower chance to get more fragmented.

 
 Does the proposed implementation handle quotas 

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 07 May 2007 17:00:24 -0700
Mingming Cao [EMAIL PROTECTED] wrote:

  +   while (ret = 0  ret  max_blocks) {
  +   block = block + ret;
  +   max_blocks = max_blocks - ret;
  +   ret = ext4_ext_get_blocks(handle, inode, block,
  + max_blocks, map_bh,
  + EXT4_CREATE_UNINITIALIZED_EXT, 0);
  +   BUG_ON(!ret);
  +   if (ret  0  test_bit(BH_New, map_bh.b_state)
  +((block + ret)  (i_size_read(inode)  
  blkbits)))
  +   nblocks = nblocks + ret;
  +   }
  +
  +   if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, 
  retries))
  +   goto retry;
  +
  Now the interesting question is: what do we do if we get halfway through
  this loop and then run out of space?  We could leave the disk all filled up
  and then return failure to the caller, but that's pretty poor behaviour,
  IMO.
  
 The current code handles earlier ENOSPC by three times retries. After
 that if we still run out of space, then it's propably right to notify
 the caller there isn't much space left.
 
 We could extend the block reservation window size before the while loop
 so we could get a lower chance to get more fragmented.

yes, but my point is that the proposed behaviour is really quite bad.

We will attempt to allocate the disk space and then we will return failure,
having consumed all the disk space and having partially and uselessly
populated an unknown amount of the file.

Userspace could presumably repair the mess in most situations by truncating
the file back again.  The kernel cannot do that because there might be live
data in amongst there.

So we'd need to either keep track of which blocks were newly-allocated and
then free them all again on the error path (doesn't work right across
commit+crash+recovery) or we could later use the space-reservation scheme which
delayed allocation will need to introduce.

Or we could decide to live with the above IMO-crappy behaviour.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao
On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote:
 On Mon, 7 May 2007 19:14:42 -0400
 Theodore Tso [EMAIL PROTECTED] wrote:
 
  On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
Actually, this is a non-issue.  The reason that it is handled for 
extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing.  For other filesystems (including ext3 and
ext4 with block-mapped files) the filesystem should return an error 
(e.g.
-EOPNOTSUPP) and glibc will do manual zero-filling of the file in 
userspace.
   
   It can be a bit suboptimal from the layout POV.  The reservations code 
   will
   largely save us here, but kernel support might make it a bit better.
  
  Actually, the reservations code won't matter, since glibc will fall
  back to its current behavior, which is it will do the preallocation by
  explicitly writing zeros to the file.
 
 No!  Reservations code is *critical* here.  Without reservations, we get
 disastrously-bad layout if two processes were running a large fallocate()
 at the same time.  (This is an SMP-only problem, btw: on UP the timeslice
 lengths save us).
 
 My point is that even though reservations save us, we could do even-better
 in-kernel.
 

In this case, since the number of blocks to preallocate (eg. N=10GB) is
clear, we could improve the current reservation code, to allow callers
explicitly ask for a new window that have the minimum N free blocks for
the blocks-to-preallocated(rather than just have at least 1 free
blocks).

Before the ext4_fallocate() is called, the right reservation window size
is set with the flag to indicating please spend time if needed to find
a window covers at least N free blocks.

So for ex4 block mapped files, later when glibc is doing allocation and
zeroing, the ext4 block-mapped allocator will knows to reserve the right
amount of free blocks before allocating and zeroing 10GB space.

I am not sure whether this worth the effort though.

 But then, a smart application would bypass the glibc() fallocate()
 implementation and would tune the reservation window size and would use
 direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).
 
  This wlil result in the same
  layout as if we had done the persistent preallocation, but of course
  it will mean the posix_fallocate() could potentially take a long time
  if you're a PVR and you're reserving a gig or two for a two hour movie
  at high quality.  That seems suboptimal, granted, and ideally the
  application should be warned about this before it calls
  posix_fallocate().  On the other hand, it's what happens today, all
  the time, so applications won't be too badly surprised.
 
 A PVR implementor would take all this over and would do it themselves, for
 sure.
 
  If we think applications programmers badly need to know in advance if
  posix_fallocate() will be fast or slow, probably the right thing is to
  define a new fpathconf() configuration option so they can query to see
  whether a particular file will support a fast posix_fallocate().  I'm
  not 100% convinced such complexity is really needed, but I'm willing
  to be convinced  what do folks think?
  
 
 An application could do sys_fallocate(one-byte) to work out whether it's
 supported in-kernel, I guess.
 

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao
On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote:
 On Mon, 07 May 2007 17:00:24 -0700
 Mingming Cao [EMAIL PROTECTED] wrote:
 
   +   while (ret = 0  ret  max_blocks) {
   +   block = block + ret;
   +   max_blocks = max_blocks - ret;
   +   ret = ext4_ext_get_blocks(handle, inode, block,
   + max_blocks, map_bh,
   + EXT4_CREATE_UNINITIALIZED_EXT, 
   0);
   +   BUG_ON(!ret);
   +   if (ret  0  test_bit(BH_New, map_bh.b_state)
   +((block + ret)  (i_size_read(inode)  
   blkbits)))
   +   nblocks = nblocks + ret;
   +   }
   +
   +   if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, 
   retries))
   +   goto retry;
   +
   Now the interesting question is: what do we do if we get halfway through
   this loop and then run out of space?  We could leave the disk all filled 
   up
   and then return failure to the caller, but that's pretty poor behaviour,
   IMO.
   
  The current code handles earlier ENOSPC by three times retries. After
  that if we still run out of space, then it's propably right to notify
  the caller there isn't much space left.
  
  We could extend the block reservation window size before the while loop
  so we could get a lower chance to get more fragmented.
 
 yes, but my point is that the proposed behaviour is really quite bad.
 
I agree your point, that's why I mention it only helped the
fragmentation issue but not the ENOSPC case.


 We will attempt to allocate the disk space and then we will return failure,
 having consumed all the disk space and having partially and uselessly
 populated an unknown amount of the file.
 

Not totally useless I think. If only half of the space is preallocated
because run out of space, the application can decide whether it's good
enough to start to use this preallocated space or wait for the fs to
have more free space.

 Userspace could presumably repair the mess in most situations by truncating
 the file back again.  The kernel cannot do that because there might be live
 data in amongst there.
 
 So we'd need to either keep track of which blocks were newly-allocated and
 then free them all again on the error path (doesn't work right across
 commit+crash+recovery) or we could later use the space-reservation scheme 
 which
 delayed allocation will need to introduce.
 
 Or we could decide to live with the above IMO-crappy behaviour.

In fact Amit and I had raised this issue before, whether it's okay to do
allow partial preallocation. At that moment the feedback is it's no much
different than the current zero-out-preallocation behavior: people might
preallocating half-way then later deal with ENOSPC.

We could check the total number of fs free blocks account before
preallocation happens, if there isn't enough space left, there is no
need to bother preallocating.

If there is enough free space, we could make a reservation window that
have at least N free blocks and mark it not stealable by other files. So
later we will not run into the ENOSPC error.

The fs free blocks account is just a estimate though.


Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 07, 2007  19:02 -0400, Jeff Garzik wrote:
 Andreas Dilger wrote:
 Actually, this is a non-issue.  The reason that it is handled for 
 extent-only is that this is the only way to allocate space in the
 filesystem without doing the explicit zeroing.
 
 Precisely /how/ do you avoid the zeroing issue, for extents?
 
 If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
 otherwise the implementation is broken.

In ext4 (as in XFS) there is a flag stored in the extent that tells if
the extent is initialized or not.  Reads from uninitialized extents will
return zero-filled data, and writes that don't span the whole extent
will cause the uninitialized extent to be split into a regular extent
and one or two uninitialized extents (depending where the write is).

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/5] ext4: fallocate support in ext4

2007-04-26 Thread Amit K. Arora
This patch has the ext4 implemtation of fallocate system call.

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  201 +++-
 fs/ext4/file.c  |1 
 include/linux/ext4_fs.h |7 +
 include/linux/ext4_fs_extents.h |   13 ++
 4 files changed, 179 insertions(+), 43 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in
} else if (path-p_ext) {
ext_debug(  %d:%d:%llu ,
  le32_to_cpu(path-p_ext-ee_block),
- le16_to_cpu(path-p_ext-ee_len),
+ ext4_ext_get_actual_len(path-p_ext),
  ext_pblock(path-p_ext));
} else
ext_debug(  []);
@@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i  le16_to_cpu(eh-eh_entries); i++, ex++) {
ext_debug(%d:%d:%llu , le32_to_cpu(ex-ee_block),
- le16_to_cpu(ex-ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug(\n);
 }
@@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug(  - %d:%llu:%d ,
le32_to_cpu(path-p_ext-ee_block),
ext_pblock(path-p_ext),
-   le16_to_cpu(path-p_ext-ee_len));
+   ext4_ext_get_actual_len(path-p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug(move %d:%llu:%d in new leaf %llu\n,
le32_to_cpu(path[depth].p_ext-ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext-ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1107,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1-ee_block) + le16_to_cpu(ex1-ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1-ee_block) + ext1_ee_len !=
le32_to_cpu(ex2-ee_block))
return 0;
 
@@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1-ee_len) + le16_to_cpu(ex2-ee_len)  EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len  EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1-ee_len) = 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1-ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int depth, len1;
 
b1 = le32_to_cpu(newext-ee_block);
-   len1 = le16_to_cpu(newext-ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1181,9 +1193,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *ex, *fex;
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
-   int depth, len, err, next;
+   int depth, len, err, next, uninitialized = 0;
 
-   BUG_ON(newext-ee_len == 0);
+   BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);
ex = path[depth].p_ext;
BUG_ON(path[depth].p_hdr == NULL);
@@ -1191,14 +1203,23 @@ int ext4_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex  ext4_can_extents_be_merged(inode, ex, newext)) {
ext_debug(append %d block to %d:%d (from %llu)\n,
-   le16_to_cpu(newext-ee_len),
+   ext4_ext_get_actual_len(newext),
le32_to_cpu(ex-ee_block),
-