date:20070507

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora

Andrew,

Thanks for the review comments!

On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  This patch implements the fallocate() system call and adds support for
  i386, x86_64 and powerpc.
  
  ...
 
  +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
 
 Please add a comment over this function which specifies its behaviour. 
 Really it should be enough material from which a full manpage can be
 written.
 
 If that's all too much, this material should at least be spelled out in the
 changelog.  Because there's no way in which this change can be fully
 reviewed unless someone (ie: you) tells us what it is setting out to
 achieve.
 
 If we 100% implement some standard then a URL for what we claim to
 implement would suffice.  Given that we're at least using different types from
 posix I doubt if such a thing would be sufficient.
 
 And given the complexity and potential variability within the filesystem
 implementations of this, I'd expect that _something_ additional needs to be
 said?

Ok. I will add a detailed comment here.

 
  +{
  +   struct file *file;
  +   struct inode *inode;
  +   long ret = -EINVAL;
  +
  +   if (len == 0 || offset  0)
  +   goto out;
 
 The posix spec implies that negative `len' is permitted - presumably allocate
 ahead of `offset'.  How peculiar.

I think we should go ahead with current glibc implementation (which
Jakub poited at) of not allowing a negative 'len', since posix also
doesn't explicitly say anything about allowing negative 'len'.

 
  +   ret = -EBADF;
  +   file = fget(fd);
  +   if (!file)
  +   goto out;
  +   if (!(file-f_mode  FMODE_WRITE))
  +   goto out_fput;
  +
  +   inode = file-f_path.dentry-d_inode;
  +
  +   ret = -ESPIPE;
  +   if (S_ISFIFO(inode-i_mode))
  +   goto out_fput;
  +
  +   ret = -ENODEV;
  +   if (!S_ISREG(inode-i_mode))
  +   goto out_fput;
 
 So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
 seems a bit silly of them.

True. 
 
  +   ret = -EFBIG;
  +   if (offset + len  inode-i_sb-s_maxbytes)
  +   goto out_fput;
 
 This code does handle offset+len going negative, but only by accident, I
 suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
 here would settle the reader's mind.

Ok. I will add a check here for wrap though zero.
 
  +   if (inode-i_op  inode-i_op-fallocate)
  +   ret = inode-i_op-fallocate(inode, mode, offset, len);
  +   else
  +   ret = -ENOSYS;
 
 If we _are_ going to support negative `len', as posix suggests, I think we
 should perform the appropriate sanity conversions to `offset' and `len'
 right here, rather than expecting each filesystem to do it.
 
 If we're not going to handle negative `len' then we should check for it.

Will add a check for negative 'len' and return -EINVAL. This will be
done where currently we check for negative offset (i.e. at the start of
the function).
 
  +out_fput:
  +   fput(file);
  +out:
  +   return ret;
  +}
  +EXPORT_SYMBOL(sys_fallocate);
 
 I don't believe this needs to be exported to modules?

Ok. Will remove it.
 
  +/*
  + * fallocate() modes
  + */
  +#define FA_ALLOCATE0x1
  +#define FA_DEALLOCATE  0x2
 
 Now those aren't in posix.  They should be documented, along with their
 expected semantics.

Will add a comment describing the role of these modes.
 
   #ifdef __KERNEL__
   
   #include linux/linkage.h
  @@ -1125,6 +1131,7 @@ struct inode_operations {
  ssize_t (*listxattr) (struct dentry *, char *, size_t);
  int (*removexattr) (struct dentry *, const char *);
  void (*truncate_range)(struct inode *, loff_t, loff_t);
  +   long (*fallocate)(struct inode *, int, loff_t, loff_t);
 
 I really do think it's better to put the variable names in definitions such
 as this.  Especially when we have two identically-typed variables next to
 each other like that.  Quick: which one is the offset and which is the
 length?

Ok. Will add the variable names here.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/5] ext4: Extent overlap bugfix

2007-05-07 Thread Amit K. Arora

On Thu, May 03, 2007 at 09:30:02PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:41:01 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  +unsigned int ext4_ext_check_overlap(struct inode *inode,
  +   struct ext4_extent *newext,
  +   struct ext4_ext_path *path)
  +{
  +   unsigned long b1, b2;
  +   unsigned int depth, len1;
  +
  +   b1 = le32_to_cpu(newext-ee_block);
  +   len1 = le16_to_cpu(newext-ee_len);
  +   depth = ext_depth(inode);
  +   if (!path[depth].p_ext)
  +   goto out;
  +   b2 = le32_to_cpu(path[depth].p_ext-ee_block);
  +
  +   /* get the next allocated block if the extent in the path
  +* is before the requested block(s) */
  +   if (b2  b1) {
  +   b2 = ext4_ext_next_allocated_block(path);
  +   if (b2 == EXT_MAX_BLOCK)
  +   goto out;
  +   }
  +
  +   if (b1 + len1  b2) {
 
 Are we sure that b1+len cannot wrap through zero here?

No. Will add a check here for this. Thanks!
 
  +   newext-ee_len = cpu_to_le16(b2 - b1);
  +   return 1;
  +   }


--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Amit K. Arora

On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  This patch has the ext4 implemtation of fallocate system call.
  
  ...
 
  +   /* ext4_can_extents_be_merged should have checked that either
  +* both extents are uninitialized, or both aren't. Thus we
  +* need to check only one of them here.
  +*/
 
 Please always format multiline comments like this:
 
   /*
* ext4_can_extents_be_merged should have checked that either
* both extents are uninitialized, or both aren't. Thus we
* need to check only one of them here.
*/

Ok.
 
  ...
 
  +/*
  + * ext4_fallocate:
  + * preallocate space for a file
  + * mode is for future use, e.g. for unallocating preallocated blocks etc.
  + */
 
 This description is rather thin.  What is the filesystem's actual behaviour
 here?  If the file is using extents then the implementation will do
 something.  If the file is using bitmaps then we will do something else.
 
 But what?   Here is where it should be described.

Ok. Will expand the description.
 
  +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
  len)
  +{
  +   handle_t *handle;
  +   ext4_fsblk_t block, max_blocks;
  +   int ret, ret2, nblocks = 0, retries = 0;
  +   struct buffer_head map_bh;
  +   unsigned int credits, blkbits = inode-i_blkbits;
  +
  +   /* Currently supporting (pre)allocate mode _only_ */
  +   if (mode != FA_ALLOCATE)
  +   return -EOPNOTSUPP;
  +
  +   if (!(EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL))
  +   return -ENOTTY;
 
 So we don't implement fallocate on bitmap-based files!  Well that's huge
 news.  The changelog would be an appropriate place to communicate this,
 along with reasons why, or a description of the plan to fix it.

Ok. Will add this in the function description as well.
 
 Also, posix says nothing about fallocate() returning ENOTTY.

Right. I don't seem to find any suitable error from posix description.
Can you please suggest an error code which might make more sense here ?
Will -ENOTSUPP be ok ? Since we want to say here that we don't support
non-extent files.
 
  +   block = offset  blkbits;
  +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits)  blkbits)
  +- block;
  +   mutex_lock(EXT4_I(inode)-truncate_mutex);
  +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
  +   mutex_unlock(EXT4_I(inode)-truncate_mutex);
 
 Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
 space, and that this disk space will require an arbitrary amount of
 metadata, how can we work out how much journal space we'll be needing
 without at least looking at `len'?

You are right to say that the credits can not be fixed here. But, 'len'
will not directly tell us how many extents might need to be inserted and
how many block groups (if any - think about the segment range already
being allocated case) the allocation request might touch.
One solution I have thought is to check the buffer credits after a call to
ext4_ext_get_blocks (in the while loop) and do a journal_extend, if the
credits are falling short. Incase journal_extend fails, we call
journal_restart. This will automatically take care of how much journal
space we might need for any value of len.
 
  +   handle=ext4_journal_start(inode, credits +
 
 Please always put spaces around =A
Ok.
 
  +   EXT4_DATA_TRANS_BLOCKS(inode-i_sb)+1);
 
 And around +
Ok.
 
  +   if (IS_ERR(handle))
  +   return PTR_ERR(handle);
  +retry:
  +   ret = 0;
  +   while (ret = 0  ret  max_blocks) {
  +   block = block + ret;
  +   max_blocks = max_blocks - ret;
  +   ret = ext4_ext_get_blocks(handle, inode, block,
  + max_blocks, map_bh,
  + EXT4_CREATE_UNINITIALIZED_EXT, 0);
  +   BUG_ON(!ret);
 
 BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
 ext4_error() would be safer and more useful here.

Ok. Will do that.
 
  +   if (ret  0  test_bit(BH_New, map_bh.b_state)
 
 Use buffer_new() here.   A separate patch which fixes the three existing
 instances of open-coded BH_foo usage would be appreciated.

Ok.
 
  +((block + ret)  (i_size_read(inode)  blkbits)))
 
 Check for wrap though the sign bit and through zero please.
Ok.
 
  +   nblocks = nblocks + ret;
  +   }
  +
  +   if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, retries))
  +   goto retry;
  +
  +   /* Time to update the file size.
  +* Update only when preallocation was requested beyond the file size.
  +*/
 
 Fix comment layout.
Ok.
 
  +   if ((offset + len)  i_size_read(inode)) {
 
 Both the lhs and the rhs here are signed.  Please review for possible
 overflows through

Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-05-07 Thread Amit K. Arora

On Thu, May 03, 2007 at 09:32:38PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:46:23 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
  + */
  +int ext4_ext_try_to_merge(struct inode *inode,
  +   struct ext4_ext_path *path,
  +   struct ext4_extent *ex)
  +{
  +   struct ext4_extent_header *eh;
  +   unsigned int depth, len;
  +   int merge_done=0, uninitialized = 0;
 
 space around =, please.
 
 Many people prefer not to do the multiple-definitions-per-line, btw:
 
   int merge_done = 0;
   int uninitialized = 0;

Ok. Will make the change.

 
 reasons:
 
 - If gives you some space for a nice comment
 
 - It makes patches much more readable, and it makes rejects easier to fix
 
 - standardisation.
 
  +   depth = ext_depth(inode);
  +   BUG_ON(path[depth].p_hdr == NULL);
  +   eh = path[depth].p_hdr;
  +
  +   while (ex  EXT_LAST_EXTENT(eh)) {
  +   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
  +   break;
  +   /* merge with next extent! */
  +   if (ext4_ext_is_uninitialized(ex))
  +   uninitialized = 1;
  +   ex-ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
  +   + ext4_ext_get_actual_len(ex + 1));
  +   if (uninitialized)
  +   ext4_ext_mark_uninitialized(ex);
  +
  +   if (ex + 1  EXT_LAST_EXTENT(eh)) {
  +   len = (EXT_LAST_EXTENT(eh) - ex - 1)
  +   * sizeof(struct ext4_extent);
  +   memmove(ex + 1, ex + 2, len);
  +   }
  +   eh-eh_entries = cpu_to_le16(le16_to_cpu(eh-eh_entries)-1);
 
 Kenrel convention is to put spaces around -

Will fix this.

 
  +   merge_done = 1;
  +   BUG_ON(eh-eh_entries == 0);
 
 eek, scary BUG_ON.  Do we really need to be that severe?  Would it be
 better to warn and run ext4_error() here?
Ok.
 
  +   }
  +
  +   return merge_done;
  +}
  +
  +
 
  ...
 
  +/*
  + * ext4_ext_convert_to_initialized:
  + * this function is called by ext4_ext_get_blocks() if someone tries to 
  write
  + * to an uninitialized extent. It may result in splitting the uninitialized
  + * extent into multiple extents (upto three). Atleast one initialized 
  extent
  + * and atmost two uninitialized extents can result.
 
 There are some typos here
 
  + * There are three possibilities:
  + *   a No split required: Entire extent should be initialized.
  + *   b Split into two extents: Only one end of the extent is being 
  written to.
  + *   c Split into three extents: Somone is writing in middle of the 
  extent.
 
 and here
 
Ok. Will fix them.
  + */
  +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
  +   struct ext4_ext_path *path,
  +   ext4_fsblk_t iblock,
  +   unsigned long max_blocks)
  +{
  +   struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
  +   struct ext4_extent_header *eh;
  +   unsigned int allocated, ee_block, ee_len, depth;
  +   ext4_fsblk_t newblock;
  +   int err = 0, ret = 0;
  +
  +   depth = ext_depth(inode);
  +   eh = path[depth].p_hdr;
  +   ex = path[depth].p_ext;
  +   ee_block = le32_to_cpu(ex-ee_block);
  +   ee_len = ext4_ext_get_actual_len(ex);
  +   allocated = ee_len - (iblock - ee_block);
  +   newblock = iblock - ee_block + ext_pblock(ex);
  +   ex2 = ex;
  +
  +   /* ex1: ee_block to iblock - 1 : uninitialized */
  +   if (iblock  ee_block) {
  +   ex1 = ex;
  +   ex1-ee_len = cpu_to_le16(iblock - ee_block);
  +   ext4_ext_mark_uninitialized(ex1);
  +   ex2 = newex;
  +   }
  +   /* for sanity, update the length of the ex2 extent before
  +* we insert ex3, if ex1 is NULL. This is to avoid temporary
  +* overlap of blocks.
  +*/
  +   if (!ex1  allocated  max_blocks)
  +   ex2-ee_len = cpu_to_le16(max_blocks);
  +   /* ex3: to ee_block + ee_len : uninitialised */
  +   if (allocated  max_blocks) {
  +   unsigned int newdepth;
  +   ex3 = newex;
  +   ex3-ee_block = cpu_to_le32(iblock + max_blocks);
  +   ext4_ext_store_pblock(ex3, newblock + max_blocks);
  +   ex3-ee_len = cpu_to_le16(allocated - max_blocks);
  +   ext4_ext_mark_uninitialized(ex3);
  +   err = ext4_ext_insert_extent(handle, inode, path, ex3);
  +   if (err)
  +   goto out;
  +   /* The depth, and hence eh  ex might change
  +* as part of the insert above.
  +*/
  +   newdepth = ext_depth(inode);
  +   if (newdepth != depth)
  +   {
 
 Use
 
   if (newdepth != depth) {

Ok.
 
  +   depth=newdepth;
 
 spaces
Ok.
 
  +   path = ext4_ext_find_extent(inode, iblock, NULL);
  +   if (IS_ERR(path)) {
  +

Re: [PATCH 1/4] Add unix COW io manager

2007-05-07 Thread Theodore Tso

On Fri, May 04, 2007 at 02:43:21PM +0530, Aneesh Kumar K.V wrote:
 + * unix_cow_io.c --- This is the Unix (well, really POSIX) implementation
 + *   of the COW I/O manager.

I really wish you had based this not on unix_io.c, but rather on
test_io.c, for two reasons (a) this reduces code duplication, (b) it
means that we can chain/stack I/O managers.  i.e., test_io -
cow_manager - unix_io.

The one change I would recommend is that instead of assigning to a
globally visible external variable (as I did in test_io, and which I
now regret as a not necessarily being portable --- different shared
library implementations have different restrictions on whether you can
access global variables defined in shared libraries from an
application or vice versa --- with AIX having some of the most peverse
shared library restritions, at least 7-8 years ago when I had the
misfortune of trying to make Kerberos v5 shared libraries work on AIX
:-).  Instead, what I would recommend is creating a function which an
application could call to set the backing I/O manager for the
cow_manager, as well as the filenames to use for the tdb journal.

The second change I would recommend is to NOT call it cow_manager
but rather an undo_manager, since that makes it clear that what is
being backed up is not the work that we are about to do, but rather
the original contents of the disk.  I would also rename the
ext4_replay tool something like e2undofs, since replay has the
connotation that you are applying changes that were done in a log,
when in fact what you are really doing is backing out changes.

We want to make sure we avoid this confusion, since Xen and UML and
most other virtualization engines, for good reasons, do the exact
opposite --- the COW file contains the changes, and the base
file/device remains unchanged.

 + retval = tdb_store(data-tdb, tdb_key, tdb_data, TDB_INSERT);
 + if (retval == -1) {
 + /* FIXME!! Not sure what should be the error */
 + retval = EXT2_ET_SHORT_WRITE;
 + goto error_out;
 + }

You need to write the tdb_store() around a tdb_transaction_start() and
tdb_transaction_commit() pair, so that we can be sure the original
data is actually synced to disk.  Otherwise, on a crash, we could have
a situation where the original data on disk has been overwritten, but
the backup of the original block was never forced to disk!

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-05-07 Thread Amit K. Arora

On Mon, May 07, 2007 at 03:40:26PM +0300, Pekka Enberg wrote:
 On 4/26/07, Amit K. Arora [EMAIL PROTECTED] wrote:
  /*
 + * ext4_ext_try_to_merge:
 + * tries to merge the ex extent to the next extent in the tree.
 + * It always tries to merge towards right. If you want to merge towards
 + * left, pass ex - 1 as argument instead of ex.
 + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
 + * 1 if they got merged.
 + */
 +int ext4_ext_try_to_merge(struct inode *inode,
 +   struct ext4_ext_path *path,
 +   struct ext4_extent *ex)
 +{
 
 Please either use proper kerneldoc format or drop
 ext4_ext_try_to_merge from the comment.

Ok, Thanks.

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Ulrich Drepper


Jakub Jelinek wrote:

is what glibc does ATM.  Seems we violate the case where len == 0, as
EINVAL in that case is shall fail.  But reading the standard to imply
negative len is ok is too much guessing, there is no word what it means
when len is negative and
required storage for regular file data starting at offset and continuing for len 
bytes
doesn't make sense for negative size.  


This wording has already been cleaned up.  The current draft for the 
next revision reads:



[EINVAL] The len argument is less than or equal to zero, or the offset
 argument is less than zero, or the underlying file system does not
 support this operation.


I still don't like it since len==0 shouldn't create an error (it's 
inconsistent) but len0 is already outlawed.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][take 4] e2fsprogs: Add ext4migrate

2007-05-07 Thread Theodore Tso

On Mon, May 07, 2007 at 08:56:11AM +0530, Aneesh Kumar K.V wrote:
 Andreas Dilger wrote:
 If this code could also (or optionally just) increase the size of inodes
 it would be very useful.  AFAIK right now it will only change the inodes
 from block-mapped to extent-mapped?

Actually, changing the inode size is actually going to be more useful
to a larger number of people, since they can use it today to support
in-inode EA's.  In addition, changing the inode size must be done
off-line, whereas it's not so clear that off-line conversion to
extents is the best way to go (more on that below).

One generic concern I have about simply migrating existing files to
use extents is that unless we actually also combine the tool with
defragmentation support, the file ends up being fairly badly
fragmented simply because of the holes left in the file from the
indirect blocks.  Therefore the tree ends up being far larger than it
needs to be, and future attempts allocate blocks in that block group
will fill in the holes, further degrading the performance of the
filesystem and whatever file is being written at the time.

The other reason why increasing the inode size would actually be more
valuable is that right now, if we have the disk space, the user can do
on-line migration of a file simply by copying it and then moving it
over the original file --- and if we have delayed allocation support
and sufficient memory, we can avoid the fragmentation problems
currently with the off-line ext4migration approach.

So we may need to think a bit about what's the best way to handle
ext4migrate.  It may be that correct approach is to combine it with a
defragmentation tool, either on-line or off-line.  

 That is correct. My next step is to enhance the tool to support 
 migration to large inodes.

I would actually suggest that the place to add that functionality
would be via tune2fs -I inode_size.  Since mke2fs -I inode_size
is used to set the inode size, it makes sense that tune2fs -I would
change the inode size.  Obviously this would have to be done when the
filesystem is not mounted, and tune2fs should abort gracefully if
determines that the filesystem is mounted.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][take 4] e2fsprogs: Add ext4migrate

2007-05-07 Thread Aneesh Kumar K.V


Hi,

Theodore Tso wrote:

On Mon, May 07, 2007 at 08:56:11AM +0530, Aneesh Kumar K.V wrote:

Andreas Dilger wrote:

If this code could also (or optionally just) increase the size of inodes
it would be very useful.  AFAIK right now it will only change the inodes

from block-mapped to extent-mapped?


Actually, changing the inode size is actually going to be more useful
to a larger number of people, since they can use it today to support
in-inode EA's.  In addition, changing the inode size must be done
off-line, whereas it's not so clear that off-line conversion to
extents is the best way to go (more on that below).




One of the option i was thinking was to use this tool to migrate to 
extent map and then towards the end use the online defrag ioctl to 
defrag the resulting ext4 inode.


That way we don't need to add kernel code that convert ext3 inode to 
ext4 inode while defragmentation. Last time i looked at the online 
defrag code, i was not able to figure out a easy way to take indirect 
map inode as input, then defrag the same and write the result in extent 
map.



-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][take 4] e2fsprogs: Add ext4migrate

2007-05-07 Thread Aneesh Kumar K.V




Theodore Tso wrote:


That is correct. My next step is to enhance the tool to support 
migration to large inodes.


I would actually suggest that the place to add that functionality
would be via tune2fs -I inode_size.  Since mke2fs -I inode_size
is used to set the inode size, it makes sense that tune2fs -I would
change the inode size.  Obviously this would have to be done when the
filesystem is not mounted, and tune2fs should abort gracefully if
determines that the filesystem is mounted.



That is nice !!. I will  add the large inode migration support to tune2fs.

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: JBD: ext2online wants too many credits (744 256)

2007-05-07 Thread Frank van Maarseveen

On Sun, May 06, 2007 at 09:40:14PM -0700, Andrew Morton wrote:
 On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen [EMAIL PROTECTED] 
 wrote:
 
  Steps to reproduce:
  Create a 3G partition, say /dev/vol1/project
  mke2fs -j -b 4096 /dev/vol1/project 22812
  mount it
  ext2online /dev/vol1/project said:
  
  | ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
  | ext2online: ext2_ioctl: No space left on device
  | 
  | ext2online: unable to resize /dev/mapper/vol1-project
  
  kernel said:
  
  | JBD: ext2online wants too many credits (721  256)

There's a threshold for the problem depending on the initial
size. This one fails:

mke2fs -j -b 4096 /dev/3GB-blockdev 32768
(mount + ext2online or resize2fs)
kernel: JBD: resize2fs wants too many credits (1034  1024)

Add one block to the initial mke2fs (32768+1 == 32769) and the
problem is gone.

Without the -b 4096 there's another resize problem

mke2fs -j /dev/loop1 2048
mount /dev/loop1 /1
resize2fs /dev/loop1

says:
resize2fs 1.40-WIP (14-Nov-2006)
Filesystem at /dev/loop1 is mounted on /1; on-line resizing required
old desc_blocks = 1, new_desc_blocks = 12
Performing an on-line resize of /dev/loop1 to 3072000 (1k) blocks.
resize2fs: Invalid argument While trying to add group #256

and the kernel says:

May  7 15:36:08 lokka EXT3-fs warning (device loop1): verify_reserved_gdb: 
May  7 15:36:08 lokka reserved GDT 10 missing grp 1 (8202)
May  7 15:36:08 lokka  

After that, the filesystem has been resized to 2GB. I recall a 2G
(?) limit for ext3 resizing with 1k blocksize but trying the above with
4096 1k blocks seems to work. fsck says it's ok all the time.

-- 
Frank
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: JBD: ext2online wants too many credits (744 256)

2007-05-07 Thread Theodore Tso

On Mon, May 07, 2007 at 12:26:26AM +0200, Frank van Maarseveen wrote:
 2.6.20.6, FC4:
 
 |JBD: ext2online wants too many credits (744  256)

You would be better off using resize2fs from e2fsprogs 1.39, BTW

About the only reason to keep using the ext2resize packages is
ext2prepare, which is the off-line tool to add a resizing inode after
the fact, and the main reason that functionality hasn't been absorbed
into e2fsprogs is that it needs to be completely rewritten before I'm
willing to trust and maintain it...

 What is the limitation I should be aware of? Has it something to do with
 the journal log size?

Yes, the problem is the journal log size.  Since the original
filesystem was so small, the journal was kept small so as not to use a
huge percentage of the filesystem size.  Of course, with a larger
filesystem you probably want a larger journal anyway to get better
performance.  So that brings up a related problem, which is we don't
have a way of dynamically increasing the size of the journal, which we
probably would want to do as part of resizing the filesystem size.

So the short-term workaround for now, if you know that you're going to
take a filesystem which is 22M and increase it to 3G, is to create it
with largish journal in the first place:

mke2fs -j -b 4096 -J size=16M /dev/vol1/project 22812

Note that if you were creating a 3G filesystem from scratch, by
default it would be using a 128M journal --- but it's a little hard to
fit a 128M journal in a 22M filesystem.  :-/

We're going to have to rethink the defaults of the journal size as it
relates to on-line resizing, but there might not be a lot of great
answers here.  We'll also have to look at the on-line resizing code
and see if there's a way to break up the resize operation into smaller
transactions as a way of avoiding this problem --- but that would
still leave the user stuck with a pathetically small 4M journal on a
3G filesystem.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: JBD: ext2online wants too many credits (744 256)

2007-05-07 Thread david


On Mon, 7 May 2007, Theodore Tso wrote:


We're going to have to rethink the defaults of the journal size as it
relates to on-line resizing, but there might not be a lot of great
answers here.  We'll also have to look at the on-line resizing code
and see if there's a way to break up the resize operation into smaller
transactions as a way of avoiding this problem --- but that would
still leave the user stuck with a pathetically small 4M journal on a
3G filesystem.


why not just allocate a new journal (something along the lines of convert 
the ext3 to ext2 then back to ext3 would be the ugly, scripted way of 
doing it)


or is the problem that you are trying to resize things without remounting 
them (and therefor without flushing the journal)


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Dave Kleikamp

On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
 On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
  On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] 
  wrote:

   +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
   len)
   +{
   + handle_t *handle;
   + ext4_fsblk_t block, max_blocks;
   + int ret, ret2, nblocks = 0, retries = 0;
   + struct buffer_head map_bh;
   + unsigned int credits, blkbits = inode-i_blkbits;
   +
   + /* Currently supporting (pre)allocate mode _only_ */
   + if (mode != FA_ALLOCATE)
   + return -EOPNOTSUPP;
   +
   + if (!(EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL))
   + return -ENOTTY;
  
  So we don't implement fallocate on bitmap-based files!  Well that's huge
  news.  The changelog would be an appropriate place to communicate this,
  along with reasons why, or a description of the plan to fix it.
 
 Ok. Will add this in the function description as well.
 
  Also, posix says nothing about fallocate() returning ENOTTY.
 
 Right. I don't seem to find any suitable error from posix description.
 Can you please suggest an error code which might make more sense here ?
 Will -ENOTSUPP be ok ? Since we want to say here that we don't support
 non-extent files.

Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
here, and fall back to the current library code to do preallocation?
This way, the caller of fallocate() will never see this return code, so
it won't violate posix.

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: JBD: ext2online wants too many credits (744 256)

2007-05-07 Thread Theodore Tso

On Mon, May 07, 2007 at 07:46:38AM -0700, [EMAIL PROTECTED] wrote:
 On Mon, 7 May 2007, Theodore Tso wrote:
 
 We're going to have to rethink the defaults of the journal size as it
 relates to on-line resizing, but there might not be a lot of great
 answers here.  We'll also have to look at the on-line resizing code
 and see if there's a way to break up the resize operation into smaller
 transactions as a way of avoiding this problem --- but that would
 still leave the user stuck with a pathetically small 4M journal on a
 3G filesystem.
 
 why not just allocate a new journal (something along the lines of convert 
 the ext3 to ext2 then back to ext3 would be the ugly, scripted way of 
 doing it)
 
 or is the problem that you are trying to resize things without remounting 
 them (and therefor without flushing the journal)

Yes, I'm referring to the difficulties of getting this right when
doing online (mounted) resizing, as that was the context of the
original bug report.

Clearly the right thing for resize2fs to do when it's doing an
off-line resize is to just adjust the journal inode and make it
bigger.  I'll add that to my todo list, but of course, patches to do
that would be gratefully accepted...

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][take 4] e2fsprogs: Add ext4migrate

2007-05-07 Thread Theodore Tso

On Mon, May 07, 2007 at 07:16:31PM +0530, Aneesh Kumar K.V wrote:
 
 One of the option i was thinking was to use this tool to migrate to 
 extent map and then towards the end use the online defrag ioctl to 
 defrag the resulting ext4 inode.

As we discussed on the phone, my recommendation would be to take your
existing code and move it into the kernel so that triggered off of an
ioctl, your code could migrate an inode from using indirect blocks to
extents while a filesystem is mounted.  The main things you will need
to watch for is to make sure the inode is locked so that another
process doesn't try to extend or truncate it, and to use the JBD layer
to provide appropriate journalling support.

Given that ext4migrated imported the kernel extent functions, it
should be relatively straightforward to simply make them use the
kernel extent functions while in kernel space.

Once the the inode has been converted on-line then it can be
defragmented on-line.  That would be much more convenient than having
to unmount the filesystem to do the off-line migration, followed by
mounting it to do the defragmentation.

Regards,

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults

2007-05-07 Thread Eric Sandeen

One of our testers filed a bug that said mkfs.ext3 is much slower when 
mke2fs.conf is missing...

This is because the shipped defaults in mke2fs.conf do not match the shipped 
defaults in the
mkfs code itself; he wound up making a 1k block filesystem on a very large 
block device,
for example.

So - How about this patch, to bring them back into line?  Which makes me 
wonder; having
defaults in 2 different places is bound to get out of sync; should we instead 
generate
both code  config file defaults (and maybe man page defaults) from a common 
source?

Anyway, here's a patch to bring mke2fs.conf and mke2fs.c into line for current 
defaults...

Thanks,
-Eric

Signed-off-by: Eric Sandeen [EMAIL PROTECTED]

Index: e2fsprogs-hg/misc/mke2fs.c
===
--- e2fsprogs-hg.orig/misc/mke2fs.c
+++ e2fsprogs-hg/misc/mke2fs.c
@@ -1288,7 +1288,8 @@ static void PRS(int argc, char *argv[])
tmp = tmp2 = NULL;
if (fs_param.s_rev_level != EXT2_GOOD_OLD_REV) {
profile_get_string(profile, defaults, base_features, 0,
-  filetype,sparse_super, tmp);
+  
sparse_super,filetype,resize_inode,dir_index,
+  tmp);
profile_get_string(profile, fs_types, fs_type, 
   base_features, tmp, tmp2);
edit_feature(tmp2, fs_param.s_feature_compat);
@@ -1365,7 +1366,7 @@ static void PRS(int argc, char *argv[])

if (blocksize = 0) {
profile_get_integer(profile, defaults, blocksize, 0,
-   1024, use_bsize);
+   4096, use_bsize);
profile_get_integer(profile, fs_types, fs_type, 
blocksize, use_bsize, use_bsize);
 


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Accomodate 32-bit uid/guid values in e2fsprogs

2007-05-07 Thread Eric Sandeen

e2fsprogs doesn't handle large ( 16 bit) UID/GID...

Create an fs by user with large UID:

bash-3.1$ id
uid=501666(newuser) gid=501666(newuser) groups=501666(newuser)
bash-3.1$ /sbin/mke2fs -q fsfile
fsfile is not a block special device.
Proceed anyway? (y,n) y

Now mount it:

[EMAIL PROTECTED] tmp]# mount -o loop fsfile /mnt/tmp/
[EMAIL PROTECTED] tmp]# ls -la /mnt/tmp/
total 13
drwxr-xr-x 3 42914 42914  1024 May  7 11:55 .
drwxr-xr-x 4 root  root 28 Jan 23 10:34 ..
drwx-- 2 root  root  12288 May  7 11:55 lost+found

uid/gid is wrong (only bottom 16 bits)!  (it's wrong on disk...)

And even if I get it right on disk:

[EMAIL PROTECTED] tmp]# ls -lan /mnt/tmp/
total 13
drwxr-xr-x 3 501666 501666  1024 May  7 11:57 .
drwxr-xr-x 4  0  028 Jan 23 10:34 ..
drwx-- 2  0  0 12288 May  7 11:57 lost+found

debugfs has further problems:

[EMAIL PROTECTED] tmp]# /sbin/debugfs /tmp/fsfile 
debugfs 1.39 (29-May-2006)
debugfs:  stat .
Inode: 2   Type: directoryMode:  0755   Flags: 0x0   Generation: 0
User: 42914   Group: 42914   Size: 1024
File ACL: 0Directory ACL: 0
Links: 3   Blockcount: 2
Fragment:  Address: 0Number: 0Size: 0
ctime: 0x463f5a8a -- Mon May  7 11:57:46 2007
atime: 0x463f5a90 -- Mon May  7 11:57:52 2007
mtime: 0x463f5a8a -- Mon May  7 11:57:46 2007
BLOCKS:
(0):508
TOTAL: 1

debugfs:  ls -l
  2   40755 (2)  42914  429141024  7-May-2007 11:57 .
  2   40755 (2)  42914  429141024  7-May-2007 11:57 ..
 11   40700 (2)  0  0   12288  7-May-2007 11:57 lost+found

The attached patch creates inode_uid() and inode_gid() macros to accommodate
the high bits when reading uid/gid, on platforms that support it.

It also accommodates large UID/GID at mkfs time.

Signed-off-by: Eric Sandeen [EMAIL PROTECTED]

Index: e2fsprogs-hg/debugfs/debugfs.c
===
--- e2fsprogs-hg.orig/debugfs/debugfs.c
+++ e2fsprogs-hg/debugfs/debugfs.c
@@ -534,7 +534,7 @@ void internal_dump_inode(FILE *out, cons
prefix, 
inode-i_mode  0777, inode-i_flags, inode-i_generation);
fprintf(out, %sUser: %5d   Group: %5d   Size: ,
-   prefix, inode-i_uid, inode-i_gid);
+   prefix, inode_uid(*inode), inode_gid(*inode));
if (LINUX_S_ISREG(inode-i_mode)) {
__u64 i_size = (inode-i_size |
((unsigned long long)inode-i_size_high  32));
Index: e2fsprogs-hg/debugfs/ls.c
===
--- e2fsprogs-hg.orig/debugfs/ls.c
+++ e2fsprogs-hg/debugfs/ls.c
@@ -88,7 +88,7 @@ static int list_dir_proc(ext2_ino_t dir 
}
fprintf(ls-f, %c%6u%c %6o (%d)  %5d  %5d   , lbr, ino, rbr,
inode.i_mode, dirent-name_len  8,
-   inode.i_uid, inode.i_gid);
+   inode_uid(inode), inode_gid(inode));
if (LINUX_S_ISDIR(inode.i_mode))
fprintf(ls-f, %5d, inode.i_size);
else
Index: e2fsprogs-hg/debugfs/lsdel.c
===
--- e2fsprogs-hg.orig/debugfs/lsdel.c
+++ e2fsprogs-hg/debugfs/lsdel.c
@@ -23,7 +23,7 @@
 struct deleted_info {
ext2_ino_t  ino;
unsigned short  mode;
-   unsigned short  uid;
+   __u32   uid;
__u64   size;
time_t  dtime;
int num_blocks;
@@ -160,7 +160,7 @@ void do_lsdel(int argc, char **argv)

delarray[num_delarray].ino = ino;
delarray[num_delarray].mode = inode.i_mode;
-   delarray[num_delarray].uid = inode.i_uid;
+   delarray[num_delarray].uid = inode_uid(inode);
delarray[num_delarray].size = inode.i_size;
if (!LINUX_S_ISDIR(inode.i_mode))
delarray[num_delarray].size |= 
Index: e2fsprogs-hg/e2fsck/message.c
===
--- e2fsprogs-hg.orig/e2fsck/message.c
+++ e2fsprogs-hg/e2fsck/message.c
@@ -304,12 +304,10 @@ static _INLINE_ void expand_inode_expres
  inode-i_dir_acl : 0));
break;
case 'u':
-   printf(%d, (inode-i_uid |
- (inode-osd2.linux2.l_i_uid_high  16)));
+   printf(%d, inode_uid(*inode));
break;
case 'g':
-   printf(%d, (inode-i_gid |
- (inode-osd2.linux2.l_i_gid_high  16)));
+   printf(%d, inode_gid(*inode));
break;
case 't':
if (LINUX_S_ISREG(inode-i_mode)) 
Index: e2fsprogs-hg/lib/ext2fs/ext2_fs.h
===
---

Re: Creating a 32bit blocks filesystem.

2007-05-07 Thread Jose R. Santos

On Mon, 7 May 2007 11:19:52 -0500
Jose R. Santos [EMAIL PROTECTED] wrote:
 I tried the patches and while the tools such as mkfs and debugfs seem
 to work fine, I am still unable to mount a filesystem with block
 numbers exceeding 32 bits.

Correction,

I tested debugfs on a 32bit blocks filesystem.  Both debugfs and fsck
failed when attempting to use a 64bit filesystem.

-JRS
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults

2007-05-07 Thread Theodore Tso

On Mon, May 07, 2007 at 11:31:22AM -0500, Eric Sandeen wrote: One of
 our testers filed a bug that said mkfs.ext3 is much slower when
 mke2fs.conf is missing...
 
 This is because the shipped defaults in mke2fs.conf do not match the
 shipped defaults in the mkfs code itself; he wound up making a 1k
 block filesystem on a very large block device, for example.
 
 So - How about this patch, to bring them back into line?  

It doesn't actually bring them completely back into line, since mke2fs
will use different block sizes depending on the size of the
filesystem.  So your patch makes the default probably a bit more
reasonable, and so I'll probably end up applying it, but it definitely
isn't a complete replacement for /etc/mke2fs.conf.

How likely do you think the case will be that mke2fs.conf would be
missing?  I'm trying to figure out how high priority of an item this
really is.

 Which makes me wonder; having defaults in 2 different places is
 bound to get out of sync; should we instead generate both code 
 config file defaults (and maybe man page defaults) from a common
 source?

I had been working on the assumption that the defaults if mke2fs.conf
were not present were more in the nature of emergency defaults as
opposed something that could be used a fully functional set of
configuration parameters.  So the assumption was that when you
installed the RPM, mke2fs.conf would also be there.

We could enhance the profile code so that it could read in the profile
from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs,
but that adds bloat --- the question is how necessary do we think that
really is?

Regards,

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: JBD: ext2online wants too many credits (744 256)

2007-05-07 Thread Andreas Dilger

On May 06, 2007  21:40 -0700, Andrew Morton wrote:
 On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen [EMAIL PROTECTED] 
 wrote:
 
  2.6.20.6, FC4:
  
  I created a 91248k ext3 fs with 4k blocksize:
  
  Next, I tried to resize it to about 3G using ext2online while mounted:
  
  At that time the kernel said:
  
  |JBD: ext2online wants too many credits (744  256)
  
  What is the limitation I should be aware of? Has it something to do with
  the journal log size?

Yes, for very small filesystems the default journal size is only 4MB, and
the maximum transaction size is 1MB (256 blocks).

If the filesystem was 128MB or larger in the initial mke2fs then the
journal would be 16MB and the resize would get up to 1024 blocks for
the transaction.

I'm not sure what the right solution is for this.  If you know you will
be resizing the fs you could increase the initial journal size at mke2fs
time (-J size=16).  Alternately, it is possible resize to an intermediate
size and then delete and recreate the journal via tune2fs (which would
be the larger size by default) but that can only be done offline.  

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger

On May 03, 2007  21:31 -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
  + * ext4_fallocate:
  + * preallocate space for a file
  + * mode is for future use, e.g. for unallocating preallocated blocks etc.
  + */
 
 This description is rather thin.  What is the filesystem's actual behaviour
 here?  If the file is using extents then the implementation will do
 something.  If the file is using bitmaps then we will do something else.
 
 But what?   Here is where it should be described.

My understanding is that glibc will handle zero-filling of files for
filesystems that do not support fallocate().

  +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
  len)
  +{
  +   handle_t *handle;
  +   ext4_fsblk_t block, max_blocks;
  +   int ret, ret2, nblocks = 0, retries = 0;
  +   struct buffer_head map_bh;
  +   unsigned int credits, blkbits = inode-i_blkbits;
  +
  +   /* Currently supporting (pre)allocate mode _only_ */
  +   if (mode != FA_ALLOCATE)
  +   return -EOPNOTSUPP;
  +
  +   if (!(EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL))
  +   return -ENOTTY;
 
 So we don't implement fallocate on bitmap-based files!  Well that's huge
 news.  The changelog would be an appropriate place to communicate this,
 along with reasons why, or a description of the plan to fix it.
 
 Also, posix says nothing about fallocate() returning ENOTTY.

I _think_ this is to convince glibc to do the zero-filling in userspace,
but I'm not up on the API specifics.

  +   block = offset  blkbits;
  +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits)  blkbits)
  +- block;
  +   mutex_lock(EXT4_I(inode)-truncate_mutex);
  +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
  +   mutex_unlock(EXT4_I(inode)-truncate_mutex);
 
 Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
 space, and that this disk space will require an arbitrary amount of
 metadata, how can we work out how much journal space we'll be needing
 without at least looking at `len'?

Good question.

The uninitialized extent can cover up to 128MB with a single entry.
If @path isn't specified, then ext4_ext_calc_credits_for_insert()
function returns the maximum number of extents needed to insert a leaf,
including splitting all of the index blocks.  That would allow up to 43GB
(340 extents/block * 128MB) to be preallocated, but it still needs to take
the size of the preallocation into account (adding 3 blocks per 43GB - a
leaf block, a bitmap block and a group descriptor).

Also, since @path is not being given then truncate_mutex is not needed.

  +   ret = ext4_ext_get_blocks(handle, inode, block,
  + max_blocks, map_bh,
  + EXT4_CREATE_UNINITIALIZED_EXT, 0);
  +   BUG_ON(!ret);
 
 BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
 ext4_error() would be safer and more useful here.

Ouch, not very friendly error handling.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults

2007-05-07 Thread Eric Sandeen

Theodore Tso wrote:
 On Mon, May 07, 2007 at 11:31:22AM -0500, Eric Sandeen wrote: One of
 our testers filed a bug that said mkfs.ext3 is much slower when
 mke2fs.conf is missing...

 This is because the shipped defaults in mke2fs.conf do not match the
 shipped defaults in the mkfs code itself; he wound up making a 1k
 block filesystem on a very large block device, for example.

 So - How about this patch, to bring them back into line?  
 
 It doesn't actually bring them completely back into line, since mke2fs
 will use different block sizes depending on the size of the
 filesystem.  So your patch makes the default probably a bit more
 reasonable, and so I'll probably end up applying it, but it definitely
 isn't a complete replacement for /etc/mke2fs.conf.

Well, sure, I don't mean for it to *replace* mke2fs.conf...

Hm, does having

[defaults]
blocksize = 4096

in mke2fs.conf turn off the blocksize heuristics and force 4k?  is
what's in mke2fs.conf a starting point or an absolute?  I guess I need
to read up on the code...

 How likely do you think the case will be that mke2fs.conf would be
 missing?  I'm trying to figure out how high priority of an item this
 really is.

Well, not too likely, although for some reason I guess it happened in
the installer root in FC6 or so.  That's what raised the issue.

 We could enhance the profile code so that it could read in the profile
 from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs,
 but that adds bloat --- the question is how necessary do we think that
 really is?

I guess it doesn't really sound *necessary* - it's just that if we have
2 different defaults and they drift, it can be confusing...

-Eric
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.21-ext4-1

2007-05-07 Thread Mingming Cao

On Mon, 2007-04-30 at 11:14 -0400, Theodore Ts'o wrote:
 I've respun the ext4 development patchset, with Amit's updated fallocate
 patches.  I've added Dave's patch to add ia64 support to the fallocate
 system call, but *not* the XFS fallocate support patches.  (Probably
 better for them to live in an xfs tree, where they can more easily
 tested and updated.)  Yes, we haven't reached complete closure on the
 fallocate system call calling convention, but it's enough for us to get
 more testing in -mm.
 
 Also added Johann's jbd2-stats-through-procfs patches; it provides
 useful help in turning the size of the journal, which will be useful in
 benchmarking efforts.  In addition, Alex Tomas's patch to free
 just-allocated patches when there is an error inserting the extent into
 the extent tree has also been included.
 
 The patches have been compile-tested on x86, and compile/run-tested on
 x86/UML.  Would appreciate reports about testing on other platforms.
 
I have tested this patch series on ppc64, x86_64 with
dbench/tiobench/fsx, all runs fine. 

I am not sure what level of testing Amit has done about the fallocate()
and preallocation code on various archs. I couldn't find a available
s390 and ia64 machines with free partition yet.

In any case, it would be useful to add a new set of testsuites for the
new fallocate() syscall and fsstress in LTP testsuites to automatically
the preallocation code in ext4/XFS.

thanks,
Mingming
 Thanks,
 
   - Ted
 
 P.S.  One bug which I've noted --- if there is a failure due to disk
 filling up, running e2fsck on the filesystem will show that the i_blocks
 fields on the inodes where there was a failure to allocate disk blocks
 are left incorrect.  I'm guessing this is a bug in the delayed
 allocation patches.  Alex, when you have a moment, could you take a
 look?  Thanks!!
 -
 To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton

On Mon, 7 May 2007 05:37:54 -0600
Andreas Dilger [EMAIL PROTECTED] wrote:

   + block = offset  blkbits;
   + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits)  blkbits)
   +  - block;
   + mutex_lock(EXT4_I(inode)-truncate_mutex);
   + credits = ext4_ext_calc_credits_for_insert(inode, NULL);
   + mutex_unlock(EXT4_I(inode)-truncate_mutex);
  
  Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
  space, and that this disk space will require an arbitrary amount of
  metadata, how can we work out how much journal space we'll be needing
  without at least looking at `len'?
 
 Good question.
 
 The uninitialized extent can cover up to 128MB with a single entry.
 If @path isn't specified, then ext4_ext_calc_credits_for_insert()
 function returns the maximum number of extents needed to insert a leaf,
 including splitting all of the index blocks.  That would allow up to 43GB
 (340 extents/block * 128MB) to be preallocated, but it still needs to take
 the size of the preallocation into account (adding 3 blocks per 43GB - a
 leaf block, a bitmap block and a group descriptor).

I think the use of ext4_journal_extend() (as Amit has proposed) will help
here, but it is not sufficient.

Because under some circumstances, a journal_extend() failure could mean
that we fail to allocate all the required disk space.  If it is infrequent
enough, that is acceptable when the caller is using fallocate() for
performance reasons.

But it is very much not acceptable if the caller is using fallocate() for
space-reservation reasons.  If you used fallocate to reserve 1GB of disk
and fallocate() succeeded and you later get ENOSPC then you'd have a
right to get a bit upset.

So I think the ext3/4 fallocate() implementation will need to be
implemented as a loop: 

while (len) {
journal_start();
len -= do_fallocate(len, ...);
journal_stop();
}


Now the interesting question is: what do we do if we get halfway through
this loop and then run out of space?  We could leave the disk all filled up
and then return failure to the caller, but that's pretty poor behaviour,
IMO.



Does the proposed implementation handle quotas correctly, btw?  Has that
been tested?


Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only.  I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][take 4] e2fsprogs: Add ext4migrate

2007-05-07 Thread Andreas Dilger

On May 07, 2007  19:16 +0530, Aneesh Kumar K.V wrote:
 Theodore Tso wrote:
 Actually, changing the inode size is actually going to be more useful
 to a larger number of people, since they can use it today to support
 in-inode EA's.  In addition, changing the inode size must be done
 off-line, whereas it's not so clear that off-line conversion to
 extents is the best way to go (more on that below).
 
 That way we don't need to add kernel code that convert ext3 inode to 
 ext4 inode while defragmentation. Last time i looked at the online 
 defrag code, i was not able to figure out a easy way to take indirect 
 map inode as input, then defrag the same and write the result in extent 
 map.

I think the online defrag code _should_ be fairly easily made to handle
defrag of block-mapped files, so long as it is always creating extent-
mapped files in the end.  All that should be necessary is to have the
kernel read data from the block-mapped file and write it into the newly
mapped blocks.  That functionality needs to exist for a long time anyways
to support upgrading the filesystem so it shouldn't be a huge burden.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: JBD: ext2online wants too many credits (744 256)

2007-05-07 Thread Andreas Dilger

On May 07, 2007  11:50 -0400, Theodore Tso wrote:
 On Mon, May 07, 2007 at 07:46:38AM -0700, [EMAIL PROTECTED] wrote:
  On Mon, 7 May 2007, Theodore Tso wrote:
  We're going to have to rethink the defaults of the journal size as it
  relates to on-line resizing, but there might not be a lot of great
  answers here.  We'll also have to look at the on-line resizing code
  and see if there's a way to break up the resize operation into smaller
  transactions as a way of avoiding this problem --- but that would
  still leave the user stuck with a pathetically small 4M journal on a
  3G filesystem.
 
 Clearly the right thing for resize2fs to do when it's doing an
 off-line resize is to just adjust the journal inode and make it
 bigger.  I'll add that to my todo list, but of course, patches to do
 that would be gratefully accepted...

For that matter, there was a paper from U. Wisconsin showing that
having the journal in the middle of the filesystem gave noticably
better performance due to lower average seek times.  If anyone is
looking at messing with the journal that is probably also a good
and easy place to start (e.g. may be as easy as just giving a goal
block of s_blocks_count / 2 to the journal create code).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] - Make mke2fs.c defaults match mke2fs.conf defaults

2007-05-07 Thread Andreas Dilger

On May 07, 2007  14:52 -0500, Eric Sandeen wrote:
 Theodore Tso wrote:
  How likely do you think the case will be that mke2fs.conf would be
  missing?  I'm trying to figure out how high priority of an item this
  really is.
 
 Well, not too likely, although for some reason I guess it happened in
 the installer root in FC6 or so.  That's what raised the issue.

Ah, good point - there are probably lots of installers and rescue disks
that grab mke2fs but don't know anything about /etc/mke2fs.conf.

  We could enhance the profile code so that it could read in the profile
  from a memory buffer, and simply compile /etc/mke2fs.conf into mke2fs,
  but that adds bloat --- the question is how necessary do we think that
  really is?
 
 I guess it doesn't really sound *necessary* - it's just that if we have
 2 different defaults and they drift, it can be confusing...

Since the shipped mke2fs.conf is only on the order of a few hundred bytes
I don't think it is a huge issue to include it.  I like the idea of it
being compiled into the code and yet consistent with the default install.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton

On Mon, 7 May 2007 15:21:04 -0700
Andreas Dilger [EMAIL PROTECTED] wrote:

 On May 07, 2007  13:58 -0700, Andrew Morton wrote:
  Final point: it's fairly disappointing that the present implementation is
  ext4-only, and extent-only.  I do think we should be aiming at an ext4
  bitmap-based implementation and an ext3 implementation.
 
 Actually, this is a non-issue.  The reason that it is handled for extent-only
 is that this is the only way to allocate space in the filesystem without
 doing the explicit zeroing.  For other filesystems (including ext3 and
 ext4 with block-mapped files) the filesystem should return an error (e.g.
 -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

hrm, spose so.

It can be a bit suboptimal from the layout POV.  The reservations code will
largely save us here, but kernel support might make it a bit better.

Totally blowing pagecache could be a problem.  Fixable in userspace by
using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso

On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
  Actually, this is a non-issue.  The reason that it is handled for 
  extent-only
  is that this is the only way to allocate space in the filesystem without
  doing the explicit zeroing.  For other filesystems (including ext3 and
  ext4 with block-mapped files) the filesystem should return an error (e.g.
  -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.
 
 It can be a bit suboptimal from the layout POV.  The reservations code will
 largely save us here, but kernel support might make it a bit better.

Actually, the reservations code won't matter, since glibc will fall
back to its current behavior, which is it will do the preallocation by
explicitly writing zeros to the file.  This wlil result in the same
layout as if we had done the persistent preallocation, but of course
it will mean the posix_fallocate() could potentially take a long time
if you're a PVR and you're reserving a gig or two for a two hour movie
at high quality.  That seems suboptimal, granted, and ideally the
application should be warned about this before it calls
posix_fallocate().  On the other hand, it's what happens today, all
the time, so applications won't be too badly surprised.  

If we think applications programmers badly need to know in advance if
posix_fallocate() will be fast or slow, probably the right thing is to
define a new fpathconf() configuration option so they can query to see
whether a particular file will support a fast posix_fallocate().  I'm
not 100% convinced such complexity is really needed, but I'm willing
to be convinced  what do folks think?

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso

On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote:
 Andreas Dilger wrote:
 On May 07, 2007  13:58 -0700, Andrew Morton wrote:
 Final point: it's fairly disappointing that the present implementation is
 ext4-only, and extent-only.  I do think we should be aiming at an ext4
 bitmap-based implementation and an ext3 implementation.
 
 Actually, this is a non-issue.  The reason that it is handled for 
 extent-only
 is that this is the only way to allocate space in the filesystem without
 doing the explicit zeroing.  For other filesystems (including ext3 and
 
 Precisely /how/ do you avoid the zeroing issue, for extents?
 
 If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
 otherwise the implementation is broken.

There is a bit in the extent structure which indicates that the extent
has not been initialized.  When reading from a block where the extent
is marked as unitialized, ext4 returns zero's, to avoid returning the
uninitalized contents of the disk, which might contain someone else's
love letters, p0rn, or other information which we shouldn't leak out.
When writing to an extent which is uninitalized, we may potentially
have to split the extent into three extents in the worst case.

My understanding is that XFS uses a similar implementation; it's a
pretty obvious and standard way to implement allocated-but-not-initialized
extents.

We thought about supporting persistent preallocation for inodes using
indirect blocks, but it would require stealing a bit from each entry
in the indirect block, reducing the maximum size of the filesystem by
two (i.e., 2**31 blocks).  It was decided it wasn't worth the
complexity, given the tradeoffs.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao

On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote:
 On Mon, 7 May 2007 05:37:54 -0600
 Andreas Dilger [EMAIL PROTECTED] wrote:
 
+   block = offset  blkbits;
+   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits)  
blkbits)
+- block;
+   mutex_lock(EXT4_I(inode)-truncate_mutex);
+   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
+   mutex_unlock(EXT4_I(inode)-truncate_mutex);
   
   Now I'm mystified.  Given that we're allocating an arbitrary amount of 
   disk
   space, and that this disk space will require an arbitrary amount of
   metadata, how can we work out how much journal space we'll be needing
   without at least looking at `len'?
  
  Good question.
  
  The uninitialized extent can cover up to 128MB with a single entry.
  If @path isn't specified, then ext4_ext_calc_credits_for_insert()
  function returns the maximum number of extents needed to insert a leaf,
  including splitting all of the index blocks.  That would allow up to 43GB
  (340 extents/block * 128MB) to be preallocated, but it still needs to take
  the size of the preallocation into account (adding 3 blocks per 43GB - a
  leaf block, a bitmap block and a group descriptor).
 
 I think the use of ext4_journal_extend() (as Amit has proposed) will help
 here, but it is not sufficient.
 
 Because under some circumstances, a journal_extend() failure could mean
 that we fail to allocate all the required disk space.  If it is infrequent
 enough, that is acceptable when the caller is using fallocate() for
 performance reasons.
 
 But it is very much not acceptable if the caller is using fallocate() for
 space-reservation reasons.  If you used fallocate to reserve 1GB of disk
 and fallocate() succeeded and you later get ENOSPC then you'd have a
 right to get a bit upset.
 
 So I think the ext3/4 fallocate() implementation will need to be
 implemented as a loop: 
 
   while (len) {
   journal_start();
   len -= do_fallocate(len, ...);
   journal_stop();
   }
 
 

I agree.  There is already a loop in Amit's current's patch to call
ext4_ext_get_blocks() thoug. Question is how much credit should ext4 to
ask for in each journal_start()?
 +/*
 + * ext4_fallocate:
 + * preallocate space for a file
 + * mode is for future use, e.g. for unallocating preallocated blocks etc.
 + */
 +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
 +{


 +   mutex_lock(EXT4_I(inode)-truncate_mutex);
 +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
 +   mutex_unlock(EXT4_I(inode)-truncate_mutex);

I think the calculation is based on the assumption that there is only a
single extent to be inserted, which is the ideal case. But in some cases
we may end up allocating several chunk of blocks(extents) for this
single preallocation request when fs is fragmented (or part of
preallocation request is already fulfilled)

I think we should move this calculation inside the loop as well,and we
really do not need to grab the lock to calculate the credit if the @path
is always NULL, all the function does is mathmatics.

I can't think of any good way to estimate the total credits needed for
this whole preallocation request. Looked at ext4_get_block(), which is
used for DIO code to deal with large amount of block allocation. The
credit reservation is quite weak there too. The DIO_CREDIT is only
(EXT4_RESERVE_TRANS_BLOCKS + 32)

 +   handle=ext4_journal_start(inode, credits +
 +   
 EXT4_DATA_TRANS_BLOCKS(inode-i_sb)+1);
 +   if (IS_ERR(handle))
 +   return PTR_ERR(handle);
 +retry:
 +   ret = 0;
 +   while (ret = 0  ret  max_blocks) {
 +   block = block + ret;
 +   max_blocks = max_blocks - ret;
 +   ret = ext4_ext_get_blocks(handle, inode, block,
 + max_blocks, map_bh,
 + EXT4_CREATE_UNINITIALIZED_EXT, 0);
 +   BUG_ON(!ret);
 +   if (ret  0  test_bit(BH_New, map_bh.b_state)
 +((block + ret)  (i_size_read(inode)  blkbits)))
 +   nblocks = nblocks + ret;
 +   }
 +
 +   if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, retries))
 +   goto retry;
 +
 Now the interesting question is: what do we do if we get halfway through
 this loop and then run out of space?  We could leave the disk all filled up
 and then return failure to the caller, but that's pretty poor behaviour,
 IMO.
 
The current code handles earlier ENOSPC by three times retries. After
that if we still run out of space, then it's propably right to notify
the caller there isn't much space left.

We could extend the block reservation window size before the while loop
so we could get a lower chance to get more fragmented.

 
 Does the proposed implementation handle quotas

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton

On Mon, 07 May 2007 17:00:24 -0700
Mingming Cao [EMAIL PROTECTED] wrote:

  +   while (ret = 0  ret  max_blocks) {
  +   block = block + ret;
  +   max_blocks = max_blocks - ret;
  +   ret = ext4_ext_get_blocks(handle, inode, block,
  + max_blocks, map_bh,
  + EXT4_CREATE_UNINITIALIZED_EXT, 0);
  +   BUG_ON(!ret);
  +   if (ret  0  test_bit(BH_New, map_bh.b_state)
  +((block + ret)  (i_size_read(inode)  
  blkbits)))
  +   nblocks = nblocks + ret;
  +   }
  +
  +   if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, 
  retries))
  +   goto retry;
  +
  Now the interesting question is: what do we do if we get halfway through
  this loop and then run out of space?  We could leave the disk all filled up
  and then return failure to the caller, but that's pretty poor behaviour,
  IMO.
  
 The current code handles earlier ENOSPC by three times retries. After
 that if we still run out of space, then it's propably right to notify
 the caller there isn't much space left.
 
 We could extend the block reservation window size before the while loop
 so we could get a lower chance to get more fragmented.

yes, but my point is that the proposed behaviour is really quite bad.

We will attempt to allocate the disk space and then we will return failure,
having consumed all the disk space and having partially and uselessly
populated an unknown amount of the file.

Userspace could presumably repair the mess in most situations by truncating
the file back again.  The kernel cannot do that because there might be live
data in amongst there.

So we'd need to either keep track of which blocks were newly-allocated and
then free them all again on the error path (doesn't work right across
commit+crash+recovery) or we could later use the space-reservation scheme which
delayed allocation will need to introduce.

Or we could decide to live with the above IMO-crappy behaviour.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao

On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote:
 On Mon, 7 May 2007 19:14:42 -0400
 Theodore Tso [EMAIL PROTECTED] wrote:
 
  On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
Actually, this is a non-issue.  The reason that it is handled for 
extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing.  For other filesystems (including ext3 and
ext4 with block-mapped files) the filesystem should return an error 
(e.g.
-EOPNOTSUPP) and glibc will do manual zero-filling of the file in 
userspace.
   
   It can be a bit suboptimal from the layout POV.  The reservations code 
   will
   largely save us here, but kernel support might make it a bit better.
  
  Actually, the reservations code won't matter, since glibc will fall
  back to its current behavior, which is it will do the preallocation by
  explicitly writing zeros to the file.
 
 No!  Reservations code is *critical* here.  Without reservations, we get
 disastrously-bad layout if two processes were running a large fallocate()
 at the same time.  (This is an SMP-only problem, btw: on UP the timeslice
 lengths save us).
 
 My point is that even though reservations save us, we could do even-better
 in-kernel.
 

In this case, since the number of blocks to preallocate (eg. N=10GB) is
clear, we could improve the current reservation code, to allow callers
explicitly ask for a new window that have the minimum N free blocks for
the blocks-to-preallocated(rather than just have at least 1 free
blocks).

Before the ext4_fallocate() is called, the right reservation window size
is set with the flag to indicating please spend time if needed to find
a window covers at least N free blocks.

So for ex4 block mapped files, later when glibc is doing allocation and
zeroing, the ext4 block-mapped allocator will knows to reserve the right
amount of free blocks before allocating and zeroing 10GB space.

I am not sure whether this worth the effort though.

 But then, a smart application would bypass the glibc() fallocate()
 implementation and would tune the reservation window size and would use
 direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).
 
  This wlil result in the same
  layout as if we had done the persistent preallocation, but of course
  it will mean the posix_fallocate() could potentially take a long time
  if you're a PVR and you're reserving a gig or two for a two hour movie
  at high quality.  That seems suboptimal, granted, and ideally the
  application should be warned about this before it calls
  posix_fallocate().  On the other hand, it's what happens today, all
  the time, so applications won't be too badly surprised.
 
 A PVR implementor would take all this over and would do it themselves, for
 sure.
 
  If we think applications programmers badly need to know in advance if
  posix_fallocate() will be fast or slow, probably the right thing is to
  define a new fpathconf() configuration option so they can query to see
  whether a particular file will support a fast posix_fallocate().  I'm
  not 100% convinced such complexity is really needed, but I'm willing
  to be convinced  what do folks think?
  
 
 An application could do sys_fallocate(one-byte) to work out whether it's
 supported in-kernel, I guess.
 

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao

On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote:
 On Mon, 07 May 2007 17:00:24 -0700
 Mingming Cao [EMAIL PROTECTED] wrote:
 
   +   while (ret = 0  ret  max_blocks) {
   +   block = block + ret;
   +   max_blocks = max_blocks - ret;
   +   ret = ext4_ext_get_blocks(handle, inode, block,
   + max_blocks, map_bh,
   + EXT4_CREATE_UNINITIALIZED_EXT, 
   0);
   +   BUG_ON(!ret);
   +   if (ret  0  test_bit(BH_New, map_bh.b_state)
   +((block + ret)  (i_size_read(inode)  
   blkbits)))
   +   nblocks = nblocks + ret;
   +   }
   +
   +   if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, 
   retries))
   +   goto retry;
   +
   Now the interesting question is: what do we do if we get halfway through
   this loop and then run out of space?  We could leave the disk all filled 
   up
   and then return failure to the caller, but that's pretty poor behaviour,
   IMO.
   
  The current code handles earlier ENOSPC by three times retries. After
  that if we still run out of space, then it's propably right to notify
  the caller there isn't much space left.
  
  We could extend the block reservation window size before the while loop
  so we could get a lower chance to get more fragmented.
 
 yes, but my point is that the proposed behaviour is really quite bad.
 
I agree your point, that's why I mention it only helped the
fragmentation issue but not the ENOSPC case.


 We will attempt to allocate the disk space and then we will return failure,
 having consumed all the disk space and having partially and uselessly
 populated an unknown amount of the file.
 

Not totally useless I think. If only half of the space is preallocated
because run out of space, the application can decide whether it's good
enough to start to use this preallocated space or wait for the fs to
have more free space.

 Userspace could presumably repair the mess in most situations by truncating
 the file back again.  The kernel cannot do that because there might be live
 data in amongst there.
 
 So we'd need to either keep track of which blocks were newly-allocated and
 then free them all again on the error path (doesn't work right across
 commit+crash+recovery) or we could later use the space-reservation scheme 
 which
 delayed allocation will need to introduce.
 
 Or we could decide to live with the above IMO-crappy behaviour.

In fact Amit and I had raised this issue before, whether it's okay to do
allow partial preallocation. At that moment the feedback is it's no much
different than the current zero-out-preallocation behavior: people might
preallocating half-way then later deal with ENOSPC.

We could check the total number of fs free blocks account before
preallocation happens, if there isn't enough space left, there is no
need to bother preallocating.

If there is enough free space, we could make a reservation window that
have at least N free blocks and mark it not stealable by other files. So
later we will not run into the ENOSPC error.

The fs free blocks account is just a estimate though.


Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger

On May 07, 2007  19:02 -0400, Jeff Garzik wrote:
 Andreas Dilger wrote:
 Actually, this is a non-issue.  The reason that it is handled for 
 extent-only is that this is the only way to allocate space in the
 filesystem without doing the explicit zeroing.
 
 Precisely /how/ do you avoid the zeroing issue, for extents?
 
 If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
 otherwise the implementation is broken.

In ext4 (as in XFS) there is a flag stored in the extent that tells if
the extent is initialized or not.  Reads from uninitialized extents will
return zero-filled data, and writes that don't span the whole extent
will cause the uninitialized extent to be split into a regular extent
and one or two uninitialized extents (depending where the write is).

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Accomodate 32-bit uid/guid values in e2fsprogs

2007-05-07 Thread Theodore Tso

On Mon, May 07, 2007 at 12:10:37PM -0500, Eric Sandeen wrote:
 e2fsprogs doesn't handle large ( 16 bit) UID/GID...

Applied, with one correction:

 --- e2fsprogs-hg.orig/misc/mke2fs.c
 +++ e2fsprogs-hg/misc/mke2fs.c
 @@ -492,9 +494,14 @@ static void create_root_dir(ext2_filsys 
   _(while reading root inode));
   exit(1);
   }
 - inode.i_uid = getuid();
 - if (inode.i_uid)
 - inode.i_gid = getgid();
 + uid = getuid();
 + inode.i_uid = uid;
 + inode.i_uid_high = uid  16;
 + if (inode.i_uid) {
^^^

This should be uid instead.  Otherwise, the gid won't be set if the
uid is a multiple of 65536.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] e2fsprogs: Offsets of EAs in inode need not be sorted

2007-05-07 Thread Theodore Tso

On Thu, Apr 19, 2007 at 05:35:36PM +0530, Kalpak Shah wrote:
 This patch removes a code snippet from check_ea_in_inode() in pass1
 which checks if the EA values in the inode are sorted or not. The
 comments in fs/ext*/xattr.c state that the EA values in the external
 EA block are sorted but those in the inode need not be sorted. I
 have also attached a test image which has unsorted EAs in the
 inodes. The current e2fsck wrongly clears the EAs in the inode.

Applied.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] e2fsprogs: Check journal inode sanity and recreate journal

2007-05-07 Thread Theodore Tso

On Thu, Mar 15, 2007 at 02:43:47AM +0530, Kalpak Shah wrote:
 Index: e2fsprogs-1.39/lib/ext2fs/mkjournal.c
 ===
 --- e2fsprogs-1.39.orig/lib/ext2fs/mkjournal.c
 +++ e2fsprogs-1.39/lib/ext2fs/mkjournal.c
 + if (fs-super-s_blocks_count  2048) {
 + fputs((\nFilesystem too small for a journal\n), stderr);
 + return 0;
 + }

Code in lib/ext2fs isn't allowed to do any output to stdio (except for
debugging purposes).  It causes internationalization problems, and
it's just in general a bad idea for ext2fs library code to try to do
any UI.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

39 matches

Mail list logo