from:"Amit K. Arora"

Re: [PATCH] ext4: fix uniniatilized extend splitting error.

2008-01-11 Thread Amit K. Arora

On Thu, Jan 10, 2008 at 02:31:03PM -0700, Andreas Dilger wrote:
 On Jan 10, 2008  17:31 +0300, Dmitry Monakhov wrote:
  While playing with new fancy fallocate interface on ext4 i've triggered
  bug which corrupted my grub :).

 I notice I'm CC'd on this, but in fact Amit wrote the code.  I've CC'd
 him even though I expect he will notice it anyways.

Andreas, thanks for adding me to the CC list!
 
  My testcase:
  
  blksize = 0x1000;
  fd = open(argv[1], O_RDWR|O_CREAT, 0700);
  unsigned long long sz = 0x1000UL;
  /* allocating big blocks chunk */
  syscall(__NR_fallocate, fd, 0, 0UL, sz);
  
  /* grab all other available filesystem space */
  tfd = open(tmp, O_RDWR|O_CREAT|O_DIRECT, 0700);
  while (write(tfd, buf, 4096) > 0); /* loop until ENOSPC */
  fsync(fd); /* just in case */
  while (pos < sz) {
      /* each seek + write operation splits the uninitialized extent
         into three extents. Splitting may require a new extent allocation,
         which will probably fail because of ENOSPC */
  
      lseek(fd, blksize*2 - 1, SEEK_CUR);
      if ((ret = write(fd, "a", 1)) != 1)
          exit(1);
      pos += blksize * 2;
  }
 
 Interesting test, and well thought out...

Dmitry, Good catch and thanks for the patch below !
Please add Acked-by: Amit Arora [EMAIL PROTECTED].
 
 The other item that Amit and I discussed in the past is in the case of
 ENOSPC it would be possible instead of splitting the extent to zero-fill
 the smaller extent (1 block in your test case) and write the whole thing
 as an initialized extent.  This could then either be merged with the
 previous or following allocated extent, or the whole extent zeroed if that
 was not possible.

Yes, this is one of the things pending..
 
 It would add some latency in the worst case to do this in the kernel,
 but this would only happen if there is no free space at all.  It might
 even be desirable to always zero-fill small extents instead of splitting
 uninitialized extents, because a random write of 64kB is not more expensive
 than 4kB and avoids overhead of splitting the nicely contiguous extent tree.

I feel this is debatable and it may not be easy to define what extent size is
small enough. Anyhow, since we merge the extents when possible it should
not be too bad, unless someone deliberately writes to alternate blocks
in the uninitialized extent. Hence, as Mingming suggested, I too think that we
should be doing it only when necessary.
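
To make the splitting discussed above concrete, here is a minimal userspace
sketch (not the ext4 code) of how one write into an uninitialized extent
yields up to three extents. The struct and function names are invented for
this illustration; the kernel path additionally has to find room in the
extent tree for the new entries, which is presumably where ENOSPC can hit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory extent: covers file blocks [start, start + len). */
struct sketch_extent {
	uint32_t start;
	uint32_t len;
	int uninit;	/* 1 = preallocated but not yet written */
};

/*
 * Split the uninitialized extent 'ex' around a write that covers
 * [wstart, wstart + wlen). Fills 'out' with up to three extents and
 * returns how many were produced. The write must lie inside 'ex'.
 */
static int sketch_split(struct sketch_extent ex, uint32_t wstart,
			uint32_t wlen, struct sketch_extent out[3])
{
	int n = 0;
	uint32_t ex_end = ex.start + ex.len;
	uint32_t w_end = wstart + wlen;

	assert(ex.uninit && wlen > 0);
	assert(wstart >= ex.start && w_end <= ex_end);

	if (wstart > ex.start)	/* untouched head stays uninitialized */
		out[n++] = (struct sketch_extent){ ex.start, wstart - ex.start, 1 };
	out[n++] = (struct sketch_extent){ wstart, wlen, 0 };	/* written part */
	if (w_end < ex_end)	/* untouched tail stays uninitialized */
		out[n++] = (struct sketch_extent){ w_end, ex_end - w_end, 1 };
	return n;
}
```

A one-block write in the middle returns 3, while a write that starts at the
extent boundary returns 2, which is why writing to alternate blocks is the
worst case for the extent tree.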

--
Regards,
Amit Arora
 
  Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED]
  ---
   fs/ext4/extents.c |5 +++--
   1 files changed, 3 insertions(+), 2 deletions(-)
  
  diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
  index 8528774..fc8e508 100644
  --- a/fs/ext4/extents.c
  +++ b/fs/ext4/extents.c
  @@ -2320,9 +2320,10 @@ int ext4_ext_get_blocks(handle_t *handle, struct 
  inode *inode,
  ret = ext4_ext_convert_to_initialized(handle, inode,
  path, iblock,
  max_blocks);
  -   if (ret <= 0)
  +   if (ret <= 0) {
  +   err = ret;
  goto out2;
  -   else
  +   } else
  allocated = ret;
  goto outnew;
  }
  -- 
  1.5.3.1.40.g6972-dirty
  
  
  -
  To unsubscribe from this list: send the line unsubscribe linux-ext4 in
  the body of a message to [EMAIL PROTECTED]
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 Cheers, Andreas
 --
 Andreas Dilger
 Sr. Staff Engineer, Lustre Group
 Sun Microsystems of Canada, Inc.

Re: New e2fsprog doc on the ext4 wiki page.

2007-07-25 Thread Amit K. Arora

On Tue, Jul 24, 2007 at 11:07:41AM -0500, Jose R. Santos wrote:
 Hi folks
 
 As discussed in the conference call, we are going to create a new doc
 on the ext4 wiki dedicated to track the development of some of the
 features needed in e2fsprogs.  The page will consist of mostly changes
 needed in order to keep e2fsprogs up to date with mainline ext4 kernel
 code.
 
 I don't plan to add bug fixes, cleanup or trivial changes to the page
 as this would make it hard to keep the page up to date.  The link to
 the page will be:
 
 http://ext4.wiki.kernel.org/index.php?title=E2fsprogs_features_and_patches&action=edit
 
 Comments on what you would like to see on this page or in the initial
 list of features I have gathered below are welcome.
:
:
 Extents support:
 - Patches submitted?

Uninitialized extents:
We will need uninitialized extents support for preallocated blocks
(allocated by fallocate()) too.
I can send a patch for this, but I don't think extents support is there
in the 1.40.2 release. Is there a place where I can find the latest extents
support patch on top of 1.40? I can prepare a patch for uninitialized
extents on top of it.

--
Regards,
Amit Arora

Re: [PATCH 1/6][TAKE7] manpage for fallocate

2007-07-15 Thread Amit K. Arora

On Sat, Jul 14, 2007 at 10:23:42AM +0200, Michael Kerrisk wrote:
 [CC += [EMAIL PROTECTED]]
 
 Amit,
 
Hi Michael,

 Thanks for this page.  I will endeavour to review it in 
 the coming days.  In the meantime, the better address to CC
 me on for man pages stuff is [EMAIL PROTECTED]

Sure.

BTW, this man page has changed a bit and the one in TAKE8 of fallocate
patches is the latest one. You are copied on that too.
I will forward that mail to [EMAIL PROTECTED] id also, so that you
do not miss it. Thanks!

--
Regards,
Amit Arora

 
 Cheers,
 
 Michael
 
  Following is the modified version of the manpage originally submitted by
  David Chinner. Please use `nroff -man fallocate.2 | less` to view.
  
  This includes changes suggested by Heikki Orsila and Barry Naujok.
  
  
  .TH fallocate 2
  .SH NAME
  fallocate \- allocate or remove file space
  .SH SYNOPSIS
  .nf
  .B #include <fcntl.h>
  .PP
  .BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t " len);
  .SH DESCRIPTION
  The
  .B fallocate
  syscall allows a user to directly manipulate the allocated disk space
  for the file referred to by
  .I fd
  for the byte range starting at
  .I offset
  and continuing for
  .I len
  bytes.
  The
  .I mode
  parameter determines the operation to be performed on the given range.
  Currently there are two modes:
  .TP
  .B FALLOC_ALLOCATE
  allocates and initialises to zero the disk space within the given range.
  After a successful call, subsequent writes are guaranteed not to fail
  because
  of lack of disk space.  If the size of the file is less than
  .IR offset + len ,
  then the file is increased to this size; otherwise the file size is left
  unchanged.
  .B FALLOC_ALLOCATE
  closely resembles
  .BR posix_fallocate (3)
  and is intended as a method of optimally implementing this function.
  .B FALLOC_ALLOCATE
  may allocate a larger range than was specified.
  .TP
  .B FALLOC_RESV_SPACE
  provides the same functionality as
  .B FALLOC_ALLOCATE
  except it does not ever change the file size. This allows allocation
  of zero blocks beyond the end of file and is useful for optimising
  append workloads.
  .SH RETURN VALUE
  .B fallocate
  returns zero on success, or an error number on failure.
  Note that
  .I errno
  is not set.
  .SH ERRORS
  .TP
  .B EBADF
  .I fd
  is not a valid file descriptor, or is not opened for writing.
  .TP
  .B EFBIG
  .IR offset + len
  exceeds the maximum file size.
  .TP
  .B EINVAL
  .I offset
  was less than 0, or
  .I len
  was less than or equal to 0.
  .TP
  .B ENODEV
  .I fd
  does not refer to a regular file or a directory.
  .TP
  .B ENOSPC
  There is not enough space left on the device containing the file
  referred to by
  .IR fd .
  .TP
  .B ESPIPE
  .I fd
  refers to a pipe or FIFO.
  .TP
  .B ENOSYS
  The filesystem underlying the file descriptor does not support this
  operation.
  .TP
  .B EINTR
  A signal was caught during execution.
  .TP
  .B EIO
  An I/O error occurred while reading from or writing to a file system.
  .TP
  .B EOPNOTSUPP
  The mode is not supported on the file descriptor.
  .SH AVAILABILITY
  The
  .B fallocate
  system call is available since 2.6.XX
  .SH SEE ALSO
  .BR syscall (2),
  .BR posix_fadvise (3),
  .BR ftruncate (3).
 

[PATCH 6/6][TAKE7] ext4: change for better extent-to-group alignment

2007-07-13 Thread Amit K. Arora

From: Amit Arora [EMAIL PROTECTED]

Change on-disk format for extent to represent uninitialized/initialized extents

This change was suggested by Andreas Dilger.
This patch changes the EXT_MAX_LEN value and the extent code which marks/checks
uninitialized extents. With this change it will be possible to have
initialized extents with 2^15 blocks (earlier the maximum was 2^15 - 1),
which gives better extent-to-group alignment.
The maximum number of blocks is now 2^15 in an initialized extent
and 2^15 - 1 in an uninitialized extent.

This patch takes care of Andreas's suggestion of using EXT_INIT_MAX_LEN
instead of 0x8000 at some places.
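
The ee_len encoding described above can be sketched in plain C as follows.
This is only an illustration of the rule stated in this patch (values up to
0x8000 are initialized lengths, anything above has the MSB set and denotes
an uninitialized extent); the SKETCH_/sketch_ names are made up here and are
not the helpers used in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_INIT_MAX_LEN	(1U << 15)			/* 32768 */
#define SKETCH_UNINIT_MAX_LEN	(SKETCH_INIT_MAX_LEN - 1)	/* 32767 */

/*
 * ee_len above 0x8000 means "uninitialized". Exactly 0x8000 is treated as
 * a full-length initialized extent, because an uninitialized extent of
 * zero length cannot exist.
 */
static int sketch_is_uninitialized(uint16_t ee_len)
{
	return ee_len > SKETCH_INIT_MAX_LEN;
}

/* Number of blocks the extent really covers, with the marker bit removed. */
static uint16_t sketch_actual_len(uint16_t ee_len)
{
	return ee_len <= SKETCH_INIT_MAX_LEN ? ee_len
					     : ee_len - SKETCH_INIT_MAX_LEN;
}

/* Encode a length of 1..32767 blocks as an uninitialized extent. */
static uint16_t sketch_mark_uninitialized(uint16_t len)
{
	assert(len >= 1 && len <= SKETCH_UNINIT_MAX_LEN);
	return len | SKETCH_INIT_MAX_LEN;
}
```

With this rule an initialized extent can span 32768 blocks, while an
uninitialized one tops out at 32767 (stored as 0xFFFF).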

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1106,7 +1106,7 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   unsigned short ext1_ee_len, ext2_ee_len;
+   unsigned short ext1_ee_len, ext2_ee_len, max_len;
 
/*
 * Make sure that either both extents are uninitialized, or
@@ -1115,6 +1115,11 @@ ext4_can_extents_be_merged(struct inode 
if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
return 0;
 
+   if (ext4_ext_is_uninitialized(ex1))
+   max_len = EXT_UNINIT_MAX_LEN;
+   else
+   max_len = EXT_INIT_MAX_LEN;
+
ext1_ee_len = ext4_ext_get_actual_len(ex1);
ext2_ee_len = ext4_ext_get_actual_len(ex2);
 
@@ -1127,7 +1132,7 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > max_len)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1814,7 +1819,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
ex->ee_block = cpu_to_le32(block);
ex->ee_len = cpu_to_le16(num);
-   if (uninitialized)
+   /*
+* Do not mark uninitialized if all the blocks in the
+* extent have been removed.
+*/
+   if (uninitialized && num)
ext4_ext_mark_uninitialized(ex);
 
err = ext4_ext_dirty(handle, inode, path + depth);
@@ -2307,6 +2316,19 @@ int ext4_ext_get_blocks(handle_t *handle
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
 
+   /*
+* See if request is beyond maximum number of blocks we can have in
+* a single extent. For an initialized extent this limit is
+* EXT_INIT_MAX_LEN and for an uninitialized extent this limit is
+* EXT_UNINIT_MAX_LEN.
+*/
+   if (max_blocks > EXT_INIT_MAX_LEN &&
+   create != EXT4_CREATE_UNINITIALIZED_EXT)
+   max_blocks = EXT_INIT_MAX_LEN;
+   else if (max_blocks > EXT_UNINIT_MAX_LEN &&
+create == EXT4_CREATE_UNINITIALIZED_EXT)
+   max_blocks = EXT_UNINIT_MAX_LEN;
+
/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
newex.ee_block = cpu_to_le32(iblock);
newex.ee_len = cpu_to_le16(max_blocks);
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru
 
 #define EXT_MAX_BLOCK  0x
 
-#define EXT_MAX_LEN    ((1UL << 15) - 1)
+/*
+ * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
+ * initialized extent. This is 2^15 and not (2^16 - 1), since we use the
+ * MSB of ee_len field in the extent datastructure to signify if this
+ * particular extent is an initialized extent or an uninitialized (i.e.
+ * preallocated).
+ * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an
+ * uninitialized extent.
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an
+ * uninitialized one. In other words, if MSB of ee_len is set, it is an
+ * uninitialized extent with only one special scenario when ee_len = 0x8000.
+ * In this case we can not have an uninitialized extent of zero length and
+ * thus we make it as a special case of initialized extent with 0x8000 length.
+ * This way we get better extent-to-group alignment for initialized extents.
+ * Hence, the maximum number of blocks we can have in an *initialized*
+ * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1 (32767).
+ */
+#define EXT_INIT_MAX_LEN   (1UL << 15)
+#define EXT_UNINIT_MAX_LEN (EXT_INIT_MAX_LEN - 1)

[PATCH 3/6][TAKE7] revalidate write permissions for fallocate

2007-07-13 Thread Amit K. Arora

From: David P. Quigley [EMAIL PROTECTED]

Revalidate the write permissions for fallocate(2), in case security policy has
changed since the files were opened.

Acked-by: James Morris [EMAIL PROTECTED]
Signed-off-by: David P. Quigley [EMAIL PROTECTED]

---
 fs/open.c |3 +++
 1 files changed, 3 insertions(+)

Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -407,6 +407,9 @@ asmlinkage long sys_fallocate(int fd, in
goto out;
if (!(file->f_mode & FMODE_WRITE))
goto out_fput;
+   ret = security_file_permission(file, MAY_WRITE);
+   if (ret)
+   goto out_fput;
 
inode = file->f_path.dentry->d_inode;
 

[PATCH 4/6][TAKE7] ext4: fallocate support in ext4

2007-07-13 Thread Amit K. Arora

From: Amit Arora [EMAIL PROTECTED]

fallocate support in ext4

This patch implements the ->fallocate() inode operation in ext4. With this
patch, users of ext4 file systems will be able to use the fallocate() system
call for persistent preallocation. The current implementation supports
preallocation only for regular files with extent maps (directories are not
supported yet); block-mapped files are not supported either.
Only the FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are supported for now.


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int ret = 0;
 
b1 = le32_to_cpu(newext->ee_block);
-   len1 = le16_to_cpu(newext->ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
int depth, len, err, next;
+   unsigned uninitialized = 0;
 
-   BUG_ON(newext->ee_len == 0);
+   BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);
ex = path[depth].p_ext;
BUG_ON(path[depth].p_hdr == NULL);
@@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
ext_debug("append %d block to %d:%d (from %llu)\n",
-   le16_to_cpu(newext->ee_len),
+

[PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-13 Thread Amit K. Arora

From: Amit Arora [EMAIL PROTECTED]

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow
applications to preallocate space for any file in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called ->fallocate().
Applications can use this feature to avoid fragmentation to a certain
extent and thus get faster access. With preallocation, applications
also get a guarantee of space for particular file(s), even if the
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. It has the advantage of working
on all file systems, but it is quite slow (since it writes zeroes to
each block that has to be preallocated). File systems
can do this more efficiently within the kernel by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and, in case the kernel/filesystem does not implement it, fall
back to the current implementation of writing zeroes to the new blocks.
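
The "try the system call first, then fall back to writing zeroes" behaviour
expected of posix_fallocate() could look roughly like the sketch below. It
assumes a present-day glibc that exposes a fallocate() wrapper under
_GNU_SOURCE (which did not exist when this patch was posted), and
sketch_preallocate() is a hypothetical helper, not glibc's code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: reserve [offset, offset + len) for fd. Tries
 * fallocate() first; if the kernel or file system lacks support, falls
 * back to writing zeroes over the range.
 * NOTE: the fallback overwrites the range, so this sketch is only safe
 * on a range that holds no data yet. Returns 0 on success, -1 on error.
 */
static int sketch_preallocate(int fd, off_t offset, off_t len)
{
	char zeroes[4096];
	off_t done = 0;

	assert(offset >= 0 && len > 0);
	if (fallocate(fd, 0, offset, len) == 0)
		return 0;
	if (errno != ENOSYS && errno != EOPNOTSUPP)
		return -1;	/* real failure, e.g. ENOSPC */

	memset(zeroes, 0, sizeof(zeroes));
	while (done < len) {
		size_t chunk = sizeof(zeroes);
		ssize_t n;

		if ((off_t)chunk > len - done)
			chunk = (size_t)(len - done);
		n = pwrite(fd, zeroes, chunk, offset + done);
		if (n <= 0)
			return -1;
		done += n;
	}
	return 0;
}
```

Only ENOSYS and EOPNOTSUPP trigger the slow path; other errors such as
ENOSPC are reported to the caller unchanged.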
ToDos:
1. Implementation on other architectures (other than i386, x86_64,
   and ppc). Patches for s390(x) and ia64 are already available from
   previous posts, but it was decided that they should be added later
   once fallocate is in the mainline. Hence not including those patches
   in this take.
2. A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3. Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+   .long sys_fallocate
Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
===
--- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+u32 lenhi, u32 lenlo)
+{
+   return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long 
high,
 unsigned long low)
 {
Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
.quad compat_sys_signalfd
.quad compat_sys_timerfd
.quad sys_eventfd
+   .quad sys32_fallocate
 ia32_syscall_end:
Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies the behavior of allocation.
+ * @offset: The offset within file, from where allocation is being
+ * requested. It should not have a negative value.
+ * @len: The amount of space in bytes to be allocated, from the offset.
+ *  This can not be zero or a negative value.
+ *
+ * This system call preallocates space for a file. The range of blocks
+ * allocated depends on the value of offset and len arguments provided
+ * by the user/application. With FALLOC_ALLOCATE or FALLOC_RESV_SPACE
+ * modes, if the system call succeeds, subsequent writes to the file in
+ * the given range (specified by offset & len) should not fail - even if
+ * the file system later becomes full. Hence the preallocation done is
+ * persistent (valid even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by
+ * the individual file systems will update the file size and/or
+ * ctime/mtime depending on the mode and also on the success of the
+ * operation.
+ *
+ * Note: In case the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ * 0   : On

[PATCH 0/6][TAKE7] fallocate system call

2007-07-13 Thread Amit K. Arora

This is the latest fallocate patchset and is based on 2.6.22.

* Following are the changes from TAKE6:
1) We now just have two modes (and no deallocation modes).
2) Updated the man page
3) Added a new patch submitted by David P. Quigley  (Patch 3/6).
4) Used EXT_INIT_MAX_LEN instead of 0x8000 in Patch 6/6.
5) Included below in the end is a small testcase to test fallocate.

* Following are the changes from TAKE5 to TAKE6:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can
   add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
   alignment in ext4 (Patch 6/6). Please refer to the following post:
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg02445.html
5) Renamed mode flags and values from FA_ to FALLOC_
6) Added manpage (updated version of the one initially submitted by
   David Chinner).
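
As background for the compat wrapper in item 2 above: a 32-bit caller passes
each 64-bit offset or length as two 32-bit halves, and the wrapper has to
rejoin them before calling sys_fallocate(). A minimal standalone sketch of
that step (sketch_join_halves() is a name invented for this example):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Join the two 32-bit halves that a 32-bit caller passes for one 64-bit
 * offset or length. The widening cast must happen before the shift,
 * because shifting a 32-bit value by 32 bits is undefined in C.
 */
static int64_t sketch_join_halves(uint32_t hi, uint32_t lo)
{
	return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

For example, hi = 1 and lo = 2 yield 0x100000002.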


Todos:
------
1) Implementation on other architectures (other than i386, x86_64,
   and ppc64). s390(x) and ia64 patches are ready and will be pushed
   by platform maintainers when fallocate is in mainline.
2) A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3) Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4) Patch to e2fsprogs to recognize and display uninitialized extents.


Following patches follow:
Patch 1/6 : manpage for fallocate
Patch 2/6 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/6 : revalidate write permissions for fallocate
Patch 4/6 : ext4: fallocate support in ext4
Patch 5/6 : ext4: write support for preallocated blocks
Patch 6/6 : ext4: change for better extent-to-group alignment

Note: Attached below is a small testcase to test fallocate. The __NR_fallocate
will need to be changed depending on the system call number in the kernel (it
may get changed due to merge) and also depending on the architecture.

--
Regards,
Amit Arora



#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>	/* syscall(), close(), lseek(), write() */

#include <linux/unistd.h>
#include <sys/vfs.h>
#include <sys/stat.h>

#define VERBOSE 0

#define __NR_fallocate 324

#define FALLOC_FL_KEEP_SIZE 0x01
#define FALLOC_ALLOCATE 0x0
#define FALLOC_RESV_SPACE   FALLOC_FL_KEEP_SIZE


int do_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
  int ret;

  if (VERBOSE)
    printf("Trying to preallocate blocks (offset=%llu, len=%llu)\n",
           offset, len);
  ret = syscall(__NR_fallocate, fd, mode, offset, len);

  if (ret < 0) {
    printf("SYSCALL: received error %d, ret=%d\n", errno, ret);
    close(fd);
    return(1);
  }

  if (VERBOSE)
    printf("fallocate system call succeeded!  ret=%d\n", ret);

  return ret;
}

int test_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
  int ret, blocks;
  struct stat statbuf1, statbuf2;

  fstat(fd, &statbuf1);

  ret = do_fallocate(fd, mode, offset, len);

  fstat(fd, &statbuf2);

  /* check file size after preallocation */
  if (mode == FALLOC_ALLOCATE) {
    if (!ret && statbuf1.st_size < (offset + len) &&
        statbuf2.st_size != (offset + len)) {
      printf("Error: fallocate succeeded, but the file size did not "
             "change, where it should have!\n");
      ret = 1;
    }
  } else if (statbuf1.st_size != statbuf2.st_size) {
    printf("Error : File size changed, when it should not have!\n");
    ret = 1;
  }

  /* st_blocks counts 512-byte units; convert to file system blocks */
  blocks = ((statbuf2.st_blocks - statbuf1.st_blocks) * 512) /
           statbuf2.st_blksize;

  /* Print report (values cast to long long to match the formats) */
  printf("# FALLOCATE TEST REPORT #\n");
  printf("\tNew blocks preallocated = %d.\n", blocks);
  printf("\tNumber of bytes preallocated = %lld\n",
         (long long)blocks * statbuf2.st_blksize);
  printf("\tOld file size = %lld, New file size %lld.\n",
         (long long)statbuf1.st_size, (long long)statbuf2.st_size);
  printf("\tOld num blocks = %lld, New num blocks %lld.\n",
         (long long)(statbuf1.st_blocks * 512) / 1024,
         (long long)(statbuf2.st_blocks * 512) / 1024);

  return ret;
}


int do_write(int fd, loff_t offset, loff_t len)
{
  int ret;
  char *buf;

  buf = (char *)malloc(len);
  if (!buf) {
    printf("error: malloc failed.\n");
    return(-1);
  }

  if (VERBOSE)
    printf("Trying to write to file (offset=%llu, len=%llu)\n",
           offset, len);

  ret = lseek(fd, offset, SEEK_SET);
  if (ret != offset) {
    printf("lseek() failed error=%d, ret=%d\n", errno, ret);
    free(buf);
    close(fd);
    return(-1);
  }

  ret = write(fd, buf, len);
  free(buf);	/* buffer is no longer needed after the write */
  if (ret != len) {
    printf("write() failed error=%d, ret=%d\n", errno, ret);
    close(fd);
    return(-1);
  }

  if (VERBOSE)
    printf("Write succeeded! Written %llu bytes ret=%d\n", len, ret);

  return ret;
}


int test_write(int fd, loff_t offset, loff_t len)
{
  int ret;

  ret = do_write(fd, offset, len);

Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-13 Thread Amit K. Arora

On Fri, Jul 13, 2007 at 02:21:19PM +0100, Christoph Hellwig wrote:
 On Fri, Jul 13, 2007 at 06:17:55PM +0530, Amit K. Arora wrote:
   /*
  + * sys_fallocate - preallocate blocks or free preallocated blocks
  + * @fd: the file descriptor
  + * @mode: mode specifies the behavior of allocation.
  + * @offset: The offset within file, from where allocation is being
  + * requested. It should not have a negative value.
  + * @len: The amount of space in bytes to be allocated, from the offset.
  + *  This can not be zero or a negative value.
 
 kerneldoc comments are for in-kernel APIs which syscalls aren't.  I'd say
 just remove this comment, the manpage is a much better documentation anyway.

Ok. I will remove this entire comment.
 
  + * TBD Generic fallocate to be added for file systems that do not
  + *  support fallocate.
 
 Please remove the comment, adding a generic fallback in kernelspace is a
 very dumb idea as we already discussed long time ago.

  --- linux-2.6.22.orig/include/linux/fs.h
  +++ linux-2.6.22/include/linux/fs.h
  @@ -266,6 +266,21 @@ extern int dir_notify_enable;
   #define SYNC_FILE_RANGE_WRITE  2
   #define SYNC_FILE_RANGE_WAIT_AFTER 4
   
  +/*
  + * sys_fallocate modes
  + * Currently sys_fallocate supports two modes:
  + * FALLOC_ALLOCATE :   This is the preallocate mode, using which an application
  + * may request reservation of space for a particular file.
  + * The file size will be changed if the allocation is
  + * beyond EOF.
  + * FALLOC_RESV_SPACE : This is same as the above mode, with only one difference
  + * that the file size will not be modified.
  + */
  +#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend/shrink size */
  +
  +#define FALLOC_ALLOCATE     0
  +#define FALLOC_RESV_SPACE   FALLOC_FL_KEEP_SIZE
 
 Just remove FALLOC_ALLOCATE, 0 flags should be the default.  I'm also
 not sure there is any point in having two namespaces now that we have a
 flags-based ABI.

Ok. Since we have only one flag (FALLOC_FL_KEEP_SIZE) and we do not want
to declare the default mode (FALLOC_ALLOCATE), we can _just_ have this
flag and remove the other mode too (FALLOC_RESV_SPACE).
Is this what you are suggesting ?
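
To illustrate what a single flag on top of a default mode means for
userspace, here is a small sketch. It assumes a present-day glibc, where
fallocate() and FALLOC_FL_KEEP_SIZE are available from <fcntl.h> under
_GNU_SOURCE (neither existed at the time of this mail), and
sketch_size_after_fallocate() is a hypothetical helper written for this
example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: preallocate 'len' bytes at offset 0 with the given
 * mode and report the resulting file size, or -1 if either call failed
 * (for example on a file system without fallocate support).
 */
static off_t sketch_size_after_fallocate(int fd, int mode, off_t len)
{
	struct stat st;

	assert(len > 0);
	if (fallocate(fd, mode, 0, len) != 0)
		return -1;
	if (fstat(fd, &st) != 0)
		return -1;
	return st.st_size;
}
```

On an empty file, mode 0 (the default) is expected to grow the size to len,
while FALLOC_FL_KEEP_SIZE reserves the blocks but leaves the size at 0.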

 Also please don't add this to fs.h.  fs.h is a complete mess and the
 falloc flags are a new user ABI.  Add a linux/falloc.h instead which can
 be added to headers-y so the ABI constant can be exported to userspace.

Do we need a header file just to declare one flag, i.e.
FALLOC_FL_KEEP_SIZE (since now there is no point in declaring the two
modes)? If linux/fs.h is not a good place, will asm-generic/fcntl.h
be a sane place for this flag?

Thanks!
--
Regards,
Amit Arora

Re: [PATCH 3/6][TAKE7] revalidate write permissions for fallocate

2007-07-13 Thread Amit K. Arora

On Fri, Jul 13, 2007 at 02:21:37PM +0100, Christoph Hellwig wrote:
 On Fri, Jul 13, 2007 at 06:18:47PM +0530, Amit K. Arora wrote:
  From: David P. Quigley [EMAIL PROTECTED]
  
  Revalidate the write permissions for fallocate(2), in case security policy
  has changed since the files were opened.
  
  Acked-by: James Morris [EMAIL PROTECTED]
  Signed-off-by: David P. Quigley [EMAIL PROTECTED]
 
 This should be merged into the main falloc patch.

Ok. Will merge it...

--
Regards,
Amit Arora

[PATCH 0/5][TAKE8] fallocate system call

2007-07-13 Thread Amit K. Arora

This is the latest fallocate patchset and is based on 2.6.22.

* Following are the changes from TAKE7:
1) Updated the man page.
2) Merged revalidate write permissions patch with the main falloc patch.
3) Added linux/falloc.h and moved FALLOC_FL_KEEP_SIZE flag to it.
   Also removed the two modes (FALLOC_ALLOCATE and FALLOC_RESV_SPACE).
4) Removed comment above sys_fallocate definition.
5) Updated the testcase below to use FALLOC_FL_KEEP_SIZE flag instead
   of previous two modes.

* Following are the changes from TAKE6:
1) We now just have two modes (and no deallocation modes).
2) Updated the man page
3) Added a new patch submitted by David P. Quigley  (Patch 3/6).
4) Used EXT_INIT_MAX_LEN instead of 0x8000 in Patch 6/6.
5) Included below at the end is a small testcase to test fallocate.


* Following are the changes from TAKE5 to TAKE6:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can
   add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
   alignment in ext4 (Patch 6/6). Please refer to the following post:
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg02445.html
5) Renamed mode flags and values from FA_ to FALLOC_
6) Added manpage (updated version of the one initially submitted by
   David Chinner).


Todos:
-
1 Implementation on other architectures (other than i386, x86_64,
   and ppc64). s390(x) and ia64 patches are ready and will be pushed
   by platform maintainers when fallocate is in mainline.
2 A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3 Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4 Patch to e2fsprogs to recognize and display uninitialized extents.


Following patches follow:
Patch 1/5 : manpage for fallocate
Patch 2/5 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/5 : ext4: fallocate support in ext4
Patch 4/5 : ext4: write support for preallocated blocks
Patch 5/5 : ext4: change for better extent-to-group alignment

**
Attached below is a small testcase to test fallocate. The __NR_fallocate will
need to be changed depending on the system call number in the kernel (it may
get changed due to merge) and also depending on the architecture.

--
Regards,
Amit Arora



#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>

#include <linux/unistd.h>
#include <sys/vfs.h>
#include <sys/stat.h>

#define VERBOSE 0

#define __NR_fallocate 324

#define FALLOC_FL_KEEP_SIZE 0x01

int do_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
  int ret;

  if (VERBOSE)
        printf("Trying to preallocate blocks (offset=%llu, len=%llu)\n",
                (unsigned long long)offset, (unsigned long long)len);
  ret = syscall(__NR_fallocate, fd, mode, offset, len);

  if (ret < 0) {
        printf("SYSCALL: received error %d, ret=%d\n", errno, ret);
        close(fd);
        return(1);
  }

  if (VERBOSE)
        printf("fallocate system call succeeded!  ret=%d\n", ret);

  return ret;
}

int test_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
  int ret, blocks;
  struct stat statbuf1, statbuf2;

  fstat(fd, &statbuf1);

  ret = do_fallocate(fd, mode, offset, len);

  fstat(fd, &statbuf2);

  /* check file size after preallocation */
  if (!mode) {
        if (!ret && statbuf1.st_size < (offset + len) &&
                statbuf2.st_size != (offset + len)) {
                printf("Error: fallocate succeeded, but the file size did not "
                        "change, where it should have!\n");
                ret = 1;
        }
  } else if (statbuf1.st_size != statbuf2.st_size) {
        printf("Error : File size changed, when it should not have!\n");
        ret = 1;
  }

  blocks = ((statbuf2.st_blocks - statbuf1.st_blocks) * 512) /
                statbuf2.st_blksize;

  /* Print report (values are cast so the format specifiers match) */
  printf("# FALLOCATE TEST REPORT #\n");
  printf("\tNew blocks preallocated = %d.\n", blocks);
  printf("\tNumber of bytes preallocated = %lld\n",
                (long long)blocks * statbuf2.st_blksize);
  printf("\tOld file size = %lld, New file size %lld.\n",
                (long long)statbuf1.st_size, (long long)statbuf2.st_size);
  printf("\tOld num blocks = %lld, New num blocks %lld.\n",
                (long long)(statbuf1.st_blocks * 512) / 1024,
                (long long)(statbuf2.st_blocks * 512) / 1024);

  return ret;
}


int do_write(int fd, loff_t offset, loff_t len)
{
  int ret;
  char *buf;

  buf = (char *)malloc(len);
  if (!buf) {
        printf("error: malloc failed.\n");
        return(-1);
  }

  if (VERBOSE)
        printf("Trying to write to file (offset=%llu, len=%llu)\n",
                (unsigned long long)offset, (unsigned long long)len);

  ret = lseek(fd, offset, SEEK_SET);
  if (ret != offset) {
        printf("lseek() failed error=%d, ret=%d\n", errno, ret);
        close(fd);
        return(-1);
  }

  ret = write(fd, buf, len);
  if (ret != len) {
        printf("write() failed error=%d, ret=%d\n", errno, ret);

[PATCH 1/5][TAKE8] manpage for fallocate

2007-07-13 Thread Amit K. Arora

Following is the modified version of the manpage originally submitted by
David Chinner. Please use `nroff -man fallocate.2 | less` to view.

Following changed from TAKE7:
* Removed FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes.
* Described only single flag for mode, i.e. FALLOC_FL_KEEP_SIZE.
* s/zero blocks/zeroed blocks/ as suggested by Dave.
* Included linux/falloc.h instead of fcntl.h.

Following changed from TAKE6 to TAKE7:
Included changes suggested by Heikki Orsila and Barry Naujok.


.TH fallocate 2
.SH NAME
fallocate \- manipulate file space
.SH SYNOPSIS
.nf
.B #include <linux/falloc.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t " len ");"
.SH DESCRIPTION
The
.B fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there is only one flag supported for the mode argument.
.TP
.B FALLOC_FL_KEEP_SIZE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space.  Even if the size of the file is less than
.IR offset + len ,
the file size is not changed. This allows allocation of zeroed blocks beyond
the end of file and is useful for optimising append workloads.
.PP
If
.B FALLOC_FL_KEEP_SIZE
flag is not specified in the mode argument, the default behavior of this system
call is almost same as when this flag is passed. The only difference is that
on success, the file size will be changed if the
.IR offset + len
is greater than the file size. This default behavior closely resembles
.BR posix_fallocate (3)
and is intended as a method of optimally implementing this function.
.PP
.B fallocate
may allocate a larger range than was specified.
.SH RETURN VALUE
.B fallocate
returns zero on success, or an error number on failure.
Note that
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal to 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd .
.TP
.B ESPIPE
.I fd
refers to a pipe or FIFO.
.TP
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.TP
.B EINTR
A signal was caught during execution.
.TP
.B EIO
An I/O error occurred while reading from or writing to a file system.
.TP
.B EOPNOTSUPP
The mode is not supported on the file descriptor.
.SH AVAILABILITY
The
.B fallocate
system call is available since 2.6.XX
.SH SEE ALSO
.BR posix_fallocate (3),
.BR posix_fadvise (3),
.BR ftruncate (2).
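
The EINVAL and EOPNOTSUPP conditions listed under ERRORS can be sketched as a
user-space pre-check. This is only an illustration: check_fallocate_args() is a
hypothetical helper that is not part of the patchset, and it rejects any mode
bit other than FALLOC_FL_KEEP_SIZE, which is an assumption of this sketch and
slightly stricter than the mode check in patch 2/5.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define FALLOC_FL_KEEP_SIZE 0x01 /* value used by the testcase in this thread */

/*
 * Hypothetical helper: returns 0 if the arguments look valid, or a
 * negative errno value mirroring the ERRORS section above.
 */
int check_fallocate_args(int mode, int64_t offset, int64_t len)
{
    /* offset must not be negative, len must be strictly positive */
    if (offset < 0 || len <= 0)
        return -EINVAL;

    /* assumption: any flag other than FALLOC_FL_KEEP_SIZE is unsupported */
    if (mode & ~FALLOC_FL_KEEP_SIZE)
        return -EOPNOTSUPP;

    return 0;
}
```

A caller would run this before issuing the syscall and treat a non-zero result
as the error the kernel is expected to report for the same arguments.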

[PATCH 2/5][TAKE8] fallocate() implementation in i386, x86_64 and powerpc

2007-07-13 Thread Amit K. Arora

From: Amit Arora [EMAIL PROTECTED]

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called ->fallocate().
Applications can use this feature to avoid fragmentation to a certain
extent and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s), even if the
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Although it has the advantage of working
on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). File systems
can do this more efficiently within the kernel by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and, in case the kernel/filesystem does not implement it, to fall
back to the current implementation of writing zeroes to the new blocks.

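
The "try the syscall first, fall back to writing zeroes" behaviour described
above could look roughly like the sketch below. It is not glibc's code: the
prealloc_fn callback type, the function names and the stub are hypothetical,
and the slow path assumes the range holds no existing data that must be
preserved (it simply overwrites the range with zeroes).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fast path: returns 0 on success, -1 with errno set. */
typedef int (*prealloc_fn)(int fd, off_t offset, off_t len);

/* Slow path: write zeroes over [offset, offset + len). */
int zero_fill_range(int fd, off_t offset, off_t len)
{
    char buf[4096];

    memset(buf, 0, sizeof(buf));
    while (len > 0) {
        size_t chunk = len < (off_t)sizeof(buf) ? (size_t)len : sizeof(buf);
        ssize_t n = pwrite(fd, buf, chunk, offset);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same chunk */
            return -1;
        }
        if (n == 0) {
            errno = EIO;        /* no progress, give up */
            return -1;
        }
        offset += n;
        len -= n;
    }
    return 0;
}

/* Try the fast path; fall back only if it is not implemented. */
int preallocate_with_fallback(int fd, off_t offset, off_t len, prealloc_fn fast)
{
    assert(offset >= 0 && len > 0);

    if (fast != NULL) {
        if (fast(fd, offset, len) == 0)
            return 0;
        if (errno != ENOSYS && errno != EOPNOTSUPP)
            return -1;          /* real failure such as ENOSPC */
    }
    return zero_fill_range(fd, offset, len);
}

/* Stub standing in for a kernel/filesystem without fallocate support. */
int fast_path_unsupported(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    errno = ENOSYS;
    return -1;
}

/* Small helper so callers can check the resulting file size. */
off_t file_size(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}
```

Errors other than ENOSYS/EOPNOTSUPP are passed through on purpose, so a caller
still sees ENOSPC from the fast path instead of a slow retry.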
ToDos:
1. Implementation on other architectures (other than i386, x86_64,
   and ppc). Patches for s390(x) and ia64 are already available from
   previous posts, but it was decided that they should be added later
   once fallocate is in the mainline. Hence not including those patches
   in this take.
2. Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

CHANGELOG:
-
Following changed from TAKE7:
1. Added linux/falloc.h and moved FALLOC_FL_KEEP_SIZE flag to it.
2. Removed the two modes (FALLOC_ALLOCATE and FALLOC_RESV_SPACE).
3. Merged revalidate write permissions patch from David P. Quigley
   to this patch.
4. Deleted comment above sys_fallocate definition, as suggested by Christoph.


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+   .long sys_fallocate
Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
===
--- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+u32 lenhi, u32 lenlo)
+{
+   return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 unsigned long low)
 {
Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
.quad compat_sys_signalfd
.quad compat_sys_timerfd
.quad sys_eventfd
+   .quad sys32_fallocate
 ia32_syscall_end:
Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -26,6 +26,7 @@
 #include <linux/syscalls.h>
 #include <linux/rcupdate.h>
 #include <linux/audit.h>
+#include <linux/falloc.h>
 
 int vfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
@@ -352,6 +353,64 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+   if (offset < 0 || len <= 0)
+   goto out;
+
+   /* Return error if mode is not supported */
+   ret = -EOPNOTSUPP;
+   if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
+   goto out;
+
+   ret = -EBADF;
+   file = fget(fd);
+   if (!file)
+   goto out;
+   if (!(file->f_mode & FMODE_WRITE))
+   goto out_fput;
+   /*
+* Revalidate the write permissions, in case security policy has
+* changed since the files were opened.
+*/
+   ret = security_file_permission(file, MAY_WRITE);
+   if (ret)
+   goto out_fput;
+
+   inode = file->f_path.dentry->d_inode;
+
+   ret = -ESPIPE;
+   if (S_ISFIFO(inode->i_mode))
+   goto out_fput;
+
+   ret = -ENODEV;
+   /*
+* Let individual file system

[PATCH 3/5][TAKE8] ext4: fallocate support in ext4

2007-07-13 Thread Amit K. Arora

From: Amit Arora [EMAIL PROTECTED]

fallocate support in ext4

This patch implements the ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use the fallocate() system
call for persistent preallocation. The current implementation only supports
preallocation for regular files (directories are not supported yet)
with extent maps. This patch does not currently support block-mapped files.
Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of
now.

CHANGELOG:
-
Following changed from TAKE7:
1. Removed usage of FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes and
   used FALLOC_FL_KEEP_SIZE mode flag instead.
2. Included  linux/falloc.h new header file, which defines above flag.


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -39,6 +39,7 @@
 #include <linux/quotaops.h>
 #include <linux/string.h>
 #include <linux/slab.h>
+#include <linux/falloc.h>
 #include <linux/ext4_fs_extents.h>
 #include <asm/uaccess.h>
 
@@ -282,7 +283,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -305,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -425,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -686,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1115,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1144,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int ret = 0;
 
b1 = le32_to_cpu(newext->ee_block);
-   len1 = le16_to_cpu(newext->ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1191,8 +1204,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
int depth, len, err, next;
+   unsigned uninitialized = 0;
 
-   BUG_ON(newext->ee_len == 0);
+   BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);

[PATCH 4/5][TAKE8] ext4: write support for preallocated blocks

2007-07-13 Thread Amit K. Arora

From:  Amit Arora [EMAIL PROTECTED]

write support for preallocated blocks

This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.
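
To make the "up to three extents" case concrete, here is a small stand-alone
sketch of the arithmetic only. It is not the kernel function from this patch:
struct piece and split_uninit_extent() are hypothetical names, block numbers
are plain integers, and journalling, tree updates and merging are left out.

```c
#include <assert.h>
#include <stdint.h>

/* One resulting extent: logical start block, length, and state. */
struct piece {
    uint32_t start;
    uint32_t len;
    int uninitialized;  /* 1 = still preallocated, 0 = now initialized */
};

/*
 * Split the uninitialized extent [ee_block, ee_block + ee_len) around a
 * write of nblocks blocks starting at iblock. A write running past the
 * end of the extent is clamped to the extent. Returns the number of
 * pieces stored in out[] (1, 2 or 3).
 */
int split_uninit_extent(uint32_t ee_block, uint32_t ee_len,
                        uint32_t iblock, uint32_t nblocks,
                        struct piece out[3])
{
    uint64_t end = (uint64_t)ee_block + ee_len;   /* one past the extent */
    uint64_t wend = (uint64_t)iblock + nblocks;   /* one past the write */
    int n = 0;

    assert(ee_len > 0 && nblocks > 0);
    assert(iblock >= ee_block && iblock < end);

    if (wend > end)
        wend = end;

    /* blocks before the write stay uninitialized */
    if (iblock > ee_block)
        out[n++] = (struct piece){ ee_block, iblock - ee_block, 1 };

    /* the written range becomes an initialized extent */
    out[n++] = (struct piece){ iblock, (uint32_t)(wend - iblock), 0 };

    /* blocks after the write stay uninitialized */
    if (wend < end)
        out[n++] = (struct piece){ (uint32_t)wend, (uint32_t)(end - wend), 1 };

    return n;
}
```

A write covering the whole extent yields one piece, a write at either end
yields two, and a write in the middle yields three.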

CHANGELOG:
-
This patch did not change from TAKE7 (besides offsets ;).


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1141,6 +1141,53 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the ex extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass ex - 1 as argument instead of ex.
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done = 0;
+   int uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+   while (ex < EXT_LAST_EXTENT(eh)) {
+   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+   break;
+   /* merge with next extent! */
+   if (ext4_ext_is_uninitialized(ex))
+   uninitialized = 1;
+   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+   + ext4_ext_get_actual_len(ex + 1));
+   if (uninitialized)
+   ext4_ext_mark_uninitialized(ex);
+
+   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+   len = (EXT_LAST_EXTENT(eh) - ex - 1)
+   * sizeof(struct ext4_extent);
+   memmove(ex + 1, ex + 2, len);
+   }
+   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+   merge_done = 1;
+   WARN_ON(eh->eh_entries == 0);
+   if (!eh->eh_entries)
+   ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+  "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+   }
+
+   return merge_done;
+}
+
+/*
  * check if a portion of the newext extent overlaps with an
  * existing extent.
  *
@@ -1328,25 +1375,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-   while (nearex < EXT_LAST_EXTENT(eh)) {
-   if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-   break;
-   /* merge with next extent! */
-   if (ext4_ext_is_uninitialized(nearex))
-   uninitialized = 1;
-   nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-   + ext4_ext_get_actual_len(nearex + 1));
-   if (uninitialized)
-   ext4_ext_mark_uninitialized(nearex);
-
-   if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-   len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-   * sizeof(struct ext4_extent);
-   memmove(nearex + 1, nearex + 2, len);
-   }
-   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-   BUG_ON(eh->eh_entries == 0);
-   }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -2012,15 +2041,158 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Someone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+   struct ext4_ext_path *path,
+   ext4_fsblk_t iblock,
+   unsigned long max_blocks)
+{
+   struct ext4_extent *ex, newex;
+   struct ext4_extent *ex1 = NULL;
+   struct ext4_extent *ex2 = NULL;
+   struct ext4_extent *ex3 = NULL;
+   struct ext4_extent_header *eh;
+   unsigned int allocated, ee_block, ee_len, depth;
+   ext4_fsblk_t newblock;
+   int err = 0;
+   int ret = 0;
+
+

[PATCH 5/5][TAKE8] ext4: change for better extent-to-group alignment

2007-07-13 Thread Amit K. Arora

From: Amit Arora [EMAIL PROTECTED]

Change on-disk format for extent to represent uninitialized/initialized extents

This change was suggested by Andreas Dilger. 
This patch changes the EXT_MAX_LEN value and extent code which marks/checks
uninitialized extents. With this change it will be possible to have
initialized extents with 2^15 blocks (earlier the max blocks we could have
was 2^15 - 1). This way we can have better extent-to-group alignment.
Now, maximum number of blocks we can have in an initialized extent is 2^15
and in an uninitialized extent is 2^15 - 1.
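
The encoding described above can be sketched in a few lines of plain C. The
two limits use the names from this patch; the helper functions are simplified,
hypothetical stand-ins for the kernel's ext4_ext_* helpers and work on
host-endian values, so the le16 conversions are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define EXT_INIT_MAX_LEN   (1U << 15)               /* 32768 blocks */
#define EXT_UNINIT_MAX_LEN (EXT_INIT_MAX_LEN - 1)   /* 32767 blocks */

/* ee_len above 0x8000 means "uninitialized"; exactly 0x8000 is the
 * special case of a maximum-length initialized extent. */
int ext_is_uninitialized(uint16_t ee_len)
{
    return ee_len > EXT_INIT_MAX_LEN;
}

/* Number of blocks actually covered by the extent. */
uint16_t ext_actual_len(uint16_t ee_len)
{
    return ee_len <= EXT_INIT_MAX_LEN
        ? ee_len
        : (uint16_t)(ee_len - EXT_INIT_MAX_LEN);
}

/* Encode an uninitialized extent of len blocks by setting the MSB. */
uint16_t ext_mark_uninitialized(uint16_t len)
{
    /* a zero-length uninitialized extent cannot be represented */
    assert(len >= 1 && len <= EXT_UNINIT_MAX_LEN);
    return (uint16_t)(len | EXT_INIT_MAX_LEN);
}
```

The value 0x8000 is what buys the extra block: it would otherwise mean
"uninitialized, length zero", which is never needed.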


CHANGELOG:
-
This patch did not change from TAKE7 (besides offsets ;).

Following changed from TAKE6 to TAKE7:
1. Taken care of Andreas's suggestion of using EXT_INIT_MAX_LEN instead of
   0x8000 at some places.

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1107,7 +1107,7 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   unsigned short ext1_ee_len, ext2_ee_len;
+   unsigned short ext1_ee_len, ext2_ee_len, max_len;
 
/*
 * Make sure that either both extents are uninitialized, or
@@ -1116,6 +1116,11 @@ ext4_can_extents_be_merged(struct inode 
if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
return 0;
 
+   if (ext4_ext_is_uninitialized(ex1))
+   max_len = EXT_UNINIT_MAX_LEN;
+   else
+   max_len = EXT_INIT_MAX_LEN;
+
ext1_ee_len = ext4_ext_get_actual_len(ex1);
ext2_ee_len = ext4_ext_get_actual_len(ex2);
 
@@ -1128,7 +1133,7 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > max_len)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1815,7 +1820,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
ex->ee_block = cpu_to_le32(block);
ex->ee_len = cpu_to_le16(num);
-   if (uninitialized)
+   /*
+* Do not mark uninitialized if all the blocks in the
+* extent have been removed.
+*/
+   if (uninitialized && num)
ext4_ext_mark_uninitialized(ex);
 
err = ext4_ext_dirty(handle, inode, path + depth);
@@ -2308,6 +2317,19 @@ int ext4_ext_get_blocks(handle_t *handle
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
 
+   /*
+* See if request is beyond maximum number of blocks we can have in
+* a single extent. For an initialized extent this limit is
+* EXT_INIT_MAX_LEN and for an uninitialized extent this limit is
+* EXT_UNINIT_MAX_LEN.
+*/
+   if (max_blocks > EXT_INIT_MAX_LEN &&
+   create != EXT4_CREATE_UNINITIALIZED_EXT)
+   max_blocks = EXT_INIT_MAX_LEN;
+   else if (max_blocks > EXT_UNINIT_MAX_LEN &&
+create == EXT4_CREATE_UNINITIALIZED_EXT)
+   max_blocks = EXT_UNINIT_MAX_LEN;
+
/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
newex.ee_block = cpu_to_le32(iblock);
newex.ee_len = cpu_to_le16(max_blocks);
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru
 
 #define EXT_MAX_BLOCK  0x
 
-#define EXT_MAX_LEN ((1UL << 15) - 1)
+/*
+ * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
+ * initialized extent. This is 2^15 and not (2^16 - 1), since we use the
+ * MSB of ee_len field in the extent datastructure to signify if this
+ * particular extent is an initialized extent or an uninitialized (i.e.
+ * preallocated).
+ * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an
+ * uninitialized extent.
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an
+ * uninitialized one. In other words, if MSB of ee_len is set, it is an
+ * uninitialized extent with only one special scenario when ee_len = 0x8000.
+ * In this case we can not have an uninitialized extent of zero length and
+ * thus we make it as a special case of initialized extent with 0x8000 length.
+ * This way we get better extent-to-group alignment for initialized extents.
+ * Hence, the maximum number of blocks we can have in an *initialized*
+ * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-12 Thread Amit K. Arora

On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote:
 On Wed, Jul 11, 2007 at 10:03:12AM +0100, Christoph Hellwig wrote:
  On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
   Well, if you see the modes proposed using above flags :
   
   #define FA_ALLOCATE   0
   #define FA_DEALLOCATE FA_FL_DEALLOC
   #define FA_RESV_SPACE FA_FL_KEEP_SIZE
   #define FA_UNRESV_SPACE   (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
   
   FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
   for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
   flag. Hence preallocation will never delete data.
   This mode is required only for FA_UNRESV_SPACE, which is a deallocation
   mode, to support any existing XFS aware applications/usage-scenarios.
  
  Sorry, but this doesn't make any sense.  There is no need to put every
  feature in the XFS ioctls in the syscalls.  The XFS ioctls will need to
  be supported forever anyway - as I suggested before they really should
  be moved to generic code.
  
  What needs to be supported is what makes sense as an interface.
  A punch a hole interface does make sense, but trying to hack this into
  a preallocation system call is just madness.  We're not IRIX or windows
  that fit things into random subcall just because there was some space
  left to squeeze them in.
  
  FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
  FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */

  NACK to these as well.  If i_size changes c/mtime need updates, if the size
  doesn't change they don't.  No need to add more flags for this.
   
   This requirement was from the point of view of HSM applications. Hope
   you saw Andreas' previous post and are keeping that in mind.
  
  HSMs needs this basically for every system call, which screams for an
  open flag like O_INVISIBLE anyway.  Adding this in a generic way is
  a good idea, but hacking bits and pieces that won't fit into the global
  design is completely wrong.
 
 Why don't we just merge the interface for preallocation (essentially
 enough to satisfy posix_fallocate() and the simple XFS requirement for 
 space reservation without changing file size), which there is clear agreement
 on (I hope :)).  After all, this was all that we set out to do when we
 started.

As you suggest, let us just have two modes for the time being:

#define FALLOC_ALLOCATE 0x1
#define FALLOC_ALLOCATE_KEEP_SIZE   0x2

As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it
will result in file size not being changed even if the preallocation is
beyond EOF.
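
To spell out the intended difference between the two modes, the expected file
size after a successful call can be written as a tiny function. This is just
a sketch of the semantics described here; size_after_prealloc() is a
hypothetical name and the mode values are the ones proposed in this mail.

```c
#include <assert.h>
#include <stdint.h>

#define FALLOC_ALLOCATE            0x1
#define FALLOC_ALLOCATE_KEEP_SIZE  0x2

/* Expected i_size after preallocating [offset, offset + len). */
int64_t size_after_prealloc(int mode, int64_t old_size,
                            int64_t offset, int64_t len)
{
    int64_t end = offset + len;

    assert(offset >= 0 && len > 0);

    /* KEEP_SIZE never changes the size, even for blocks beyond EOF */
    if (mode == FALLOC_ALLOCATE_KEEP_SIZE)
        return old_size;

    /* the default mode only ever grows the file */
    return end > old_size ? end : old_size;
}
```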

 And leave all the dealloc/punch/hsm type features for separate future patches/
 debates, those really shouldn't hold up the basic fallocate interface.

I agree.

 I agree with Christoph that we are just diverging too much in trying to
 club those decisions here.
 
 Dave, Andreas, Ted ?
 
 Regards
 Suparna

--
Regards,
Amit Arora

Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-11 Thread Amit K. Arora

On Wed, Jul 11, 2007 at 12:10:34PM +1000, Stephen Rothwell wrote:
 On Wed, 11 Jul 2007 01:50:00 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c
  +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
  @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd, 
  return sys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo,
  len, advice);
   }
  +
  +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
  +   unsigned offset_hi, unsigned len_lo,
  +   unsigned len_hi)
 
 Please call this compat_sys_fallocate in line with the powerpc version -
 it gives us a hint that maybe we should think about how to consolidate
 them.  I know other stuff in that file is called sys32_ ... but it is time
 for a change :-)

I think this can be handled as a separate patch once this patchset
is in mainline, since we will need to do this anyway for the other sys32_
calls which are already there...
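
For readers following the sys32_/compat_ discussion: on a 32-bit ABI each
64-bit argument arrives as two 32-bit halves, and the wrapper only has to join
them before calling the native syscall. A minimal user-space sketch of that
join, with hypothetical helper names that do not appear in the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rebuild a 64-bit offset or length from its two 32-bit halves. */
int64_t join_hi_lo(uint32_t hi, uint32_t lo)
{
    /* widen before shifting: shifting a 32-bit value by 32 is undefined */
    return (int64_t)(((uint64_t)hi << 32) | lo);
}

/* Inverse operation, as a 32-bit caller would prepare the arguments. */
void split_hi_lo(int64_t value, uint32_t *hi, uint32_t *lo)
{
    assert(hi != NULL && lo != NULL);
    *hi = (uint32_t)((uint64_t)value >> 32);
    *lo = (uint32_t)((uint64_t)value & 0xffffffffu);
}
```

The order in which the halves are passed differs between the powerpc and
x86_64 wrappers quoted in this thread, which is one more reason a shared
helper would be nice to have.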

--
Regards,
Amit Arora



Re: ext4-patch-queue rebased to 2.6.22

2007-07-10 Thread Amit K. Arora

On Tue, Jul 10, 2007 at 11:09:39AM -0600, Andreas Dilger wrote:
 On Jul 10, 2007  20:24 +0530, Amit K. Arora wrote:
  On Mon, Jul 09, 2007 at 01:37:56PM -0400, Theodore Ts'o wrote:
   So we're just waiting for Amit to make the minor on-disk format change
   Andreas suggested before we push to Linus.
  
  2. Added a new patch ext4-fallocate-8-new-ondisk-format and updated
 the series file. This patch, as suggested by Andreas, will allow
 an initialized extent to be of max 2^15 length. Main purpose of this
 change is to have a better extent-to-group alignment.
 For uninitialized extents the max length remains same - i.e. 2^15 - 1.
 
 One tiny change I'd ask for in this patch (it isn't critical to get in
 before the upstream submission as it is only code style) is instead of
 using (EXT_MAX_LEN - 1) for uninitialized extents, instead use a separate
 #define EXT_UNINIT_MAX_LEN (EXT_MAX_LEN - 1) and use that in the code.
 While a minor change, this localizes the knowledge of the maximum length
 of uninitialized extents into just one place - right after the maximum
 length of initialized extents.
 
 It might even make sense to change the other #define to be called
 EXT_INIT_MAX_LEN so people have to think about this when using the #define.

Done. Changes are in ext4 patch queue.
Can you please have a quick look and see if this is what you preferred ?

--
Regards,
Amit Arora

[PATCH 0/7][TAKE6] fallocate system call

2007-07-10 Thread Amit K. Arora

This is the latest fallocate patchset and is rebased to 2.6.22.

Following are the changes from TAKE5:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can
   add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
   alignment in ext4 (Patch 6/6). Please refer to the following post:
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg02445.html
5) Renamed mode flags and values from FA_ to FALLOC_
6) Added manpage (updated version of the one initially submitted by
   David Chinner).


Todos:
-
1 Implementation on other architectures (other than i386, x86_64,
   and ppc64). s390(x) and ia64 patches are ready and will be pushed
   by platform maintainers when fallocate is in mainline.
2 A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3 Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4 A testcase to test the system call. Will post it soon.


Following patches follow:
Patch 1/7 : manpage for fallocate
Patch 2/7 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/7 : support new modes in fallocate
Patch 4/7 : ext4: fallocate support in ext4
Patch 5/7 : ext4: write support for preallocated blocks
Patch 6/7 : ext4: support new modes in ext4
Patch 7/7 : ext4: change for better extent-to-group alignment


--
Regards,
Amit Arora

[PATCH 1/7] manpage for fallocate

2007-07-10 Thread Amit K. Arora

Following is the modified version of the manpage originally submitted by
David Chinner. Please use `nroff -man fallocate.2 | less` to view.


.TH fallocate 2
.SH NAME
fallocate \- allocate or remove file space
.SH SYNOPSIS
.nf
.B #include <sys/syscall.h>
.PP
.BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len);"
.Op
.SH DESCRIPTION
The
.BR fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.IR offset
and continuing for
.IR len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are four modes:
.TP
.B FALLOC_ALLOCATE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space.  If the size of the file is less than
.IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
.B FALLOC_ALLOCATE
closely resembles
.B posix_fallocate(3)
and is intended as a method of optimally implementing this function.
.B FALLOC_ALLOCATE
may allocate a larger range than was specified.
.TP
.B FALLOC_RESV_SPACE
provides the same functionality as
.B FALLOC_ALLOCATE
except it does not ever change the file size. This allows allocation
of zero blocks beyond the end of file and is useful for optimising
append workloads.
.TP
.B FALLOC_DEALLOCATE
removes any preallocated space within the given range. The file size
may change if deallocation is towards the end of the file.
.TP
.B FALLOC_UNRESV_SPACE
removes the underlying disk space within the given range. The disk space
shall be removed regardless of its contents so both allocated space
from
.B FALLOC_ALLOCATE
and
.B FALLOC_RESV_SPACE
as well as from
.B write(3)
will be removed.
.B FALLOC_UNRESV_SPACE
shall never remove disk blocks outside the range specified.
.B FALLOC_UNRESV_SPACE
shall never change the file size. If the file size must change
when deallocating blocks from an offset to the end of the file
(or beyond the end of the file),
.B ftruncate64(3)
or
.B FALLOC_DEALLOCATE
should be used.

.SH RETURN VALUE
.BR fallocate()
returns zero on success, or an error number on failure.
Note that
.IR errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.I offset+len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
or
.I len
was less than 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd.
.TP
.B ESPIPE
.I fd
refers to a pipe or FIFO.
.TP
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.SH AVAILABILITY
The
.BR fallocate ()
system call is available since 2.6.XX
.SH SEE ALSO
.BR syscall (2),
.BR posix_fadvise (3),
.BR ftruncate (3)

[PATCH 4/7] ext4: fallocate support in ext4

2007-07-10 Thread Amit K. Arora

From: Amit Arora [EMAIL PROTECTED]

fallocate support in ext4

This patch implements the ->fallocate() inode operation in ext4. With this
patch, users of ext4 file systems will be able to use the fallocate()
system call for persistent preallocation. The current implementation only
supports preallocation for regular files (directories are not supported as
of this date) with extent maps. This patch does not currently support
block-mapped files. Only FA_ALLOCATE mode is supported as of now.
Supporting FA_DEALLOCATE mode is a ToDo item.


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c 2007-07-09 15:24:33.0 -0700
+++ linux-2.6.22/fs/ext4/extents.c  2007-07-09 15:24:39.0 -0700
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int ret = 0;
 
b1 = le32_to_cpu(newext->ee_block);
-   len1 = le16_to_cpu(newext->ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
int depth, len, err, next;
+   unsigned uninitialized = 0;
 
-   BUG_ON(newext->ee_len == 0);
+   BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);
ex = path[depth].p_ext;
BUG_ON(path[depth].p_hdr == NULL);
@@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
ext_debug("append %d block to %d:%d (from
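
The merge test that this patch modifies can be restated outside the kernel as a small sketch. Everything below is hypothetical and for illustration only: `struct demo_extent` is a flattened stand-in for `struct ext4_extent` (which really stores little-endian fields and a split physical block number), and `DEMO_MAX_LEN` mirrors the `EXT_MAX_LEN` value used at this stage of the series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified extent record (not the on-disk layout). */
struct demo_extent {
    uint32_t lblock;  /* first logical block covered */
    uint64_t pblock;  /* first physical block */
    uint16_t len;     /* actual length in blocks, flag bit already removed */
    int uninit;       /* nonzero if preallocated but not yet written */
};

#define DEMO_MAX_LEN ((1U << 15) - 1)  /* mirrors EXT_MAX_LEN in this patch */

/* Returns 1 if b can be merged onto the end of a, 0 otherwise. */
static int demo_can_merge(const struct demo_extent *a,
                          const struct demo_extent *b)
{
    assert(a && b);

    /* Both extents must be initialized, or both uninitialized. */
    if (!a->uninit != !b->uninit)
        return 0;

    /* b must start at the logical block right after a. */
    if (a->lblock + a->len != b->lblock)
        return 0;

    /* The merged length must still fit in a single extent. */
    if ((unsigned)a->len + b->len > DEMO_MAX_LEN)
        return 0;

    /* The physical blocks must be contiguous as well. */
    return a->pblock + a->len == b->pblock;
}
```

The only new condition compared with the pre-patch code is the first one: an initialized extent is never merged with an uninitialized neighbour, even when both are logically and physically contiguous.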

[PATCH 5/7] ext4: write support for preallocated blocks

2007-07-10 Thread Amit K. Arora

From:  Amit Arora [EMAIL PROTECTED]

write support for preallocated blocks

This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c 2007-07-09 15:24:39.0 -0700
+++ linux-2.6.22/fs/ext4/extents.c  2007-07-09 15:24:48.0 -0700
@@ -1140,6 +1140,53 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the ex extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass ex - 1 as argument instead of ex.
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done = 0;
+   int uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+   while (ex < EXT_LAST_EXTENT(eh)) {
+   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+   break;
+   /* merge with next extent! */
+   if (ext4_ext_is_uninitialized(ex))
+   uninitialized = 1;
+   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+   + ext4_ext_get_actual_len(ex + 1));
+   if (uninitialized)
+   ext4_ext_mark_uninitialized(ex);
+
+   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+   len = (EXT_LAST_EXTENT(eh) - ex - 1)
+   * sizeof(struct ext4_extent);
+   memmove(ex + 1, ex + 2, len);
+   }
+   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+   merge_done = 1;
+   WARN_ON(eh->eh_entries == 0);
+   if (!eh->eh_entries)
+   ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+  "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+   }
+
+   return merge_done;
+}
+
+/*
  * check if a portion of the newext extent overlaps with an
  * existing extent.
  *
@@ -1327,25 +1374,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-   while (nearex < EXT_LAST_EXTENT(eh)) {
-   if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-   break;
-   /* merge with next extent! */
-   if (ext4_ext_is_uninitialized(nearex))
-   uninitialized = 1;
-   nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-   + ext4_ext_get_actual_len(nearex + 1));
-   if (uninitialized)
-   ext4_ext_mark_uninitialized(nearex);
-
-   if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-   len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-   * sizeof(struct ext4_extent);
-   memmove(nearex + 1, nearex + 2, len);
-   }
-   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-   BUG_ON(eh->eh_entries == 0);
-   }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -2011,15 +2040,158 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a There is no split required: Entire extent should be initialized
+ *   b Splits in two extents: Write is happening at either end of the extent
+ *   c Splits in three extents: Someone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+   struct ext4_ext_path *path,
+   ext4_fsblk_t iblock,
+   unsigned long max_blocks)
+{
+   struct ext4_extent *ex, newex;
+   struct ext4_extent *ex1 = NULL;
+   struct ext4_extent *ex2 = NULL;
+   struct ext4_extent *ex3 = NULL;
+   struct ext4_extent_header *eh;
+   unsigned int allocated, ee_block, ee_len, depth;
+   ext4_fsblk_t newblock;
+   int err = 0;
+   int ret = 0;
+
+
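
The three cases listed in the comment above (no split, split in two, split in three) can be sketched as plain range arithmetic. This is only an illustration under stated assumptions: `demo_split` and `struct demo_range` are hypothetical names, and the kernel function additionally handles block allocation, journaling, and merging with neighbours, none of which is modelled here.

```c
#include <assert.h>

/* Hypothetical description of one resulting piece. */
struct demo_range {
    unsigned start;  /* first logical block of the piece */
    unsigned len;    /* length in blocks */
    int uninit;      /* 1 = still uninitialized, 0 = now initialized */
};

/*
 * Split the uninitialized extent [ee_block, ee_block + ee_len) for a write
 * covering [iblock, iblock + count). Fills out[] in logical order and
 * returns the number of pieces (1, 2 or 3).
 */
static int demo_split(unsigned ee_block, unsigned ee_len,
                      unsigned iblock, unsigned count,
                      struct demo_range out[3])
{
    unsigned end = ee_block + ee_len;
    unsigned wend;
    int n = 0;

    assert(ee_len > 0 && count > 0);
    assert(iblock >= ee_block && iblock < end);

    wend = iblock + count;
    if (wend > end)
        wend = end;  /* only the part inside this extent is converted */

    if (iblock > ee_block)  /* uninitialized head stays behind */
        out[n++] = (struct demo_range){ ee_block, iblock - ee_block, 1 };

    out[n++] = (struct demo_range){ iblock, wend - iblock, 0 };

    if (wend < end)  /* uninitialized tail stays behind */
        out[n++] = (struct demo_range){ wend, end - wend, 1 };

    return n;
}
```

For a 10-block extent starting at block 100, a 2-block write at block 104 yields three pieces (4 uninitialized, 2 initialized, 4 uninitialized), which is the worst case because it may need a new extent slot and hence new metadata space.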

[PATCH 6/7] ext4: support new modes in ext4

2007-07-10 Thread Amit K. Arora

From: Amit Arora [EMAIL PROTECTED]
Support new values of mode in ext4.

This patch supports new mode values/flags in ext4. With this patch ext4
will be able to support FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes. Supporting
the FALLOC_DEALLOCATE and FALLOC_UNRESV_SPACE fallocate modes in ext4 is
future work.

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -2453,8 +2453,9 @@ int ext4_ext_writepage_trans_blocks(stru
 /*
  * preallocate space for a file. This implements ext4's fallocate inode
  * operation, which gets called from sys_fallocate system call.
- * Currently only FA_ALLOCATE mode is supported on extent based files.
- * We may have more modes supported in future - like FA_DEALLOCATE, which
+ * Currently only FALLOC_ALLOCATE  and FALLOC_RESV_SPACE modes are supported on
+ * extent based files.
+ * We may have more modes supported in future - like FALLOC_DEALLOCATE, which
  * tells fallocate to unallocate previously (pre)allocated blocks.
  * For block-mapped files, posix_fallocate should fall back to the method
  * of writing zeroes to the required new blocks (the same behavior which is
@@ -2475,7 +2476,8 @@ long ext4_fallocate(struct inode *inode,
 * currently supporting (pre)allocate mode for extent-based
 * files _only_
 */
-   if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
+   !(mode == FALLOC_ALLOCATE || mode == FALLOC_RESV_SPACE))
return -EOPNOTSUPP;
 
/* preallocation to directories is currently not supported */
@@ -2548,9 +2550,11 @@ retry:
 
/*
 * Time to update the file size.
-* Update only when preallocation was requested beyond the file size.
+* Update only when preallocation was requested beyond the file size
+* and when FALLOC_FL_KEEP_SIZE mode is not specified!
 */
-   if ((offset + len) > i_size_read(inode)) {
+   if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+   (offset + len) > i_size_read(inode)) {
if (ret > 0) {
/*
 * if no error, we assume preallocation succeeded
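
The size-update rule in the hunk above reduces to one condition, sketched here as a pure function. `demo_new_size` and `DEMO_KEEP_SIZE` are hypothetical names; the flag value simply mirrors the FA_FL_KEEP_SIZE proposal from this thread, and the real code additionally adjusts for partially successful allocations.

```c
#include <assert.h>

#define DEMO_KEEP_SIZE 0x02  /* assumption: mirrors FA_FL_KEEP_SIZE above */

/*
 * Returns the file size after a successful preallocation of
 * [offset, offset + len): the file grows only when the range extends
 * past the current end of file and the caller did not ask to keep
 * the size unchanged.
 */
static long long demo_new_size(long long cur_size, long long offset,
                               long long len, int mode)
{
    assert(offset >= 0 && len > 0);

    if (!(mode & DEMO_KEEP_SIZE) && offset + len > cur_size)
        return offset + len;
    return cur_size;
}
```

With a 4096-byte file, preallocating 8192 bytes from offset 0 grows the file to 8192 bytes in the default mode and leaves it at 4096 bytes when the keep-size flag is set.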

[PATCH 7/7] ext4: change for better extent-to-group alignment

2007-07-10 Thread Amit K. Arora

From: Amit Arora [EMAIL PROTECTED]
Change on-disk format for extent to represent uninitialized/initialized extents

This change was suggested by Andreas Dilger as part of the following
post:
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg02445.html

This patch changes the EXT_MAX_LEN value and extent code which marks/checks
uninitialized extents. With this change it will be possible to have
initialized extents with 2^15 blocks (earlier the max blocks we could have
was 2^15 - 1). This way we can have better extent-to-block alignment.
Now, maximum number of blocks we can have in an initialized extent is 2^15
and in an uninitialized extent is 2^15 - 1.

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1106,7 +1106,7 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   unsigned short ext1_ee_len, ext2_ee_len;
+   unsigned short ext1_ee_len, ext2_ee_len, max_len;
 
/*
 * Make sure that either both extents are uninitialized, or
@@ -1115,6 +1115,11 @@ ext4_can_extents_be_merged(struct inode 
if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
return 0;
 
+   if (ext4_ext_is_uninitialized(ex1))
+   max_len = EXT_UNINIT_MAX_LEN;
+   else
+   max_len = EXT_INIT_MAX_LEN;
+
ext1_ee_len = ext4_ext_get_actual_len(ex1);
ext2_ee_len = ext4_ext_get_actual_len(ex2);
 
@@ -1127,7 +1132,7 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > max_len)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1814,7 +1819,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
ex->ee_block = cpu_to_le32(block);
ex->ee_len = cpu_to_le16(num);
-   if (uninitialized)
+   /*
+* Do not mark uninitialized if all the blocks in the
+* extent have been removed.
+*/
+   if (uninitialized && num)
ext4_ext_mark_uninitialized(ex);
 
err = ext4_ext_dirty(handle, inode, path + depth);
@@ -2307,6 +2316,18 @@ int ext4_ext_get_blocks(handle_t *handle
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
 
+   /*
+* See if request is beyond maximum number of blocks we can have in
+* a single extent. For an initialized extent this limit is
+* EXT_INIT_MAX_LEN and for an uninitialized extent this limit is
+* EXT_UNINIT_MAX_LEN.
+*/
+   if (max_blocks > EXT_INIT_MAX_LEN && create != EXT4_CREATE_UNINITIALIZED_EXT)
+   max_blocks = EXT_INIT_MAX_LEN;
+   else if (max_blocks > EXT_UNINIT_MAX_LEN &&
+create == EXT4_CREATE_UNINITIALIZED_EXT)
+   max_blocks = EXT_UNINIT_MAX_LEN;
+
/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
newex.ee_block = cpu_to_le32(iblock);
newex.ee_len = cpu_to_le16(max_blocks);
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru
 
 #define EXT_MAX_BLOCK  0x
 
-#define EXT_MAX_LEN ((1UL << 15) - 1)
+/*
+ * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
+ * initialized extent. This is 2^15 and not (2^16 - 1), since we use the
+ * MSB of ee_len field in the extent datastructure to signify if this
+ * particular extent is an initialized extent or an uninitialized (i.e.
+ * preallocated).
+ * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an
+ * uninitialized extent.
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an
+ * uninitialized one. In other words, if MSB of ee_len is set, it is an
+ * uninitialized extent with only one special scenario when ee_len = 0x8000.
+ * In this case we can not have an uninitialized extent of zero length and
+ * thus we make it as a special case of initialized extent with 0x8000 length.
+ * This way we get better extent-to-group alignment for initialized extents.
+ * Hence, the maximum number of blocks we can have in an *initialized*
+ * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1 (32767).
+ */
+#define EXT_INIT_MAX_LEN   (1UL << 15)
+#define EXT_UNINIT_MAX_LEN (EXT_INIT_MAX_LEN - 1)
 
 
 #define
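
The ee_len encoding described in the comment above can be sketched in a few lines. The helper bodies are not visible in this excerpt, so the functions below are written from the comment alone and use hypothetical `demo_` names; treat them as an illustration of the rule, not as the kernel helpers themselves.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_INIT_MAX_LEN   (1U << 15)               /* 32768 blocks */
#define DEMO_UNINIT_MAX_LEN (DEMO_INIT_MAX_LEN - 1)  /* 32767 blocks */

/* Values up to and including 0x8000 denote an initialized extent. */
static int demo_is_uninitialized(uint16_t ee_len)
{
    return ee_len > DEMO_INIT_MAX_LEN;
}

/* Strip the flag to obtain the real block count. */
static unsigned demo_actual_len(uint16_t ee_len)
{
    return ee_len <= DEMO_INIT_MAX_LEN ? ee_len
                                       : ee_len - DEMO_INIT_MAX_LEN;
}

/* Encode an uninitialized extent; zero length is not representable. */
static uint16_t demo_mark_uninitialized(uint16_t len)
{
    assert(len >= 1 && len <= DEMO_UNINIT_MAX_LEN);
    return (uint16_t)(len | DEMO_INIT_MAX_LEN);
}

/* Clamp a request to what one extent of the given kind can hold. */
static unsigned demo_clamp_blocks(unsigned max_blocks, int want_uninit)
{
    unsigned limit = want_uninit ? DEMO_UNINIT_MAX_LEN : DEMO_INIT_MAX_LEN;
    return max_blocks > limit ? limit : max_blocks;
}
```

The value 0x8000 on its own would mean "uninitialized, zero blocks", which is meaningless, so it is reused as an initialized extent of 32768 blocks. That is where the one-block difference between the two limits comes from.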

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-03 Thread Amit K. Arora

On Sat, Jun 30, 2007 at 12:52:46PM -0400, Andreas Dilger wrote:
 The @mode flags that are currently under consideration are (AFAIK):
 
 FA_FL_DEALLOC    0x01 /* deallocate unwritten extent (default allocate) */
 FA_FL_KEEP_SIZE  0x02 /* keep size for EOF {pre,de}alloc (default change size) */
 FA_FL_DEL_DATA   0x04 /* delete existing data in alloc range (default keep) */

We now have two sets of flags:
1) the above three, with which I think no one has any issues, and
2) the ones below, which need some discussion before we finalize
them.

I would prefer fallocate going into mainline with the above three modes;
the rest of the modes can be debated and discussed in parallel, and each
new mode/flag can be pushed as a separate patch. This way the fallocate
feature will not be held up indefinitely...

Please confirm if you find this approach ok. Otherwise, please object.
Thanks!

 FA_FL_ERR_FREE   0x08 /* free preallocation on error (default keep prealloc) */
 FA_FL_NO_MTIME   0x10 /* keep same mtime (default change on size, data change) */
 FA_FL_NO_CTIME   0x20 /* keep same ctime (default change on size, data change) */

--
Regards,
Amit Arora

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-03 Thread Amit K. Arora

On Tue, Jul 03, 2007 at 11:31:07AM +0100, Christoph Hellwig wrote:
 On Tue, Jul 03, 2007 at 03:38:48PM +0530, Amit K. Arora wrote:
   FA_FL_DEALLOC    0x01 /* deallocate unwritten extent (default allocate) */
   FA_FL_KEEP_SIZE  0x02 /* keep size for EOF {pre,de}alloc (default change size) */
   FA_FL_DEL_DATA   0x04 /* delete existing data in alloc range (default keep) */
  
  We now have two sets of flags - 
  1) the above three with which I think no one has any issues with, and
 
 Yes, I do.  FA_FL_DEL_DATA is plain stupid, a preallocation call should
 never delete data.  FA_FL_DEALLOC should probably be a separate syscall
 because it's very different functionality.

Well, if you look at the modes proposed using the above flags:

#define FA_ALLOCATE 0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
for preallocation, FA_ALLOCATE and FA_RESV_SPACE, which do not use this
flag. Hence preallocation will never delete data.
This flag is required only for FA_UNRESV_SPACE, which is a deallocation
mode, to support any existing XFS-aware applications/usage scenarios.
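
To make the point concrete, here is a small sketch that checks the four proposed modes against the two properties being argued about. The flag and mode values are the ones proposed in this thread (not a merged kernel ABI); the `demo_` helpers are hypothetical.

```c
#include <assert.h>

/* Flag values as proposed earlier in this thread. */
#define FA_FL_DEALLOC   0x01
#define FA_FL_KEEP_SIZE 0x02
#define FA_FL_DEL_DATA  0x04

/* Modes built from those flags, as listed above. */
#define FA_ALLOCATE     0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

/* A mode preallocates when the deallocate flag is clear. */
static int demo_is_prealloc(int mode)
{
    assert(mode >= 0);
    return !(mode & FA_FL_DEALLOC);
}

/* Only modes carrying FA_FL_DEL_DATA may remove written data. */
static int demo_may_delete_data(int mode)
{
    assert(mode >= 0);
    return (mode & FA_FL_DEL_DATA) != 0;
}
```

Under these definitions both preallocation modes report that they cannot delete data, and FA_UNRESV_SPACE is the only mode that can.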

And, regarding FA_FL_DEALLOC being a separate syscall: I think that would
defeat the very purpose of the @mode argument. We have this argument so
that we can provide more features like this. That said, I am not saying
that we should make things very complicated; but at least we should
provide some basic features which we expect most applications wanting
preallocation to use. To start with, we need to cater to the existing
applications/user base that use the XFS preallocation feature.

And further advanced features, like goal based preallocation, can be
implemented as a separate syscall.

 While we're at it I also dislike the FA_ prefix because it doesn't say
 anything and is far too generic.  FALLOC_ is much better.

Ok. This can be changed in the next take.
 
   FA_FL_ERR_FREE   0x08 /* free preallocation on error (default keep prealloc) */
 
 NACK on this one.  We should have just one behaviour, and from the thread
 that is not freeing the allocation on error.

I agree on this one. 
 
   FA_FL_NO_MTIME   0x10 /* keep same mtime (default change on size, data change) */
   FA_FL_NO_CTIME   0x20 /* keep same ctime (default change on size, data change) */
 
 NACK to these as well.  If i_size changes c/mtime need updates, if the size
 doesn't change they don't.  No need to add more flags for this.

This requirement came from the point of view of HSM applications. I hope
you saw Andreas' previous post and are keeping that in mind.

--
Regards,
Amit Arora

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-02 Thread Amit K. Arora

On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote:
 On Sat, Jun 30, 2007 at 11:21:11AM +0100, Christoph Hellwig wrote:
  On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
    Can you clarify - what is the current behaviour when ENOSPC (or some
    other error) is hit?  Does it keep the current fallocate() or does it
    free it?
   
   Currently it is left on the file system implementation. In ext4, we do
   not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
   end up with partial (pre)allocation. This is in line with dd and
   posix_fallocate, which also do not free the partially allocated space.
  
  I can't find anything in the specification of posix_fallocate
  (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
  that tells what should happen to allocated blocks on error.
 
 Yeah, and AFAICT glibc leaves them behind ATM.

Yes, it does.
 
  But common sense would be to not leak disk space on failure of this
  syscall, and this definitively should not be left up to the filesystem,
  either we always leak it or always free it, and I'd strongly favour
  the latter variant.

I would not call it a leak, since the blocks which got allocated as
part of the partial success of the fallocate syscall can be strictly
accounted for (i.e. they are assigned to a particular inode). And these
can be freed by the application, using a suitable @mode of fallocate.
 
 We can't simply walk the range and remove unwritten extents, as some
 of them may have been present before the fallocate() call. That
 makes it extremely difficult to undo a failed call and not remove
 more pre-existing pre-allocations.

Same is true for ext4 too. It is very difficult to keep track of which
uninitialized (unwritten) extents got allocated as part of the current
syscall. This is because, as David mentions, some of them might be
already present; and also because some of the older ones may have got
merged with the *new* uninitialized/unwritten extents as part of the
current syscall. 
 
 Given the current behaviour for posix_fallocate() in glibc, I think
 that retaining the same error semantic and punting the cleanup to
 userspace (where the app will fail with ENOSPC anyway) is the only
 sane thing we can do here. Trying to undo this in the kernel leads
 to lots of extra rarely used code in error handling paths...

Right. This leaves applications free to either use the partially
preallocated space or free it, without introducing additional complexity
in the kernel.

--
Regards,
Amit Arora

Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-28 Thread Amit K. Arora

On Thu, Jun 28, 2007 at 02:55:43AM -0700, Andrew Morton wrote:
 On Mon, 25 Jun 2007 18:58:10 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  N O T E: 
  ---
  1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
 of ext4 patch queue git tree hosted by Ted.
 
 Why the heck are replacements for these things being sent out again when
 they're already in -mm and they're already in Ted's queue (from which I
 need to diligently drop them each time I remerge)?
 
 Are we all supposed to re-review the entire patchset (or at least #4 and
 #7) again?

As I mentioned in the note above, only patches #4 and #7 were new and
thus these needed to be reviewed. Other patches are _not_ replacements
of any of the patches which are already part of -mm and/or in Ted's
patch queue. They were posted again just as placeholders so that the
two new patches (#4 and #7) could be reviewed. Sorry for any confusion.
 
 Please drop the non-ext4 patches from the ext4 tree and send incremental
 patches against the (non-ext4) fallocate patches in -mm.

Please let us know what you think of Mingming's suggestion of posting
all the fallocate patches including the ext4 ones as incremental ones
against the -mm.

--
Regards,
Amit Arora

Re: [PATCH 7/7][TAKE5] ext4: support new modes

2007-06-28 Thread Amit K. Arora

On Wed, Jun 27, 2007 at 10:04:56AM +1000, David Chinner wrote:
 On Wed, Jun 27, 2007 at 12:59:08AM +0530, Amit K. Arora wrote:
  On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote:
   On Jun 26, 2007  17:37 +0530, Amit K. Arora wrote:
    I think, modifying ctime/mtime should be dependent on the other flags.
    E.g., if we do not zero out data blocks on allocation/deallocation,
    update only ctime. Otherwise, update ctime and mtime both.
   
   I'm only being the advocate for requirements David Chinner has put
   forward due to existing behaviour in XFS.  This is one of the reasons
   why I think the flags mechanism we now have - we can encode the
   various different behaviours in any way we want and leave it to the
   caller.
  
  I understand. Maybe we can confirm once more with David Chinner if this
  is really required. Will it really be a compatibility issue if new XFS
  preallocations (i.e. via fallocate) update mtime/ctime?
 
 It should be left up to the filesystem to decide. Only the
 filesystem knows whether something changed and the timestamp should
 or should not be updated.

Since Andreas had suggested the FA_FL_NO_MTIME flag thinking it was a
requirement from XFS (whereas XFS does not need this flag), I don't think
we need to add this new flag.

Please let me know if someone still feels the FA_FL_NO_MTIME flag can be
useful.

--
Regards,
Amit Arora


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-28 Thread Amit K. Arora

On Wed, Jun 27, 2007 at 09:18:04AM +1000, David Chinner wrote:
 On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
  On Jun 26, 2007  16:02 +0530, Amit K. Arora wrote:
   On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
    Can you clarify - what is the current behaviour when ENOSPC (or some
    other error) is hit?  Does it keep the current fallocate() or does it
    free it?
   
   Currently it is left on the file system implementation. In ext4, we do
   not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
   end up with partial (pre)allocation. This is in line with dd and
   posix_fallocate, which also do not free the partially allocated space.
  
  Since I believe the XFS allocation ioctls do it the opposite way (free
  preallocated space on error) this should be encoded into the flags.
  Having it filesystem dependent just means that nobody will be happy.
 
 No, XFS does not free preallocated space on error. It is up to the
 application to clean up.

Since XFS also does not free preallocated space on error and this
behavior is in line with dd, posix_fallocate() and the current ext4
implementation, do we still need the FA_FL_FREE_ENOSPC flag?
 
  What I mean is that any data read from the file should have the appearance
  of being zeroed (whether zeroes are actually written to disk or not).  What
  I _think_ David is proposing is to allow fallocate() to return without
  marking the blocks even uninitialized and subsequent reads would return
  the old data from the disk.
 
 Correct, but for swap files that's not an issue - no user should be able
 to read them, and FA_MKSWAP would really need root privileges to execute.

Will the FA_MKSWAP mode still be required once your suggested change of
teaching do_mpage_readpage() about unwritten extents is in place?
Or would you still like to have the FA_MKSWAP mode?

--
Regards,
Amit Arora

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Amit K. Arora

On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
 On Jun 25, 2007  20:33 +0530, Amit K. Arora wrote:
  I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
  *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323  post.
  If it is decided that these flags are also needed, I will update this
  patch. Thanks!
 
 Can you clarify - what is the current behaviour when ENOSPC (or some other
 error) is hit?  Does it keep the current fallocate() or does it free it?

Currently it is left to the file system implementation. In ext4, we do
not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
end up with partial (pre)allocation. This is in line with dd and
posix_fallocate, which also do not free the partially allocated space.
 
 For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
 don't want to expose uninitialized disk blocks to userspace.  I'm not
 sure if this makes sense at all.

I don't think we need to make it the default - at least for filesystems which
have a mechanism to distinguish preallocated blocks from regular ones.
In ext4, for example, we will have a way to mark uninitialized extents.
All the preallocated blocks will be part of these uninitialized extents.
And any read on these extents will treat them as a hole, returning
zeroes to user land. Thus any existing data on uninitialized blocks will
not be exposed to the userspace.
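
The guarantee described above (reads from preallocated space return zeroes) can be checked from userspace. This is only a sketch under assumptions: it uses posix_fallocate() so that it runs on any file system, and the helper names range_reads_as_zero() and prealloc_then_check() are made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if every byte in [offset, offset + len) reads as zero,
 * 0 if a non-zero byte is found, -1 on read error or short file.
 */
int range_reads_as_zero(int fd, off_t offset, off_t len)
{
    unsigned char buf[4096];
    off_t pos = offset;
    off_t end = offset + len;

    assert(offset >= 0 && len > 0);
    while (pos < end) {
        off_t left = end - pos;
        size_t want = left < (off_t)sizeof(buf) ? (size_t)left : sizeof(buf);
        ssize_t got = pread(fd, buf, want, pos);
        ssize_t i;

        if (got <= 0)
            return -1;
        for (i = 0; i < got; i++)
            if (buf[i] != 0)
                return 0;
        pos += got;
    }
    return 1;
}

/* Preallocate len bytes at the start of the file, then verify the range. */
int prealloc_then_check(int fd, off_t len)
{
    if (posix_fallocate(fd, 0, len) != 0)
        return -1;
    return range_reads_as_zero(fd, 0, len);
}
```

On a freshly created file prealloc_then_check() should report 1 regardless of whether the file system wrote zeroes or merely marked the extents uninitialized; the difference is only visible in how long the call takes.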

--
Regards,
Amit Arora 

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Amit K. Arora

On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
 On Jun 25, 2007  19:15 +0530, Amit K. Arora wrote:
  +#define FA_FL_DEALLOC   0x01 /* default is allocate */
  +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */
  +#define FA_FL_DEL_DATA  0x04 /* default is keep written data on DEALLOC */
 
 In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
 For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
 each extent.  For some workloads this would be much faster than truncate
 and reallocate of all the blocks in a file.

In ext4, we already mark each extent having preallocated blocks as
uninitialized. This is done as part of following code (which is part of
patch 5/7) in ext4_ext_get_blocks() :  

@@ -2122,6 +2160,8 @@ int ext4_ext_get_blocks(handle_t *handle
/* try to insert new extent into found leaf and return */
ext4_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
+   if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
+   ext4_ext_mark_uninitialized(&newex);
err = ext4_ext_insert_extent(handle, inode, path, &newex);
if (err) {
/* free data blocks we just allocated */


 In that light, please change the comment to /* default is keep existing data */
 so that it doesn't imply this is only for DEALLOC.

Ok. Will update the comment.

Thanks!
--
Regards,
Amit Arora

Re: [PATCH 7/7][TAKE5] ext4: support new modes

2007-06-26 Thread Amit K. Arora

On Mon, Jun 25, 2007 at 03:56:25PM -0600, Andreas Dilger wrote:
 On Jun 25, 2007  19:20 +0530, Amit K. Arora wrote:
  @@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
   * currently supporting (pre)allocate mode for extent-based
   * files _only_
   */
  -   if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
  +   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
  +   !(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
  return -EOPNOTSUPP;
 
 This should probably just check for the individual flags it can support
 (e.g. no FA_FL_DEALLOC, no FA_FL_DEL_DATA).

Hmm.. I am thinking of a scenario where the file system supports some
individual flags, but does not support a particular combination of them.
Just for the sake of example, assume we have an FA_ZERO_SPACE mode as well.
Now suppose a file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE
and FA_RESV_SPACE, and no other mode (i.e. FA_UNRESV_SPACE is not supported
for some reason). This means that although we support the FA_FL_DEALLOC,
FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, we do not support the
combination of all these flags (which is nothing but FA_UNRESV_SPACE).
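
The difference between the two checking styles can be sketched as follows. The flag values are the ones quoted from patch 4/7 earlier in this thread; how the modes are composed from the flags is an assumption here, except for FA_UNRESV_SPACE, which is described above as the combination of all three. The function names are made up for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Flag bits as quoted from patch 4/7. */
#define FA_FL_DEALLOC   0x01
#define FA_FL_KEEP_SIZE 0x02
#define FA_FL_DEL_DATA  0x04

/* Assumed composition of the modes (only FA_UNRESV_SPACE is stated above). */
#define FA_ALLOCATE     0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

/* Style 1: accept only the exact modes the file system implements. */
int check_mode_whitelist(int mode)
{
    switch (mode) {
    case FA_ALLOCATE:
    case FA_RESV_SPACE:
        return 0;
    default:
        return -EOPNOTSUPP;
    }
}

/* Style 2: accept any mode built purely from individually supported flags. */
int check_mode_flags(int mode, int supported_flags)
{
    return (mode & ~supported_flags) ? -EOPNOTSUPP : 0;
}

/* Usage: the per-flag check lets a combination through that style 1 rejects. */
void mode_check_demo(void)
{
    int all = FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA;

    assert(check_mode_whitelist(FA_RESV_SPACE) == 0);
    assert(check_mode_whitelist(FA_UNRESV_SPACE) == -EOPNOTSUPP);
    assert(check_mode_flags(FA_UNRESV_SPACE, all) == 0);
}
```

With a per-flag mask the file system must be prepared to handle every combination of the bits it advertises, which is exactly the concern raised above.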
 
 I also thought another proposed flag was to determine whether mtime (and
 maybe ctime) is changed when doing prealloc/dealloc space?  Default should
 probably be to change mtime/ctime, and have FA_FL_NO_MTIME.  Someone else
 should decide if we want to allow changing the file w/o changing ctime, if
 that is required even though the file is not visibly changing.  Maybe the
 ctime update should be implicit if the size or mtime are changing?

Is it really required? I mean, why should we allow users not to update
ctime/mtime even if the file metadata/data gets updated? It sounds
a bit unnatural to me.
Is there any application scenario you have in mind when you suggest
giving this flexibility to userspace?

I think, modifying ctime/mtime should be dependent on the other flags.
E.g., if we do not zero out data blocks on allocation/deallocation,
update only ctime. Otherwise, update ctime and mtime both.
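
As a sketch, the rule suggested above could look like this. The enum and function names are hypothetical and nothing like this exists in the posted patches; it only encodes "ctime always, mtime only when data is zeroed".

```c
#include <assert.h>
#include <stdbool.h>

/* Bitmask of timestamps an fallocate-style operation should update. */
enum ts_update {
    TS_NONE  = 0,
    TS_CTIME = 1,
    TS_MTIME = 2
};

/*
 * Hypothetical policy helper: a successful (de)allocation changes the
 * inode, so ctime is always updated; mtime is added only when data
 * blocks are zeroed out, i.e. when file contents visibly change.
 */
unsigned int timestamps_for_fallocate(bool data_zeroed)
{
    unsigned int mask = TS_CTIME;

    if (data_zeroed)
        mask |= TS_MTIME;

    assert(mask & TS_CTIME); /* ctime is part of every result */
    return mask;
}
```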

--
Regards,
Amit Arora

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Amit K. Arora

On Tue, Jun 26, 2007 at 11:42:50AM -0400, Andreas Dilger wrote:
 On Jun 26, 2007  16:15 +0530, Amit K. Arora wrote:
  On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
   In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
   For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
   each extent.  For some workloads this would be much faster than truncate
   and reallocate of all the blocks in a file.
  
  In ext4, we already mark each extent having preallocated blocks as
  uninitialized. This is done as part of following code (which is part of
  patch 5/7) in ext4_ext_get_blocks() :  
 
 What I meant is that with XFS_IOC_ALLOCSP the previously-written data
 is ZEROED OUT, unlike with fallocate() which leaves previously-written
 data alone and only allocates in holes.
 
 In order to specify this for allocation, FA_FL_DEL_DATA would need to make
 sense for allocations (as well as the deallocation).  This is fairly easy
 to do - just mark all of the existing extents as unallocated, and their
 data disappears.

Ok, agreed. Will add the FA_ZERO_SPACE mode too.
Thanks!
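
To illustrate why marking an extent is enough to make its data "disappear", here is a toy userspace model. It is not ext4 code: struct toy_extent, its explicit flag and the helper names are invented for this sketch, and real extents record the state differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16
#define TOY_BLOCKS     4

/* Toy extent: a few blocks of "on-disk" data plus an uninitialized flag. */
struct toy_extent {
    unsigned char data[TOY_BLOCKS][TOY_BLOCK_SIZE];
    bool uninitialized; /* when set, readers must see zeroes */
};

/* Read one block; an uninitialized extent is treated like a hole. */
void toy_read_block(const struct toy_extent *ex, unsigned int blk,
                    unsigned char *out)
{
    assert(blk < TOY_BLOCKS);
    if (ex->uninitialized)
        memset(out, 0, TOY_BLOCK_SIZE);
    else
        memcpy(out, ex->data[blk], TOY_BLOCK_SIZE);
}

/* "Zero" the extent without touching a single data block. */
void toy_zero_space(struct toy_extent *ex)
{
    ex->uninitialized = true;
}
```

The old bytes are still present in the data array after toy_zero_space(), but no reader going through toy_read_block() can see them, which is why this is cheaper than truncating and reallocating.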

--
Regards,
Amit Arora

Re: [PATCH 7/7][TAKE5] ext4: support new modes

2007-06-26 Thread Amit K. Arora

On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote:
 On Jun 26, 2007  17:37 +0530, Amit K. Arora wrote:
  Hmm.. I am thinking of a scenario when the file system supports some
  individual flags, but does not support a particular combination of them.
  Just for example sake, assume we have FA_ZERO_SPACE mode also. Now, if a
  file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and
  FA_RESV_SPACE; and no other mode (i.e. FA_UNRESV_SPACE is not supported
  for some reason). This means that although we support FA_FL_DEALLOC,
  FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, but we do not support the
  combination of all these flags (which is nothing but FA_UNRESV_SPACE).
 
 That is up to the filesystem to determine then.  I just thought it should
 be clear to return an error for flags (or as you say combinations thereof)
 that the filesystem doesn't understand.
 
 That said, I'd think in most cases the flags are orthogonal, so if you
 support some combination of the flags (e.g. FA_FL_DEL_DATA, FA_FL_DEALLOC)
 then you will also support other combinations of those flags just from
 the way it is coded.

Ok. 
 
   I also thought another proposed flag was to determine whether mtime (and
   maybe ctime) is changed when doing prealloc/dealloc space?  Default should
   probably be to change mtime/ctime, and have FA_FL_NO_MTIME.  Someone else
   should decide if we want to allow changing the file w/o changing ctime, if
   that is required even though the file is not visibly changing.  Maybe the
   ctime update should be implicit if the size or mtime are changing?
  
  Is it really required ? I mean, why should we allow users not to update
  ctime/mtime even if the file metadata/data gets updated ? It sounds
  a bit unnatural to me.
  Is there any application scenario in your mind, when you suggest of
  giving this flexibility to userspace ?
 
 One reason is that XFS does NOT update the mtime/ctime when doing the
 XFS_IOC_* allocation ioctls.

Hmm.. I personally will call it a bug in XFS code then. :)

  I think, modifying ctime/mtime should be dependent on the other flags.
  E.g., if we do not zero out data blocks on allocation/deallocation,
  update only ctime. Otherwise, update ctime and mtime both.
 
 I'm only being the advocate for requirements David Chinner has put
 forward due to existing behaviour in XFS.  This is one of the reasons
 why I think the flags mechanism we now have - we can encode the
 various different behaviours in any way we want and leave it to the
 caller.

I understand. Maybe we can confirm once more with David Chinner if this
is really required. Will it really be a compatibility issue if new XFS
preallocations (i.e. via fallocate) update mtime/ctime? Will old
applications really get affected? If yes, then it might be worth
implementing - even though I personally don't like it.

David, can you please confirm ? Thanks!

--
Regards,
Amit Arora

[PATCH 1/7][TAKE5] fallocate() implementation on i386, x86_64 and powerpc

2007-06-25 Thread Amit K. Arora

This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.

Changelog:
-
Changes from Take3 to Take4:
 1) Do not update c/mtime. Let each filesystem update ctime (update of
mtime will not be required for allocation since we touch only
metadata/inode and not blocks), if required.
Changes from Take2 to Take3:
 1) Patches now based on 2.6.22-rc1 kernel.
Changes from Take1(initial post on 26th April, 2007) to Take2:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
posix_fallocate should return EINVAL for len <= 0).
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
 4) Do not return ENODEV for dirs (let individual file systems decide if
they want to support preallocation to directories or not).
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
 7) Added mode descriptions in fs.h
 8) Added variable names to function definition (fallocate inode op)
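
The argument checks from items 2) and 5) above can be sketched in isolation as below. This is a standalone illustration, not the patch code: check_fallocate_range() and its max_bytes parameter are hypothetical, and the use of -EFBIG for a range that wraps or exceeds the limit is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Validate an (offset, len) pair against a file system size limit.
 * Returns 0 if the range is acceptable, a negative errno otherwise.
 */
int check_fallocate_range(int64_t offset, int64_t len, int64_t max_bytes)
{
    assert(max_bytes >= 0);

    if (offset < 0 || len <= 0)
        return -EINVAL;

    /*
     * Same test as "offset + len > max_bytes", written so that the
     * addition can never wrap through zero.
     */
    if (len > max_bytes - offset)
        return -EFBIG; /* assumed error code for this sketch */

    return 0;
}
```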


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22-rc4/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.22-rc4.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22-rc4/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+   .long sys_fallocate
Index: linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
===
--- linux-2.6.22-rc4.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+u32 lenhi, u32 lenlo)
+{
+   return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 unsigned long low)
 {
Index: linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.22-rc4.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
.quad compat_sys_signalfd
.quad compat_sys_timerfd
.quad sys_eventfd
+   .quad sys_fallocate
 ia32_syscall_end:
Index: linux-2.6.22-rc4/fs/open.c
===
--- linux-2.6.22-rc4.orig/fs/open.c
+++ linux-2.6.22-rc4/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *   FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ * requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by the
+ * individual file systems will update the file size and/or ctime/mtime
+ * depending on the mode and also on the success of the operation.
+ *
+ * Note: In case the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ * 0   : On SUCCESS a value of zero is returned.
+ * error   : On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * TBD Generic fallocate to be added for file systems that do not
+ *  support fallocate it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+   if (offset < 0 || len <= 0)
+   goto out;
+
+   /* Return error if mode is not supported */
+   ret =

[PATCH 3/7][TAKE5] fallocate() on ia64

2007-06-25 Thread Amit K. Arora

fallocate() on ia64

ia64 fallocate syscall support.

Signed-off-by: Dave Chinner [EMAIL PROTECTED]

Index: linux-2.6.22-rc4/arch/ia64/kernel/entry.S
===
--- linux-2.6.22-rc4.orig/arch/ia64/kernel/entry.S  2007-06-11 17:22:15.0 -0700
+++ linux-2.6.22-rc4/arch/ia64/kernel/entry.S   2007-06-11 17:30:37.0 -0700
@@ -1588,5 +1588,6 @@
data8 sys_signalfd
data8 sys_timerfd
data8 sys_eventfd
+   data8 sys_fallocate // 1310
 
.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
Index: linux-2.6.22-rc4/include/asm-ia64/unistd.h
===
--- linux-2.6.22-rc4.orig/include/asm-ia64/unistd.h 2007-06-11 17:22:15.0 -0700
+++ linux-2.6.22-rc4/include/asm-ia64/unistd.h  2007-06-11 17:30:37.0 -0700
@@ -299,11 +299,12 @@
 #define __NR_signalfd  1307
 #define __NR_timerfd   1308
 #define __NR_eventfd   1309
+#define __NR_fallocate 1310
 
 #ifdef __KERNEL__
 
 
-#define NR_syscalls 286 /* length of syscall table */
+#define NR_syscalls 287 /* length of syscall table */
 
 /*
  * The following defines stop scripts/checksyscalls.sh from complaining about

[PATCH 0/6][TAKE5] fallocate system call

2007-06-25 Thread Amit K. Arora

N O T E: 
---
1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
   of ext4 patch queue git tree hosted by Ted.
2) The above new patches (4/7 and 7/7) are based on the discussion
   between Andreas Dilger and David Chinner on the mode argument,
   when the latter posted a man page on fallocate.
3) All of these patches are based on 2.6.22-rc4 kernel and apply to
   2.6.22-rc5 too (with some successful hunks, though - since the
   ext4 patch queue git tree has some other patches as well before
   fallocate patches in the patch series).

Changelog:
-
Changes from Take4 to Take5:
1) New Patch 4/7 implements new flags and values for mode
   argument of fallocate system call.
2) New Patch 7/7 implements 2 (out of 4) modes in ext4.
   Implementation of rest of the (two) modes is yet to be done.
3) Updated the interface description below to mention new modes
   being supported.
4) Removed extent overlap check bugfix (patch 4/6 in TAKE4),
   since it is now part of mainline.
5) Corrected format of couple of multi-line comments, which got
   missed in earlier take.

Changes from Take2 to Take3:
1) Return type is now described in the interface description
   above.
2) Patches rebased to 2.6.22-rc1 kernel.

** Each post will have an individual changelog for a particular patch.


Description:
---
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if the
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Though this has the advantage of working
on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and, in case the kernel/filesystem does not implement it, fall
back to the current implementation of writing zeroes to the new blocks.
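
The try-the-syscall-then-fall-back pattern described above can be sketched in userspace roughly as follows. Assumptions: it uses the fallocate() wrapper as found in present-day Linux C libraries (which did not exist when this was posted), with mode 0 standing in for plain allocation, and the helper names preallocate() and touch_blocks_fallback() are made up. It is a simplified illustration, not the glibc implementation.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for the Linux fallocate() wrapper */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Rewrite one byte in place (or write a zero past EOF) to force allocation. */
static int touch_byte(int fd, off_t pos, off_t old_size)
{
    char c = 0;

    if (pos < old_size && pread(fd, &c, 1, pos) != 1)
        return -1;
    return pwrite(fd, &c, 1, pos) == 1 ? 0 : -1;
}

/* Slow path: touch one byte per block so that every block gets allocated. */
int touch_blocks_fallback(int fd, off_t offset, off_t len)
{
    struct stat st;
    off_t end = offset + len;
    off_t step, pos;

    if (fstat(fd, &st) != 0)
        return -1;
    step = st.st_blksize > 0 ? (off_t)st.st_blksize : 512;

    for (pos = offset; pos < end; pos += step)
        if (touch_byte(fd, pos, st.st_size) != 0)
            return -1;
    /* Make sure the final block and the new file size are covered. */
    return touch_byte(fd, end - 1, st.st_size);
}

/* Fast path first; fall back only if the kernel or fs lacks support. */
int preallocate(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);

    if (fallocate(fd, 0, offset, len) == 0)
        return 0;
    if (errno != ENOSYS && errno != EOPNOTSUPP)
        return -1;
    return touch_blocks_fallback(fd, offset, len);
}
```

Existing file data is preserved by the slow path because bytes inside the old file size are read and written back unchanged.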

Interface:
-
The system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports four modes - FA_ALLOCATE, FA_DEALLOCATE, 
  FA_RESV_SPACE and FA_UNRESV_SPACE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
a given file (specified by fd). This mode changes the file size if
the preallocation is done beyond the EOF. It also updates the
ctime in the inode of the corresponding file, marking a
successful allocation.
  FA_RESV_SPACE: This mode is much the same as FA_ALLOCATE. The only
difference is that the file size will not be changed.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
previously preallocated blocks. This also may change the file size
and the ctime/mtime. This is the reverse of FA_ALLOCATE mode.
  FA_UNRESV_SPACE: This mode is much the same as FA_DEALLOCATE. The
difference is that the file size is not changed and the data is
also deleted.
* New modes might get added in future.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).

RETURN VALUE: The system call returns 0 on success and an error on
failure. This is done to keep the semantics same as of
posix_fallocate().

sys_fallocate() on s390:
---
There is a problem with s390 ABI to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch to
solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
But, this seems to be the best solution so far.

Known Problem:
-
mmapped writes into uninitialized extents are a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite() and also
with a generic block_page_mkwrite() implementation already posted, we
can implement this some time later. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-
1) Implementation on other architectures (other than i386, x86_64,
ia64, ppc64 and s390(x)).
2)

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-12 Thread Amit K. Arora

On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
 On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
  On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
   All I'm really interested in right now is that the fallocate
   _interface_ can be used as a *complete replacement* for the
   pre-existing XFS-specific ioctls that are already used by
   applications.  What ext4 can or can't do right now is irrelevant to
   this discussion - the interface definition needs to take priority
   over implementation
  
  Would you like to write up an interface definition description (likely
  man page) and post it for review, possibly with a mention of apps using
  it today ?
 
 Yeah, I started doing that yesterday as i figured it was the only way
 to cut the discussion short
 
  One reason for introducing the mode parameter was to allow the interface to
  evolve incrementally as more options / semantic questions are proposed, so
  that we don't have to make all the decisions right now. 
  So it would be good to start with a *minimal* definition, even just one mode.
  The rest could follow as subsequent patches, each being reviewed and debated
  separately. Otherwise this discussion can drag on for a long time.
 
 Minimal definition to replace what applications use on XFS and to
 support posix_fallocate are the three that have been mentioned so
 far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
 all in a man page...

Hi Dave,

Did you get time to write the above man page? It will help to push
further patches in time (e.g. for FA_PREALLOCATE mode).

The idea I had was to push the patch with bare minimum functionality
(FA_ALLOCATE and FA_DEALLOCATE modes) and, in parallel, finalize the other
new mode(s) based on the man page you planned to provide.

Thanks!
--
Regards,
Amit Arora

 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 Principal Engineer
 SGI Australian Software Group

Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-17 Thread Amit K. Arora

On Thu, May 17, 2007 at 09:40:36AM +1000, David Chinner wrote:
 On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
  On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
   On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
  
Following changes were made to the previous version:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
posix_fallocate should return EINVAL for len <= 0).
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
 4) Do not return ENODEV for dirs (let individual file systems decide if
they want to support preallocation to directories or not).
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
   
   Please don't make this always happen. c/mtime updates should be dependent
   on the mode being used and whether there is visible change to the file.
   If no userspace visible changes to the file occurred, then timestamps
   should not be changed.
  
  i_blocks will be updated, so it seems reasonable to update ctime.  mtime
  shouldn't be changed, though, since the contents of the file will be
  unchanged.
 
 That's assuming blocks were actually allocated - if the prealloc range already
 has underlying blocks there is no change and so we should not be changing
 mtime either. Only the filesystem will know if it has changed the file, so I
 think that timestamp updates need to be driven down to that level, not done
 blindly at the highest layer

Ok. Will make this change in the next post.

--
Regards,
Amit Arora
 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 Principal Engineer
 SGI Australian Software Group

[PATCH 0/6][TAKE4] fallocate system call

2007-05-17 Thread Amit K. Arora

Description:
---
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if the
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Though this has the advantage of working
on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and, in case the kernel/filesystem does not implement it, fall
back to the current implementation of writing zeroes to the new blocks.

Interface:
-
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
a given file (specified by fd). This mode changes the file size if
the preallocation is done beyond the EOF. It also updates the
ctime in the inode of the corresponding file, marking a
successful allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
previously preallocated blocks. This also may change the file size
and the ctime/mtime.
* New modes might get added in future. One such new mode which is
  already under discussion is FA_PREALLOCATE, which when used will
  preallocate space but will not change the filesize and [cm]time.
  Since the semantics of this new mode is not clear and agreed upon yet,
  this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).
 
RETURN VALUE: The system call returns 0 on success and an error on
failure. This is done to keep the semantics same as of
posix_fallocate(). 

sys_fallocate() on s390:
---
There is a problem with s390 ABI to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch to
solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
But, this seems to be the best solution so far.

Known Problem:
-
mmapped writes into uninitialized extents are a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite() and also
with a generic block_page_mkwrite() implementation already posted, we
can implement this some time later. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-
1) Implementation on other architectures (other than i386, x86_64,
ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2) A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3) Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Changelog:
-
Changes from Take2 to Take3:
1) Return type is now described in the interface description
   above.
2) Patches rebased to 2.6.22-rc1 kernel.

** Each post will have an individual changelog for a particular patch.


Following patches follow:
Patch 1/6 : fallocate() implementation on i86, x86_64 and powerpc
Patch 2/6 : fallocate() on s390
Patch 3/6 : fallocate() on ia64
Patch 4/6 : ext4: Extent overlap bugfix
Patch 5/6 : ext4: fallocate support in ext4
Patch 6/6 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora

[PATCH 1/6][TAKE4] fallocate() implementation on i86, x86_64 and powerpc

2007-05-17 Thread Amit K. Arora

This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.

Changelog:
-
Changes from Take3 to Take4:
 1) Do not update c/mtime. Let each filesystem update ctime (update of
mtime will not be required for allocation since we touch only
metadata/inode and not blocks), if required.
Changes from Take2 to Take3:
 1) Patches now based on 2.6.22-rc1 kernel.
Changes from Take1(initial post on 26th April, 2007) to Take2:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
posix_fallocate should return EINVAL for len <= 0).
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
 4) Do not return ENODEV for dirs (let individual file systems decide if
they want to support preallocation to directories or not).
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
 7) Added mode descriptions in fs.h
 8) Added variable names to function definition (fallocate inode op)

Here is the new patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 arch/i386/kernel/syscall_table.S |    1 
 arch/powerpc/kernel/sys_ppc32.c  |    7 +++
 arch/x86_64/ia32/ia32entry.S     |    1 
 fs/open.c                        |   86 +++
 include/asm-i386/unistd.h        |    3 -
 include/asm-powerpc/systbl.h     |    1 
 include/asm-powerpc/unistd.h     |    3 -
 include/asm-x86_64/unistd.h      |    2 
 include/linux/fs.h               |   13 +
 include/linux/syscalls.h         |    1 
 10 files changed, 116 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.22-rc1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+   .long sys_fallocate
Index: linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c
===
--- linux-2.6.22-rc1.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+u32 lenhi, u32 lenlo)
+{
+   return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 unsigned long low)
 {
Index: linux-2.6.22-rc1/fs/open.c
===
--- linux-2.6.22-rc1.orig/fs/open.c
+++ linux-2.6.22-rc1/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *   FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ * requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by the
+ * individual file systems will update the file size and/or ctime/mtime
+ * depending on the mode and also on the success of the operation.
+ *
+ * Note: Incase the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ * 0   : On SUCCESS a value of zero is returned.
+ * error   : On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * TBD Generic fallocate to be added for file systems that do not
+ *  support fallocate it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret =

[PATCH 2/6][TAKE4] fallocate() on s390

2007-05-17 Thread Amit K. Arora

This is the patch suggested by Martin Schwidefsky to support
sys_fallocate() on s390(x) platform.

He also suggested a wrapper in glibc to handle this system call on
s390. Posting it here so that we get feedback for this too.

.globl __fallocate
ENTRY(__fallocate)
stm %r6,%r7,28(%r15)    /* save %r6/%r7 on stack */
cfi_offset (%r7, -68)
cfi_offset (%r6, -72)
lm  %r6,%r7,96(%r15)    /* load loff_t len from stack */
svc SYS_ify(fallocate)
lm  %r6,%r7,28(%r15)    /* restore %r6/%r7 from stack */
br  %r14
PSEUDO_END(__fallocate)


Here are the comments and the patch to linux kernel from him.

-
From: Martin Schwidefsky [EMAIL PROTECTED]

This patch implements support of fallocate system call on s390(x)
platform. A wrapper is added to address the issue which s390 ABI has
with the arguments of this system call.

Signed-off-by: Martin Schwidefsky [EMAIL PROTECTED]
---
 arch/s390/kernel/compat_wrapper.S |   10 ++
 arch/s390/kernel/sys_s390.c   |   29 +
 arch/s390/kernel/syscalls.S   |1 +
 include/asm-s390/unistd.h |3 ++-
 4 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
llgtr   %r2,%r2 # char *
llgtr   %r3,%r3 # struct compat_timeval *
jg  compat_sys_utimes
+
+   .globl  sys_fallocate_wrapper
+sys_fallocate_wrapper:
+   lgfr    %r2,%r2         # int
+   lgfr    %r3,%r3         # int
+   sllg    %r4,%r4,32      # get high word of 64bit loff_t
+   lr      %r4,%r5         # get low word of 64bit loff_t
+   sllg    %r5,%r6,32      # get high word of 64bit loff_t
+   l       %r5,164(%r15)   # get low word of 64bit loff_t
+   jg      sys_fallocate
Index: linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
@@ -265,3 +265,32 @@ s390_fadvise64_64(struct fadvise64_64_ar
return -EFAULT;
return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice);
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument len is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+  u32 len_high, u32 len_low)
+{
+   union {
+   u64 len;
+   struct {
+   u32 high;
+   u32 low;
+   };
+   } cv;
+   cv.high = len_high;
+   cv.low = len_low;
+   return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL /* 310 sys_move_pages */
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
Index: linux-2.6.22-rc1/include/asm-s390/unistd.h
===
--- linux-2.6.22-rc1.orig/include/asm-s390/unistd.h
+++ linux-2.6.22-rc1/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu 311
 #define __NR_epoll_pwait   312
 #define __NR_utimes 313
+#define __NR_fallocate 314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /* 
  * There are some system calls that are not present on 64 bit, some
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
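
The wrappers in the patch above rebuild a 64-bit length from two 32-bit
halves. s390_fallocate() does it with a union, which relies on s390 being
big-endian. A minimal sketch of the byte-order-independent alternative,
using a shift as compat_sys_fallocate() does on powerpc, is shown below.
The helper name join_u32_halves is hypothetical and is not part of any
patch in this thread.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): join the upper and lower
 * 32 bits of a 64-bit value without depending on byte order.
 */
uint64_t join_u32_halves(uint32_t high, uint32_t low)
{
    /* Widen first so that the shift is done in 64-bit arithmetic. */
    return ((uint64_t)high << 32) | (uint64_t)low;
}
```

Widening before the shift matters: shifting a 32-bit value left by 32 is
undefined behaviour in C.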

[PATCH 3/6][TAKE4] fallocate() on ia64

2007-05-17 Thread Amit K. Arora

Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64

From: David Chinner [EMAIL PROTECTED]
Subject: [PATCH] ia64 fallocate syscall
Cc: Amit K. Arora [EMAIL PROTECTED], 
[EMAIL PROTECTED], linux-ext4@vger.kernel.org,
[EMAIL PROTECTED], [EMAIL PROTECTED]

ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner [EMAIL PROTECTED]

---
 arch/ia64/kernel/entry.S  |1 +
 include/asm-ia64/unistd.h |3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S
===
--- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S  2007-05-12 18:45:56.0 -0700
+++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S   2007-05-15 15:36:48.0 -0700
@@ -1585,5 +1585,6 @@
data8 sys_getcpu
data8 sys_epoll_pwait   // 1305
data8 sys_utimensat
+   data8 sys_fallocate

.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h
===
--- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-12 18:45:56.0 -0700
+++ linux-2.6.22-rc1/include/asm-ia64/unistd.h  2007-05-15 15:37:51.0 -0700
@@ -296,6 +296,7 @@
 #define __NR_getcpu 1304
 #define __NR_epoll_pwait   1305
 #define __NR_utimensat 1306
+#define __NR_fallocate 1307

 #ifdef __KERNEL__



[PATCH 5/6][TAKE4] ext4: fallocate support in ext4

2007-05-17 Thread Amit K. Arora

This patch implements the ->fallocate() inode operation in ext4. With
this patch, users of ext4 file systems will be able to use the
fallocate() system call for persistent preallocation.

The current implementation only supports preallocation for regular files
with extent maps (directories are not supported yet). This patch does
not support block-mapped files.

Only FA_ALLOCATE mode is supported as of now. Supporting FA_DEALLOCATE
mode is a ToDo item.

Changelog:
-
Changes from Take3 to Take4:
 1) Changed ext4_fallocate() declaration and definition to return a long
and not an int, to match the ->fallocate() inode op.
 2) Update ctime if new blocks get allocated.
Changes from Take2 to Take3:
 1) Patch rebased to 2.6.22-rc1 kernel version.
 2) Removed unnecessary EXPORT_SYMBOL(ext4_fallocate);.
Changes from Take1 to Take2:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate()
 7) Converted macros that handle uninitialized extents into inline
functions.
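
The inline helpers mentioned in item 7 are not visible in this excerpt.
As a rough sketch of the idea only, assuming the top bit of the 16-bit
length marks an uninitialized extent (which matches the "top bit of
ee_len" comment in ext4_can_extents_be_merged()), they could look like
the following. The sketch_ names, the constants and the simplified struct
are hypothetical, and on-disk byte order conversion is left out.

```c
#include <stdint.h>

/* Assumption: bit 15 of the length flags an uninitialized extent. */
#define SKETCH_UNINIT_FLAG 0x8000u
#define SKETCH_LEN_MASK    0x7fffu

/* Simplified stand-in for struct ext4_extent, kept in host byte order. */
struct sketch_extent {
    uint32_t block;   /* first logical block covered by the extent */
    uint16_t len;     /* length in blocks, plus the flag bit */
};

void sketch_mark_uninitialized(struct sketch_extent *ex)
{
    ex->len = (uint16_t)(ex->len | SKETCH_UNINIT_FLAG);
}

int sketch_is_uninitialized(const struct sketch_extent *ex)
{
    return (ex->len & SKETCH_UNINIT_FLAG) != 0;
}

/* Length in blocks with the flag bit masked off. */
uint16_t sketch_actual_len(const struct sketch_extent *ex)
{
    return (uint16_t)(ex->len & SKETCH_LEN_MASK);
}
```

This is why the patch replaces raw le16_to_cpu(ee_len) reads with
ext4_ext_get_actual_len(): a raw read would include the flag bit.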

Here is the updated patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  249 +---
 fs/ext4/file.c  |1 
 include/linux/ext4_fs.h |8 +
 include/linux/ext4_fs_extents.h |   12 +
 4 files changed, 229 insertions(+), 41 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1144,7 +1156,7 @@

[PATCH 6/6][TAKE4] ext4: write support for preallocated blocks

2007-05-17 Thread Amit K. Arora

This patch adds write support for the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.
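
The split described above can be sketched in isolation as follows. This
is only an illustration of the "up to three pieces" arithmetic, not the
code from the patch: the sketch_ names and the plain struct are
hypothetical, and it assumes ext_start + ext_len does not overflow.

```c
#include <stdint.h>

/* One piece of a split extent: a block range plus its state. */
struct sketch_range {
    uint32_t start;
    uint32_t len;
    int initialized;   /* 1 for the written part, 0 for the rest */
};

/*
 * Split the uninitialized extent [ext_start, ext_start + ext_len) around
 * a write covering [wr_start, wr_start + wr_len). Fills out[] and returns
 * the number of pieces (1 to 3), or 0 if the write is empty or not fully
 * inside the extent.
 */
int sketch_split_extent(uint32_t ext_start, uint32_t ext_len,
                        uint32_t wr_start, uint32_t wr_len,
                        struct sketch_range out[3])
{
    uint32_t ext_end = ext_start + ext_len;
    uint32_t wr_end = wr_start + wr_len;
    int n = 0;

    if (wr_len == 0 || wr_end < wr_start ||
        wr_start < ext_start || wr_end > ext_end)
        return 0;

    /* Uninitialized part to the left of the write, if any. */
    if (wr_start > ext_start)
        out[n++] = (struct sketch_range){ ext_start, wr_start - ext_start, 0 };

    /* The written part becomes initialized. */
    out[n++] = (struct sketch_range){ wr_start, wr_len, 1 };

    /* Uninitialized part to the right of the write, if any. */
    if (wr_end < ext_end)
        out[n++] = (struct sketch_range){ wr_end, ext_end - wr_end, 0 };

    return n;
}
```

A write covering the whole extent yields one piece, a write at either end
yields two, and a write in the middle yields three. Each extra piece may
need a new slot in the extent tree, which is where an allocation can fail
when the file system is full.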

Changelog:
-
Changes from Take3 to Take4:
 - no change -
Changes from Take2 to Take3:
 1) Patch now rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
 1) Replaced BUG_ON with WARN_ON & ext4_error.
 2) Added variable names to the function declaration of
ext4_ext_try_to_merge().
 3) Updated variable declarations to use multiple-definitions-per-line.
 4) if((a=foo())).. was broken into a=foo(); if(a)..
 5) Removed extra spaces.

Here is the updated patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  234 +++-
 include/linux/ext4_fs_extents.h |3 
 2 files changed, 210 insertions(+), 27 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1140,6 +1140,54 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the ex extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass ex - 1 as argument instead of ex.
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done = 0;
+   int uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+   while (ex < EXT_LAST_EXTENT(eh))
+   {
+   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+   break;
+   /* merge with next extent! */
+   if (ext4_ext_is_uninitialized(ex))
+   uninitialized = 1;
+   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+   + ext4_ext_get_actual_len(ex + 1));
+   if (uninitialized)
+   ext4_ext_mark_uninitialized(ex);
+
+   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+   len = (EXT_LAST_EXTENT(eh) - ex - 1)
+   * sizeof(struct ext4_extent);
+   memmove(ex + 1, ex + 2, len);
+   }
+   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+   merge_done = 1;
+   WARN_ON(eh->eh_entries == 0);
+   if (!eh->eh_entries)
+   ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+  "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+   }
+
+   return merge_done;
+}
+
+/*
  * check if a portion of the newext extent overlaps with an
  * existing extent.
  *
@@ -1327,25 +1375,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-   while (nearex < EXT_LAST_EXTENT(eh)) {
-   if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-   break;
-   /* merge with next extent! */
-   if (ext4_ext_is_uninitialized(nearex))
-   uninitialized = 1;
-   nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-   + ext4_ext_get_actual_len(nearex + 1));
-   if (uninitialized)
-   ext4_ext_mark_uninitialized(nearex);
-
-   if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-   len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-   * sizeof(struct ext4_extent);
-   memmove(nearex + 1, nearex + 2, len);
-   }
-   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-   BUG_ON(eh->eh_entries == 0);
-   }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -2011,15 +2041,152 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (upto three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Somone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+

Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-16 Thread Amit K. Arora

On Tue, May 15, 2007 at 05:42:46PM -0700, Mingming Cao wrote:
 On Wed, 2007-05-16 at 01:33 +0530, Amit K. Arora wrote:
  This patch implements sys_fallocate() and adds support on i386, x86_64
  and powerpc platforms.
 
  @@ -1137,6 +1148,8 @@ struct inode_operations {
  ssize_t (*listxattr) (struct dentry *, char *, size_t);
  int (*removexattr) (struct dentry *, const char *);
  void (*truncate_range)(struct inode *, loff_t, loff_t);
  +   long (*fallocate)(struct inode *inode, int mode, loff_t offset,
  + loff_t len);
   };
 
 Does the return value from fallocate inode operation has to be *long*?
 It's not consistent with the ext4_fallocate() define in patch 4/5, 

I think ->fallocate() should return a long, since sys_fallocate() has
to return what ->fallocate() returns and hence their return types should
ideally match.
 
 +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t
 len)

I will change the ext4_fallocate() to return a long (in patch 4/5)
in the next post.

Agree ?

Thanks!
--
Regards,
Amit Arora

 
 thus cause compile warnings.
 
 
 
 Mingming

[PATCH 3/5][TAKE3] ext4: Extent overlap bugfix

2007-05-15 Thread Amit K. Arora

This patch adds a check for overlap of extents and cuts short the
new extent to be inserted, if there is a chance of overlap.
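
The trimming idea can be shown on its own with plain integers. This is a
hedged sketch rather than the kernel function: sketch_trim_overlap is a
hypothetical name, block numbers are plain 32-bit values, UINT32_MAX
stands in for EXT_MAX_BLOCK, and the caller is assumed to pass a
next_start that is greater than start.

```c
#include <stdint.h>

/*
 * Hypothetical helper: shorten *len so that [start, start + *len) neither
 * wraps through zero nor runs into the next allocated block, which begins
 * at next_start (assumed > start). Returns 1 if *len was shortened.
 */
int sketch_trim_overlap(uint32_t start, uint32_t *len, uint32_t next_start)
{
    int trimmed = 0;

    /* Unsigned addition wraps, so a smaller sum means overflow. */
    if (start + *len < start) {
        *len = UINT32_MAX - start;
        trimmed = 1;
    }

    /* Cut the range short where the next allocated block begins. */
    if (start + *len > next_start) {
        *len = next_start - start;
        trimmed = 1;
    }

    return trimmed;
}
```

The wrap check has to come first: after a wrap, the comparison against
next_start would see a small sum and miss the overlap.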

Changelog:
-
Note: The changes below are from the initial post (dated 26th April,
2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
version on which this patch is based. TAKE2 was based on 2.6.21 and this
is based on 2.6.22-rc1.
As suggested by Andrew, a check for wrap through zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |   60 ++--
 include/linux/ext4_fs_extents.h |1 
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1128,6 +1128,55 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * check if a portion of the newext extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+   struct ext4_extent *newext,
+   struct ext4_ext_path *path)
+{
+   unsigned long b1, b2;
+   unsigned int depth, len1;
+   unsigned int ret = 0;
+
+   b1 = le32_to_cpu(newext->ee_block);
+   len1 = le16_to_cpu(newext->ee_len);
+   depth = ext_depth(inode);
+   if (!path[depth].p_ext)
+   goto out;
+   b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+   /*
+* get the next allocated block if the extent in the path
+* is before the requested block(s) 
+*/
+   if (b2 < b1) {
+   b2 = ext4_ext_next_allocated_block(path);
+   if (b2 == EXT_MAX_BLOCK)
+   goto out;
+   }
+
+   /* check for wrap through zero */
+   if (b1 + len1 < b1) {
+   len1 = EXT_MAX_BLOCK - b1;
+   newext->ee_len = cpu_to_le16(len1);
+   ret = 1;
+   }
+
+   /* check for overlap */
+   if (b1 + len1 > b2) {
+   newext->ee_len = cpu_to_le16(b2 - b1);
+   ret = 1;
+   }
+out:
+   return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2031,7 +2080,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
-   allocated = max_blocks;
+
+   /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+   newex.ee_block = cpu_to_le32(iblock);
+   newex.ee_len = cpu_to_le16(max_blocks);
+   err = ext4_ext_check_overlap(inode, &newex, path);
+   if (err)
+   allocated = le16_to_cpu(newex.ee_len);
+   else
+   allocated = max_blocks;
newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
if (!newblock)
goto out2;
@@ -2039,7 +2096,6 @@ int ext4_ext_get_blocks(handle_t *handle
goal, newblock, allocated);
 
/* try to insert new extent into found leaf and return */
-   newex.ee_block = cpu_to_le32(iblock);
ext4_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
===
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *);

[PATCH 4/5][TAKE3] ext4: fallocate support in ext4

2007-05-15 Thread Amit K. Arora

This patch implements the ->fallocate() inode operation in ext4. With
this patch, users of ext4 file systems will be able to use the
fallocate() system call for persistent preallocation.

The current implementation only supports preallocation for regular files
with extent maps (directories are not supported yet). This patch does
not support block-mapped files.

Only FA_ALLOCATE mode is supported as of now. Supporting FA_DEALLOCATE
mode is a ToDo item.

Changelog:
-
Note: The changes below are from the initial post (dated 26th April,
2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
version on which this patch is based and point 8) below.
TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1.

Here are the changes from the previous post:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate()
 7) Converted macros that handle uninitialized extents into inline
functions.
 8) Removed unnecessary EXPORT_SYMBOL(ext4_fallocate);.

Here is the updated patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  240 +---
 fs/ext4/file.c  |1 
 include/linux/ext4_fs.h |8 +
 include/linux/ext4_fs_extents.h |   12 ++
 4 files changed, 220 insertions(+), 41 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1144,7 +1156,7 @@

[PATCH 0/5][TAKE2] fallocate system call

2007-05-14 Thread Amit K. Arora

This is the new set of patches which take care of the review comments
received from the community (mainly from Andrew).

Description:
---
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s), even if the file
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Though this has the advantage of
working on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). File systems can do this more
efficiently within the kernel by implementing the proposed fallocate()
system call. It is expected that posix_fallocate() will be modified to
call this new system call first and, in case the kernel/filesystem does
not implement it, it should fall
back to the current implementation of writing zeroes to the new blocks.
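
The fallback rule described above can be sketched as follows. This is
not glibc code: the sketch_ names and the callback type are hypothetical
stand-ins for "ask the kernel" and "write zeroes", and the choice of
-ENOSYS and -EOPNOTSUPP as the fallback triggers follows the
sys_fallocate() comment in patch 1/5.

```c
#include <errno.h>

/* Hypothetical signature shared by both allocation strategies. */
typedef int (*sketch_alloc_fn)(int fd, long long offset, long long len);

/* Returns 1 when the error means "not implemented here, use the library". */
int sketch_should_fall_back(int err)
{
    return err == -ENOSYS || err == -EOPNOTSUPP;
}

/*
 * Try the kernel path first. Only an unsupported-operation error triggers
 * the slow zero-filling path; any other result (success, or a real error
 * such as -ENOSPC) is passed straight back to the caller.
 */
int sketch_preallocate(int fd, long long offset, long long len,
                       sketch_alloc_fn kernel_path,
                       sketch_alloc_fn zero_fill_path)
{
    int ret = kernel_path(fd, offset, len);

    if (sketch_should_fall_back(ret))
        return zero_fill_path(fd, offset, len);
    return ret;
}
```

Passing real errors such as -ENOSPC through unchanged avoids retrying a
request that the zero-filling path could not satisfy either.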

Interface:
-
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
a given file (specified by fd). This mode changes the file size if
the preallocation is done beyond the EOF. It also updates the
    ctime/mtime in the inode of the corresponding file, marking a
    successful allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
previously preallocated blocks. This also may change the file size
and the ctime/mtime.
* New modes might get added in future. One such new mode which is
  already under discussion is FA_PREALLOCATE, which when used will
  preallocate space but will not change the filesize and [cm]time.
  Since the semantics of this new mode are not clear and agreed upon yet,
  this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).
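
The argument rules above, together with the checks visible in patch 1/5,
can be sketched as a standalone validation function. The SKETCH_ mode
values and the function name are assumptions made for this example (the
fs.h hunk that defines the real FA_* constants is not shown in this
thread), and the overflow test is written so that it never overflows
itself.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed values for illustration; the real FA_* definitions live in fs.h. */
#define SKETCH_FA_ALLOCATE   0x1
#define SKETCH_FA_DEALLOCATE 0x2

/* Returns 0 if the arguments are acceptable, or a negative errno value. */
int sketch_check_fallocate_args(int mode, int64_t offset, int64_t len)
{
    /* A negative offset or a non-positive length is invalid. */
    if (offset < 0 || len <= 0)
        return -EINVAL;

    /* Only the two documented modes are understood. */
    if (mode != SKETCH_FA_ALLOCATE && mode != SKETCH_FA_DEALLOCATE)
        return -EOPNOTSUPP;

    /* Reject ranges whose end would wrap past the largest offset. */
    if (offset > INT64_MAX - len)
        return -EFBIG;

    return 0;
}
```

Comparing offset against INT64_MAX - len avoids computing offset + len,
which would be undefined behaviour for signed integers if it overflowed.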
  

sys_fallocate() on s390:
---
There is a problem with s390 ABI to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch to
solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
But, this seems to be the best solution so far.

Known Problem:
-
Mmapped writes into uninitialized extents are a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite(), and a
generic block_page_mkwrite() implementation has already been posted, we
can implement this at a later time. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-
1> Implementation on other architectures (other than i386, x86_64,
ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Changelog:
-
Each post will have an individual changelog for the particular patch.
The following patches follow in separate posts:

Patch 1/5 : fallocate() implementation on i86, x86_64 and powerpc
Patch 2/5 : fallocate() on s390
Patch 3/5 : ext4: Extent overlap bugfix
Patch 4/5 : ext4: fallocate support in ext4
Patch 5/5 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora

[PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc

2007-05-14 Thread Amit K. Arora

This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.

Changelog:
-
Following changes were made to the previous version:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len<=0 (with the new draft that Ulrich pointed to,
posix_fallocate should return EINVAL for len <= 0).
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
 4) Do not return ENODEV for dirs (let individual file systems decide if
they want to support preallocation to directories or not).
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
 7) Added mode descriptions in fs.h
 8) Added variable names to function definition (fallocate inode op)

Here is the new patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 arch/i386/kernel/syscall_table.S |1 
 arch/powerpc/kernel/sys_ppc32.c  |7 +++
 arch/x86_64/kernel/functionlist  |1 
 fs/open.c|   89 +++
 include/asm-i386/unistd.h|3 -
 include/asm-powerpc/systbl.h |1 
 include/asm-powerpc/unistd.h |3 -
 include/asm-x86_64/unistd.h  |4 +
 include/linux/fs.h   |   13 +
 include/linux/syscalls.h |1 
 10 files changed, 120 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.21/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_fallocate /* 320 */
Index: linux-2.6.21/arch/x86_64/kernel/functionlist
===
--- linux-2.6.21.orig/arch/x86_64/kernel/functionlist
+++ linux-2.6.21/arch/x86_64/kernel/functionlist
@@ -931,6 +931,7 @@
 *(.text.sys_getitimer)
 *(.text.sys_getgroups)
 *(.text.sys_ftruncate)
+*(.text.sys_fallocate)
 *(.text.sysfs_lookup)
 *(.text.sys_exit_group)
 *(.text.stub_fork)
Index: linux-2.6.21/fs/open.c
===
--- linux-2.6.21.orig/fs/open.c
+++ linux-2.6.21/fs/open.c
@@ -351,6 +351,95 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *   FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ * requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * Note: In case the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ * 0   : On SUCCESS a value of zero is returned.
+ * error   : On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * TBD Generic fallocate to be added for file systems that do not
+ *  support fallocate it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+   if (offset < 0 || len <= 0)
+   goto out;
+
+   /* Return error if mode is not supported */
+   ret = -EOPNOTSUPP;
+   if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+   goto out;
+
+   ret = -EBADF;
+   file = fget(fd);
+   if (!file)
+   goto out;
+   if (!(file->f_mode & FMODE_WRITE))
+   goto out_fput;
+
+   inode = file->f_path.dentry->d_inode;
+
+   ret = -ESPIPE;
+   if (S_ISFIFO(inode->i_mode))
+   goto out_fput;
+
+   ret = -ENODEV;
+   /*
+* Let individual file system decide if it supports preallocation
+* for directories or not.
+*/
+   if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+   goto out_fput;
+
+   ret = -EFBIG;
+   /* Check for wrap through zero too */
+   if (((offset +

[PATCH 2/5][TAKE2] fallocate() on s390

2007-05-14 Thread Amit K. Arora

This is the patch suggested by Martin Schwidefsky. Here are the comments
and patch from him.

-
From: Martin Schwidefsky [EMAIL PROTECTED]

This patch implements support of fallocate system call on s390(x)
platform. A wrapper is added to address the issue which s390 ABI has
with the arguments of this system call.
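
For readers following the argument handling: on 31-bit s390 the 64-bit len arrives as two 32-bit halves and has to be put back together. A minimal user-space sketch of that reassembly follows. Note that it uses shifts rather than the union used in the patch, and the name join_len is made up for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only, not the kernel code: rebuild a 64-bit length from
 * the two 32-bit halves that a 31-bit caller passes separately. Shifts
 * are used so the result does not depend on the host's byte order.
 */
uint64_t join_len(uint32_t len_high, uint32_t len_low)
{
	return ((uint64_t)len_high << 32) | len_low;
}
```

The union in the patch gives the same result on s390 because the platform is big-endian, so the high word comes first in memory.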

Signed-off-by: Martin Schwidefsky [EMAIL PROTECTED]
---

 arch/s390/kernel/compat_wrapper.S |   10 ++
 arch/s390/kernel/sys_s390.c       |   29 +
 arch/s390/kernel/syscalls.S       |    1 +
 include/asm-s390/unistd.h         |    3 ++-
 4 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S
===
--- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
llgtr   %r2,%r2 # char *
llgtr   %r3,%r3 # struct compat_timeval *
jg  compat_sys_utimes
+
+   .globl  sys_fallocate_wrapper
+sys_fallocate_wrapper:
+   lgfr    %r2,%r2         # int
+   lgfr    %r3,%r3         # int
+   sllg    %r4,%r4,32      # get high word of 64bit loff_t
+   lr      %r4,%r5         # get low word of 64bit loff_t
+   sllg    %r5,%r6,32      # get high word of 64bit loff_t
+   l       %r5,164(%r15)   # get low word of 64bit loff_t
+   jg      sys_fallocate
Index: linux-2.6.21/arch/s390/kernel/syscalls.S
===
--- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.21/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL  /* 310 sys_move_pages */
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
Index: linux-2.6.21/arch/s390/kernel/sys_s390.c
===
--- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.21/arch/s390/kernel/sys_s390.c
@@ -286,3 +286,32 @@ int kernel_execve(const char *filename, 
  "d" (__arg3) : "memory");
return __svcres;
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument len is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+  u32 len_high, u32 len_low)
+{
+   union {
+   u64 len;
+   struct {
+   u32 high;
+   u32 low;
+   };
+   } cv;
+   cv.high = len_high;
+   cv.low = len_low;
+   return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.21/include/asm-s390/unistd.h
===
--- linux-2.6.21.orig/include/asm-s390/unistd.h
+++ linux-2.6.21/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu        311
 #define __NR_epoll_pwait   312
 #define __NR_utimes        313
+#define __NR_fallocate     314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /* 
  * There are some system calls that are not present on 64 bit, some
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/5][TAKE2] ext4: Extent overlap bugfix

2007-05-14 Thread Amit K. Arora

This patch adds a check for overlap of extents and cuts short the
new extent to be inserted, if there is a chance of overlap.
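
As a rough illustration of the trimming described above, here is a simplified user-space model (not the kernel code). The names trim_new_extent and MAX_BLOCK are stand-ins invented for this sketch, and the caller is assumed to already know the first logical block of the next existing extent.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLOCK 0xffffffffu	/* stand-in for EXT_MAX_BLOCK in this sketch */

/*
 * Hypothetical simplified model: a new extent covering logical blocks
 * [start, start + len) is about to be inserted, and next_alloc is the
 * first logical block of the nearest existing extent after start
 * (MAX_BLOCK if there is none). Returns the possibly shortened length.
 */
uint32_t trim_new_extent(uint32_t start, uint32_t len, uint32_t next_alloc)
{
	/* Wrap through zero: clamp so the extent ends at MAX_BLOCK. */
	if ((uint32_t)(start + len) < start)
		len = MAX_BLOCK - start;

	/* Overlap: cut the new extent short where the existing one begins. */
	if (next_alloc != MAX_BLOCK && start + len > next_alloc)
		len = next_alloc - start;

	return len;
}
```

For example, a request for 50 blocks at block 100 with an existing extent starting at block 120 would be cut down to 20 blocks.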

Changelog:
-
As suggested by Andrew, a check for wrap through zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c               |   60 ++--
 include/linux/ext4_fs_extents.h |    1 
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1129,6 +1129,55 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * check if a portion of the newext extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+   struct ext4_extent *newext,
+   struct ext4_ext_path *path)
+{
+   unsigned long b1, b2;
+   unsigned int depth, len1;
+   unsigned int ret = 0;
+
+   b1 = le32_to_cpu(newext->ee_block);
+   len1 = le16_to_cpu(newext->ee_len);
+   depth = ext_depth(inode);
+   if (!path[depth].p_ext)
+   goto out;
+   b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+   /*
+* get the next allocated block if the extent in the path
+* is before the requested block(s) 
+*/
+   if (b2 < b1) {
+   b2 = ext4_ext_next_allocated_block(path);
+   if (b2 == EXT_MAX_BLOCK)
+   goto out;
+   }
+
+   /* check for wrap through zero */
+   if (b1 + len1 < b1) {
+   len1 = EXT_MAX_BLOCK - b1;
+   newext->ee_len = cpu_to_le16(len1);
+   ret = 1;
+   }
+
+   /* check for overlap */
+   if (b1 + len1 > b2) {
+   newext->ee_len = cpu_to_le16(b2 - b1);
+   ret = 1;
+   }
+out:
+   return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2032,7 +2081,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
-   allocated = max_blocks;
+
+   /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+   newex.ee_block = cpu_to_le32(iblock);
+   newex.ee_len = cpu_to_le16(max_blocks);
+   err = ext4_ext_check_overlap(inode, &newex, path);
+   if (err)
+   allocated = le16_to_cpu(newex.ee_len);
+   else
+   allocated = max_blocks;
newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
if (!newblock)
goto out2;
@@ -2040,7 +2097,6 @@ int ext4_ext_get_blocks(handle_t *handle
goal, newblock, allocated);
 
/* try to insert new extent into found leaf and return */
-   newex.ee_block = cpu_to_le32(iblock);
ext4_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *);

[PATCH 4/5][TAKE2] ext4: fallocate support in ext4

2007-05-14 Thread Amit K. Arora

This patch implements the ->fallocate() inode operation in ext4. With this
patch, users of ext4 file systems will be able to use the fallocate() system
call for persistent preallocation.

The current implementation only supports preallocation for regular files
with extent maps. Directories and block-mapped files are not supported
yet.

Only FA_ALLOCATE mode is supported as of now. Supporting
FA_DEALLOCATE mode is a to-do item.

Changelog:
-
Here are the changes from the previous post:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate()
 7) Converted macros that handle uninitialized extents into inline
functions.

Here is the updated patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c               |  241 +---
 fs/ext4/file.c                  |    1 
 include/linux/ext4_fs.h         |    8 +
 include/linux/ext4_fs_extents.h |   12 +
 4 files changed, 221 insertions(+), 41 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1107,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int ret = 0;
 
b1 = le32_to_cpu(newext->ee_block);
-   len1 = le16_to_cpu(newext->ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1192,8 +1204,9 @@ int

[PATCH 5/5][TAKE2] ext4: write support for preallocated blocks

2007-05-14 Thread Amit K. Arora

This patch adds write support for the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.
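
To make the "up to three" case concrete, here is a small user-space model of the split. This is only a sketch under simplifying assumptions: struct piece and split_uninit_extent are names invented for illustration, the write is assumed to lie entirely inside the uninitialized extent, and merging with neighbours is left out.

```c
#include <assert.h>
#include <stdint.h>

struct piece {
	uint32_t start;		/* first logical block */
	uint32_t len;		/* number of blocks */
	int uninit;		/* 1 = still uninitialized, 0 = initialized */
};

/*
 * Hypothetical model of the split: the uninitialized extent
 * [ext_start, ext_start + ext_len) receives a write covering
 * [w_start, w_start + w_len), assumed to lie fully inside it.
 * Fills out[] with one, two or three pieces and returns how many.
 */
int split_uninit_extent(uint32_t ext_start, uint32_t ext_len,
			uint32_t w_start, uint32_t w_len,
			struct piece out[3])
{
	uint32_t ext_end = ext_start + ext_len;
	uint32_t w_end = w_start + w_len;
	int n = 0;

	/* Left remainder stays uninitialized. */
	if (w_start > ext_start)
		out[n++] = (struct piece){ ext_start, w_start - ext_start, 1 };

	/* The written range becomes an initialized extent. */
	out[n++] = (struct piece){ w_start, w_len, 0 };

	/* Right remainder stays uninitialized. */
	if (w_end < ext_end)
		out[n++] = (struct piece){ w_end, ext_end - w_end, 1 };

	return n;
}
```

A write covering the whole extent yields one piece, a write at either end yields two, and a write in the middle yields three.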

Changelog:
-
 1) Replaced BUG_ON with WARN_ON & ext4_error.
 2) Added variable names to the function declaration of
ext4_ext_try_to_merge().
 3) Updated variable declarations to use multiple-definitions-per-line.
 4) if((a=foo())).. was broken into a=foo(); if(a)..
 5) Removed extra spaces.

Here is the updated patch:

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c               |  234 +++-
 include/linux/ext4_fs_extents.h |    3 
 2 files changed, 210 insertions(+), 27 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1141,6 +1141,54 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the ex extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass ex - 1 as argument instead of ex.
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done = 0;
+   int uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+   while (ex < EXT_LAST_EXTENT(eh))
+   {
+   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+   break;
+   /* merge with next extent! */
+   if (ext4_ext_is_uninitialized(ex))
+   uninitialized = 1;
+   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+   + ext4_ext_get_actual_len(ex + 1));
+   if (uninitialized)
+   ext4_ext_mark_uninitialized(ex);
+
+   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+   len = (EXT_LAST_EXTENT(eh) - ex - 1)
+   * sizeof(struct ext4_extent);
+   memmove(ex + 1, ex + 2, len);
+   }
+   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+   merge_done = 1;
+   WARN_ON(eh->eh_entries == 0);
+   if (!eh->eh_entries)
+   ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+  "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+   }
+
+   return merge_done;
+}
+
+/*
  * check if a portion of the newext extent overlaps with an
  * existing extent.
  *
@@ -1328,25 +1376,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-   while (nearex < EXT_LAST_EXTENT(eh)) {
-   if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-   break;
-   /* merge with next extent! */
-   if (ext4_ext_is_uninitialized(nearex))
-   uninitialized = 1;
-   nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-   + ext4_ext_get_actual_len(nearex + 1));
-   if (uninitialized)
-   ext4_ext_mark_uninitialized(nearex);
-
-   if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-   len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-   * sizeof(struct ext4_extent);
-   memmove(nearex + 1, nearex + 2, len);
-   }
-   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-   BUG_ON(eh->eh_entries == 0);
-   }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -2012,15 +2042,152 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Someone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+   struct ext4_ext_path *path,
+   ext4_fsblk_t iblock,
+

Re: [PATCH 2/5][TAKE2] fallocate() on s390 - glibc wrapper

2007-05-14 Thread Amit K. Arora

On Mon, May 14, 2007 at 08:18:34PM +0530, Amit K. Arora wrote:
 This is the patch suggested by Martin Schwidefsky. Here are the comments
 and patch from him.

Martin also suggested a wrapper in glibc to handle this system call on
s390. Posting it here so that we get feedback for this too.
Here it is:

.globl __fallocate
ENTRY(__fallocate)
        stm     %r6,%r7,28(%r15)    /* save %r6/%r7 on stack */
        cfi_offset (%r7, -68)
        cfi_offset (%r6, -72)
        lm      %r6,%r7,96(%r15)    /* load loff_t len from stack */
        svc     SYS_ify(fallocate)
        lm      %r6,%r7,28(%r15)    /* restore %r6/%r7 from stack */
        br      %r14
PSEUDO_END(__fallocate)

--
Regards,
Amit Arora
 
 -
 From: Martin Schwidefsky [EMAIL PROTECTED]
 
 This patch implements support of fallocate system call on s390(x)
 platform. A wrapper is added to address the issue which s390 ABI has
 with the arguments of this system call.
 
 Signed-off-by: Martin Schwidefsky [EMAIL PROTECTED]
 ---
 
  arch/s390/kernel/compat_wrapper.S |   10 ++
  arch/s390/kernel/sys_s390.c   |   29 +
  arch/s390/kernel/syscalls.S   |1 +
  include/asm-s390/unistd.h |3 ++-
  4 files changed, 42 insertions(+), 1 deletion(-)
 
 Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S
 ===
 --- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S
 +++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S
 @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
   llgtr   %r2,%r2 # char *
   llgtr   %r3,%r3 # struct compat_timeval *
   jg  compat_sys_utimes
 +
 + .globl  sys_fallocate_wrapper
 +sys_fallocate_wrapper:
 + lgfr%r2,%r2 # int
 + lgfr%r3,%r3 # int
 + sllg%r4,%r4,32  # get high word of 64bit loff_t
 + lr  %r4,%r5 # get low word of 64bit loff_t
 + sllg%r5,%r6,32  # get high word of 64bit loff_t
 + l   %r5,164(%r15)   # get low word of 64bit loff_t
 + jg  sys_fallocate
 Index: linux-2.6.21/arch/s390/kernel/syscalls.S
 ===
 --- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
 +++ linux-2.6.21/arch/s390/kernel/syscalls.S
 @@ -322,3 +322,4 @@ NI_SYSCALL
 /* 310 sys_move_pages *
  SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
  SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
  SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
 +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
 Index: linux-2.6.21/arch/s390/kernel/sys_s390.c
 ===
 --- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c
 +++ linux-2.6.21/arch/s390/kernel/sys_s390.c
 @@ -286,3 +286,32 @@ int kernel_execve(const char *filename, 
 d (__arg3) : memory);
   return __svcres;
  }
 +
 +#ifndef CONFIG_64BIT
 +/*
 + * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
 + * 64 bit argument len is split into the upper and lower 32 bits. The
 + * system call wrapper in the user space loads the value to %r6/%r7.
 + * The code in entry.S keeps the values in %r2 - %r6 where they are and
 + * stores %r7 to 96(%r15). But the standard C linkage requires that
 + * the whole 64 bit value for len is stored on the stack and doesn't
 + * use %r6 at all. So s390_fallocate has to convert the arguments from
 + *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
 + * to
 + *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
 + */
 +asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
 +u32 len_high, u32 len_low)
 +{
 + union {
 + u64 len;
 + struct {
 + u32 high;
 + u32 low;
 + };
 + } cv;
 + cv.high = len_high;
 + cv.low = len_low;
 + return sys_fallocate(fd, mode, offset, cv.len);
 +}
 +#endif
 Index: linux-2.6.21/include/asm-s390/unistd.h
 ===
 --- linux-2.6.21.orig/include/asm-s390/unistd.h
 +++ linux-2.6.21/include/asm-s390/unistd.h
 @@ -251,8 +251,9 @@
  #define __NR_getcpu  311
  #define __NR_epoll_pwait 312
  #define __NR_utimes  313
 +#define __NR_fallocate   314
 
 -#define NR_syscalls 314
 +#define NR_syscalls 315
 
  /* 
   * There are some system calls that are not present on 64 bit, some

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-10 Thread Amit K. Arora

On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
 On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
  I have the updated patches ready which take care of Andrew's comments.
  Will run some tests and post them soon.
  
  But, before submitting these patches, I think it will be better to finalize
  on certain things which might be worth some discussion here:
  
  1) Should the file size change when preallocation is done beyond EOF ?
 - Andreas and Chris Wedgwood are in favor of not changing the
   file size in this case. I also tend to agree with them. Does anyone
   has an argument in favor of changing the filesize ?
   If not, I will remove the code which changes the filesize, before I
   resubmit the concerned ext4 patch.
 
 I think there needs to be both. If we don't have a mechanism to
 atomically change the file size with the preallocation, then
 applications that use stat() to work out if they need to preallocate
 more space will end up racing.

By "both" above, do you mean we should let the user choose whether the
file size gets changed or not ? That can be done by having *two* modes
for preallocation in the system call - say FA_PREALLOCATE and
FA_ALLOCATE. With FA_PREALLOCATE mode, fallocate() will allocate
blocks, but will not change the file size or [cm]time. With FA_ALLOCATE
mode, fallocate() will change the file size if required (i.e.
when allocation is beyond EOF) and also update [cm]time.
This way, the application can decide what it wants.

This will be helpful for the partial allocation scenario also. Think of
the case where we do not change the file size in fallocate() and expect
applications/posix_fallocate() to do ftruncate() after fallocate() for
this. Now if fallocate() results in a partial allocation with -ENOSPC
error returned, applications/posix_fallocate() will not know for what
length ftruncate() has to be called.  :(

Hence it may be a good idea to let the user choose whether to
atomically change the file size with preallocation or not. But with
more flexibility comes inconsistency in behavior, which is worth
considering.
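
To make the two proposed modes easier to compare, here is a toy user-space model of their effect on the file size. Everything in it is an assumption made for illustration: the SK_ constants, struct file_model and model_fallocate() do not exist anywhere, and "allocated" stands for however much the file system actually managed to reserve.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mode values for this sketch only; not from any header. */
enum { SK_FA_ALLOCATE = 1, SK_FA_PREALLOCATE = 2 };

struct file_model {
	int64_t size;		/* file size as seen by stat() */
	int64_t reserved_end;	/* end of the furthest reserved byte */
};

/*
 * Model what each proposed mode would do once blocks covering
 * [offset, offset + allocated) have been reserved. "allocated" may be
 * smaller than the requested length after a partial allocation.
 */
void model_fallocate(struct file_model *f, int mode,
		     int64_t offset, int64_t allocated)
{
	int64_t end = offset + allocated;

	if (end > f->reserved_end)
		f->reserved_end = end;

	/* Only the FA_ALLOCATE-style mode moves EOF, and only forward. */
	if (mode == SK_FA_ALLOCATE && end > f->size)
		f->size = end;
}
```

In this model the FA_ALLOCATE-style mode leaves the size matching what was really allocated, even after a partial allocation, so the caller would not have to guess a length for ftruncate().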

 
  2) For FA_UNALLOCATE mode, should the file system allow unallocation
 of normal (non-preallocated) blocks (blocks allocated via
 regular write/truncate operations) also (i.e. work as punch()) ?
 
 Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
 what i did for FA_UNALLOCATE as well.

Ok. But, some people may not expect/like this. I think, we can keep it
on the backburner for a while, till other issues are sorted out.
 
 - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
   we need to finalize on the convention here as a general guideline
   to all the filesystems that implement fallocate.
  
  3) If above is true, the file size will need to be changed
 for unallocation when block holding the EOF gets unallocated.
 
 No - we punch a hole. If you want the filesize to change, then
 you use ftruncate() to remove the blocks at EOF and change the
 file size atomically.

Ok.
 
  4) Should we update mtime & ctime on a successful allocation/
 unallocation ?
 - David Chinner raised this question in following post:
   http://lkml.org/lkml/2007/4/29/407
   I think it makes sense to update the [mc]time for a successful
   preallocation/unallocation. Does anyone feel otherwise ?
   It will be interesting to know how XFS behaves currently. Does XFS
   update [mc]time for preallocation ?
 
 No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
 changes. If the filesize changes, it behaves exactly the same way that
 ftruncate() behaves.

Having additional mode (of FA_PREALLOCATE) might help here too. Please
see above.

--
Regards,
Amit Arora

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Amit K. Arora

I have the updated patches ready which take care of Andrew's comments.
Will run some tests and post them soon.

But, before submitting these patches, I think it will be better to finalize
on certain things which might be worth some discussion here:

1) Should the file size change when preallocation is done beyond EOF ?
   - Andreas and Chris Wedgwood are in favor of not changing the
 file size in this case. I also tend to agree with them. Does anyone
 have an argument in favor of changing the filesize ?
 If not, I will remove the code which changes the filesize, before I
 resubmit the concerned ext4 patch.

2) For FA_UNALLOCATE mode, should the file system allow unallocation
   of normal (non-preallocated) blocks (blocks allocated via
   regular write/truncate operations) also (i.e. work as punch()) ?
   - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
 we need to finalize on the convention here as a general guideline
 to all the filesystems that implement fallocate.

3) If above is true, the file size will need to be changed
   for unallocation when block holding the EOF gets unallocated.
   - If we do not unallocate normal (non-preallocated) blocks and we
 do not change the file size on preallocation, then this is a
 non-issue.

4) Should we update mtime & ctime on a successful allocation/
   unallocation ?
   - David Chinner raised this question in following post:
 http://lkml.org/lkml/2007/4/29/407
 I think it makes sense to update the [mc]time for a successful
 preallocation/unallocation. Does anyone feel otherwise ?
 It will be interesting to know how XFS behaves currently. Does XFS
 update [mc]time for preallocation ?


--
Regards,
Amit Arora

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-08 Thread Amit K. Arora

On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote:
 On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
  On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
   On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] 
   wrote:
 
+int ext4_fallocate(struct inode *inode, int mode, loff_t offset, 
loff_t len)
+{
+   handle_t *handle;
+   ext4_fsblk_t block, max_blocks;
+   int ret, ret2, nblocks = 0, retries = 0;
+   struct buffer_head map_bh;
+   unsigned int credits, blkbits = inode->i_blkbits;
+
+   /* Currently supporting (pre)allocate mode _only_ */
+   if (mode != FA_ALLOCATE)
+   return -EOPNOTSUPP;
+
+   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+   return -ENOTTY;
   
   So we don't implement fallocate on bitmap-based files!  Well that's huge
   news.  The changelog would be an appropriate place to communicate this,
   along with reasons why, or a description of the plan to fix it.
  
  Ok. Will add this in the function description as well.
  
   Also, posix says nothing about fallocate() returning ENOTTY.
  
  Right. I don't seem to find any suitable error from posix description.
  Can you please suggest an error code which might make more sense here ?
  Will -ENOTSUPP be ok ? Since we want to say here that we don't support
  non-extent files.
 
 Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
 here, and fall back to the current library code to do preallocation?
 This way, the caller of fallocate() will never see this return code, so
 it won't violate posix.

You are right.

But, we still need to standardize (and limit) the error codes
which we should return from kernel when we want to fall back on the
library implementation. The posix_fallocate() library function will have
to look for a set of errors from fallocate() system call, upon receiving
which it will do preallocation from user level; or else, it will return
success/error-code returned by the system call to the user.

I think we can make it fall back to library implementation of fallocate,
whenever posix_fallocate() receives any of the following errors from
fallocate() system call:

1. ENOSYS
2. EOPNOTSUPP
3. ENOTTY(?)

Now the question is - should we limit the set of errors for this purpose
to just 1 & 2 above ? In that case I will need to change the error being
returned here to -EOPNOTSUPP (from the current -ENOTTY).
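
To illustrate the rule being proposed, here is a rough sketch of the decision posix_fallocate() would make if the set were limited to 1 & 2. It is not glibc code, and the helper name fallocate_should_fall_back is made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not part of glibc): decide whether an error from
 * the fallocate() system call means "not supported here, emulate it in
 * user space" rather than a real failure to report to the caller.
 */
bool fallocate_should_fall_back(int err)
{
	switch (err) {
	case ENOSYS:		/* kernel has no fallocate() system call */
	case EOPNOTSUPP:	/* file system or file type cannot preallocate */
		return true;
	default:		/* e.g. ENOSPC, EFBIG, EBADF: report as-is */
		return false;
	}
}
```

With this rule an -ENOTTY from the kernel would be passed straight back to the application, which is why ext4 would have to return -EOPNOTSUPP for non-extent files instead.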

--
Regards,
Amit Arora

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-07 Thread Amit K. Arora

Andrew,

Thanks for the review comments!

On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:33:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  This patch implements the fallocate() system call and adds support for
  i386, x86_64 and powerpc.
  
  ...
 
  +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
 
 Please add a comment over this function which specifies its behaviour. 
 Really it should be enough material from which a full manpage can be
 written.
 
 If that's all too much, this material should at least be spelled out in the
 changelog.  Because there's no way in which this change can be fully
 reviewed unless someone (ie: you) tells us what it is setting out to
 achieve.
 
 If we 100% implement some standard then a URL for what we claim to
 implement would suffice.  Given that we're at least using different types from
 posix I doubt if such a thing would be sufficient.
 
 And given the complexity and potential variability within the filesystem
 implementations of this, I'd expect that _something_ additional needs to be
 said?

Ok. I will add a detailed comment here.

 
  +{
  +   struct file *file;
  +   struct inode *inode;
  +   long ret = -EINVAL;
  +
  +   if (len == 0 || offset < 0)
  +   goto out;
 
 The posix spec implies that negative `len' is permitted - presumably allocate
 ahead of `offset'.  How peculiar.

I think we should go ahead with the current glibc implementation (which
Jakub pointed at) of not allowing a negative 'len', since posix also
doesn't explicitly say anything about allowing negative 'len'.

 
  +   ret = -EBADF;
  +   file = fget(fd);
  +   if (!file)
  +   goto out;
  +   if (!(file->f_mode & FMODE_WRITE))
  +   goto out_fput;
  +
  +   inode = file->f_path.dentry->d_inode;
  +
  +   ret = -ESPIPE;
  +   if (S_ISFIFO(inode->i_mode))
  +   goto out_fput;
  +
  +   ret = -ENODEV;
  +   if (!S_ISREG(inode->i_mode))
  +   goto out_fput;
 
 So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
 seems a bit silly of them.

True. 
 
  +   ret = -EFBIG;
  +   if (offset + len > inode->i_sb->s_maxbytes)
  +   goto out_fput;
 
 This code does handle offset+len going negative, but only by accident, I
 suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
 here would settle the reader's mind.

Ok. I will add a check here for wrap through zero.
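
One possible shape for such a check, as a user-space sketch only: the function name check_range() is hypothetical, long long stands in for loff_t, and returning -EFBIG for the wrap case is an assumption rather than something settled in this thread.

```c
#include <errno.h>
#include <limits.h>

/*
 * Illustrative only: reject ranges whose end would overflow a signed
 * 64-bit offset or exceed the filesystem limit. Assumes the caller has
 * already verified offset >= 0 and len > 0.
 */
static int check_range(long long offset, long long len, long long maxbytes)
{
	if (len > LLONG_MAX - offset)	/* offset + len would wrap */
		return -EFBIG;
	if (offset + len > maxbytes)	/* safe to add now */
		return -EFBIG;
	return 0;
}
```

Testing against LLONG_MAX before adding avoids relying on the unsigned type of s_maxbytes to catch the overflow.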
 
  +   if (inode->i_op && inode->i_op->fallocate)
  +   ret = inode->i_op->fallocate(inode, mode, offset, len);
  +   else
  +   ret = -ENOSYS;
 
 If we _are_ going to support negative `len', as posix suggests, I think we
 should perform the appropriate sanity conversions to `offset' and `len'
 right here, rather than expecting each filesystem to do it.
 
 If we're not going to handle negative `len' then we should check for it.

Will add a check for negative 'len' and return -EINVAL. This will be
done where currently we check for negative offset (i.e. at the start of
the function).
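
A minimal sketch of the combined entry check, assuming the existing "len == 0" test is simply widened to cover negative values; validate_args() is a hypothetical name and long long stands in for loff_t.

```c
#include <errno.h>

/*
 * Illustrative only: the argument check at the top of the syscall,
 * extended so that a negative len is rejected along with a zero len
 * and a negative offset.
 */
static int validate_args(long long offset, long long len)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;
	return 0;
}
```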
 
  +out_fput:
  +   fput(file);
  +out:
  +   return ret;
  +}
  +EXPORT_SYMBOL(sys_fallocate);
 
 I don't believe this needs to be exported to modules?

Ok. Will remove it.
 
  +/*
  + * fallocate() modes
  + */
  +#define FA_ALLOCATE    0x1
  +#define FA_DEALLOCATE  0x2
 
 Now those aren't in posix.  They should be documented, along with their
 expected semantics.

Will add a comment describing the role of these modes.
 
   #ifdef __KERNEL__
   
   #include <linux/linkage.h>
  @@ -1125,6 +1131,7 @@ struct inode_operations {
  ssize_t (*listxattr) (struct dentry *, char *, size_t);
  int (*removexattr) (struct dentry *, const char *);
  void (*truncate_range)(struct inode *, loff_t, loff_t);
  +   long (*fallocate)(struct inode *, int, loff_t, loff_t);
 
 I really do think it's better to put the variable names in definitions such
 as this.  Especially when we have two identically-typed variables next to
 each other like that.  Quick: which one is the offset and which is the
 length?

Ok. Will add the variable names here.
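
For illustration, the named form could look like the sketch below. The struct and function names carrying a demo suffix are hypothetical stand-ins so that the fragment compiles outside the kernel; only the parameter order (inode, mode, offset, len) comes from the patch.

```c
struct inode;			/* opaque here; only pointers are used */
typedef long long demo_loff_t;	/* stand-in for the kernel's loff_t */

/* Cut-down operations table with the parameters named. */
struct demo_inode_operations {
	long (*fallocate)(struct inode *inode, int mode,
			  demo_loff_t offset, demo_loff_t len);
};

/* Trivial implementation used only to exercise the pointer. */
static long demo_fallocate_op(struct inode *inode, int mode,
			      demo_loff_t offset, demo_loff_t len)
{
	(void)inode;
	(void)mode;
	return (offset >= 0 && len > 0) ? 0 : -1;
}
```

With the names in place the reader no longer has to guess which loff_t is the offset and which is the length.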

--
Regards,
Amit Arora

Re: [PATCH 3/5] ext4: Extent overlap bugfix

2007-05-07 Thread Amit K. Arora

On Thu, May 03, 2007 at 09:30:02PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:41:01 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  +unsigned int ext4_ext_check_overlap(struct inode *inode,
  +   struct ext4_extent *newext,
  +   struct ext4_ext_path *path)
  +{
  +   unsigned long b1, b2;
  +   unsigned int depth, len1;
  +
  +   b1 = le32_to_cpu(newext->ee_block);
  +   len1 = le16_to_cpu(newext->ee_len);
  +   depth = ext_depth(inode);
  +   if (!path[depth].p_ext)
  +   goto out;
  +   b2 = le32_to_cpu(path[depth].p_ext->ee_block);
  +
  +   /* get the next allocated block if the extent in the path
  +* is before the requested block(s) */
  +   if (b2 < b1) {
  +   b2 = ext4_ext_next_allocated_block(path);
  +   if (b2 == EXT_MAX_BLOCK)
  +   goto out;
  +   }
  +
  +   if (b1 + len1 > b2) {
 
 Are we sure that b1+len cannot wrap through zero here?

No. Will add a check here for this. Thanks!
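
One way such a check might look, sketched in user space under the assumption that logical block numbers are 32 bits wide; DEMO_MAX_BLOCK and clamp_len() are hypothetical stand-ins, not names from the patch.

```c
#include <stdint.h>

#define DEMO_MAX_BLOCK 0xFFFFFFFFu	/* stand-in for EXT_MAX_BLOCK */

/*
 * Illustrative only: clamp len so that b1 + len cannot wrap a 32-bit
 * logical block number. Returns the usable length.
 */
static uint32_t clamp_len(uint32_t b1, uint32_t len)
{
	if (b1 + len < b1)		/* unsigned wrap through zero */
		return DEMO_MAX_BLOCK - b1;
	return len;
}
```

After clamping, the comparison against b2 can be made without the sum wrapping.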
 
  +   newext->ee_len = cpu_to_le16(b2 - b1);
  +   return 1;
  +   }


--
Regards,
Amit Arora

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Amit K. Arora

On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  This patch has the ext4 implementation of fallocate system call.
  
  ...
 
  +   /* ext4_can_extents_be_merged should have checked that either
  +* both extents are uninitialized, or both aren't. Thus we
  +* need to check only one of them here.
  +*/
 
 Please always format multiline comments like this:
 
    /*
     * ext4_can_extents_be_merged should have checked that either
     * both extents are uninitialized, or both aren't. Thus we
     * need to check only one of them here.
     */

Ok.
 
  ...
 
  +/*
  + * ext4_fallocate:
  + * preallocate space for a file
  + * mode is for future use, e.g. for unallocating preallocated blocks etc.
  + */
 
 This description is rather thin.  What is the filesystem's actual behaviour
 here?  If the file is using extents then the implementation will do
 something.  If the file is using bitmaps then we will do something else.
 
 But what?   Here is where it should be described.

Ok. Will expand the description.
 
  +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
  +{
  +   handle_t *handle;
  +   ext4_fsblk_t block, max_blocks;
  +   int ret, ret2, nblocks = 0, retries = 0;
  +   struct buffer_head map_bh;
  +   unsigned int credits, blkbits = inode->i_blkbits;
  +
  +   /* Currently supporting (pre)allocate mode _only_ */
  +   if (mode != FA_ALLOCATE)
  +   return -EOPNOTSUPP;
  +
  +   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
  +   return -ENOTTY;
 
 So we don't implement fallocate on bitmap-based files!  Well that's huge
 news.  The changelog would be an appropriate place to communicate this,
 along with reasons why, or a description of the plan to fix it.

Ok. Will add this in the function description as well.
 
 Also, posix says nothing about fallocate() returning ENOTTY.

Right. I don't seem to find any suitable error from posix description.
Can you please suggest an error code which might make more sense here ?
Will -ENOTSUPP be ok ? Since we want to say here that we don't support
non-extent files.
 
  +   block = offset >> blkbits;
  +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
  +- block;
  +   mutex_lock(&EXT4_I(inode)->truncate_mutex);
  +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
  +   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
 
 Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
 space, and that this disk space will require an arbitrary amount of
 metadata, how can we work out how much journal space we'll be needing
 without at least looking at `len'?

You are right to say that the credits can not be fixed here. But, 'len'
will not directly tell us how many extents might need to be inserted and
how many block groups (if any - think about the segment range already
being allocated case) the allocation request might touch.
One solution I have thought of is to check the buffer credits after a call
to ext4_ext_get_blocks (in the while loop) and do a journal_extend if the
credits are falling short. In case journal_extend fails, we call
journal_restart. This will automatically take care of how much journal
space we might need for any value of len.
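
The extend-then-restart idea can be sketched with a toy model. Everything below is a hypothetical user-space stand-in (struct demo_handle and the demo_* functions are not the JBD API); it only shows the control flow of topping up credits inside the running transaction and starting a new one when that fails. It assumes the requested amount never exceeds the transaction size.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a journal handle; not the JBD structure. */
struct demo_handle {
	int credits;	/* buffer credits left in this handle */
	int txn_free;	/* credits the running transaction can still grant */
	int restarts;	/* how many times a new transaction was started */
};

/* Try to grow the handle inside the running transaction. */
static bool demo_extend(struct demo_handle *h, int want)
{
	if (h->txn_free < want)
		return false;
	h->txn_free -= want;
	h->credits += want;
	return true;
}

/* Start a fresh transaction of txn_size credits, holding `want` of them. */
static void demo_restart(struct demo_handle *h, int want, int txn_size)
{
	h->credits = want;
	h->txn_free = txn_size - want;
	h->restarts++;
}

/* Make sure at least `need` credits are available before the next chunk. */
static void demo_ensure_credits(struct demo_handle *h, int need, int txn_size)
{
	if (h->credits >= need)
		return;
	if (!demo_extend(h, need - h->credits))
		demo_restart(h, need, txn_size);
}
```

Calling the check once per loop iteration means the handle never has to be sized for the whole request up front.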
 
  +   handle=ext4_journal_start(inode, credits +
 
 Please always put spaces around =
Ok.
 
  +   EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
 
 And around +
Ok.
 
  +   if (IS_ERR(handle))
  +   return PTR_ERR(handle);
  +retry:
  +   ret = 0;
  +   while (ret >= 0 && ret < max_blocks) {
  +   block = block + ret;
  +   max_blocks = max_blocks - ret;
  +   ret = ext4_ext_get_blocks(handle, inode, block,
  + max_blocks, &map_bh,
  + EXT4_CREATE_UNINITIALIZED_EXT, 0);
  +   BUG_ON(!ret);
 
 BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
 ext4_error() would be safer and more useful here.

Ok. Will do that.
 
  +   if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
 
 Use buffer_new() here.   A separate patch which fixes the three existing
 instances of open-coded BH_foo usage would be appreciated.

Ok.
 
  +((block + ret)  (i_size_read(inode)  blkbits)))
 
 Check for wrap through the sign bit and through zero please.
Ok.
 
  +   nblocks = nblocks + ret;
  +   }
  +
  +   if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
  +   goto retry;
  +
  +   /* Time to update the file size.
  +* Update only when preallocation was requested beyond the file size.
  +*/
 
 Fix comment layout.
Ok.
 
  +   if ((offset + len) > i_size_read(inode)) {
 
 Both the lhs and the rhs here are signed.  Please review for possible
 overflows through

Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-05-07 Thread Amit K. Arora

On Thu, May 03, 2007 at 09:32:38PM -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:46:23 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
  + */
  +int ext4_ext_try_to_merge(struct inode *inode,
  +   struct ext4_ext_path *path,
  +   struct ext4_extent *ex)
  +{
  +   struct ext4_extent_header *eh;
  +   unsigned int depth, len;
  +   int merge_done=0, uninitialized = 0;
 
 space around =, please.
 
 Many people prefer not to do the multiple-definitions-per-line, btw:
 
   int merge_done = 0;
   int uninitialized = 0;

Ok. Will make the change.

 
 reasons:
 
 - It gives you some space for a nice comment
 
 - It makes patches much more readable, and it makes rejects easier to fix
 
 - standardisation.
 
  +   depth = ext_depth(inode);
  +   BUG_ON(path[depth].p_hdr == NULL);
  +   eh = path[depth].p_hdr;
  +
  +   while (ex < EXT_LAST_EXTENT(eh)) {
  +   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
  +   break;
  +   /* merge with next extent! */
  +   if (ext4_ext_is_uninitialized(ex))
  +   uninitialized = 1;
  +   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
  +   + ext4_ext_get_actual_len(ex + 1));
  +   if (uninitialized)
  +   ext4_ext_mark_uninitialized(ex);
  +
  +   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
  +   len = (EXT_LAST_EXTENT(eh) - ex - 1)
  +   * sizeof(struct ext4_extent);
  +   memmove(ex + 1, ex + 2, len);
  +   }
  +   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
 
 Kernel convention is to put spaces around -

Will fix this.

 
  +   merge_done = 1;
  +   BUG_ON(eh->eh_entries == 0);
 
 eek, scary BUG_ON.  Do we really need to be that severe?  Would it be
 better to warn and run ext4_error() here?
Ok.
 
  +   }
  +
  +   return merge_done;
  +}
  +
  +
 
  ...
 
  +/*
  + * ext4_ext_convert_to_initialized:
  + * this function is called by ext4_ext_get_blocks() if someone tries to write
  + * to an uninitialized extent. It may result in splitting the uninitialized
  + * extent into multiple extents (upto three). Atleast one initialized extent
  + * and atmost two uninitialized extents can result.
 
 There are some typos here
 
  + * There are three possibilities:
  + *   a> No split required: Entire extent should be initialized.
  + *   b> Split into two extents: Only one end of the extent is being written to.
  + *   c> Split into three extents: Somone is writing in middle of the extent.
 
 and here
 
Ok. Will fix them.
  + */
  +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
  +   struct ext4_ext_path *path,
  +   ext4_fsblk_t iblock,
  +   unsigned long max_blocks)
  +{
  +   struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
  +   struct ext4_extent_header *eh;
  +   unsigned int allocated, ee_block, ee_len, depth;
  +   ext4_fsblk_t newblock;
  +   int err = 0, ret = 0;
  +
  +   depth = ext_depth(inode);
  +   eh = path[depth].p_hdr;
  +   ex = path[depth].p_ext;
  +   ee_block = le32_to_cpu(ex->ee_block);
  +   ee_len = ext4_ext_get_actual_len(ex);
  +   allocated = ee_len - (iblock - ee_block);
  +   newblock = iblock - ee_block + ext_pblock(ex);
  +   ex2 = ex;
  +
  +   /* ex1: ee_block to iblock - 1 : uninitialized */
  +   if (iblock > ee_block) {
  +   ex1 = ex;
  +   ex1->ee_len = cpu_to_le16(iblock - ee_block);
  +   ext4_ext_mark_uninitialized(ex1);
  +   ex2 = &newex;
  +   }
  +   /* for sanity, update the length of the ex2 extent before
  +* we insert ex3, if ex1 is NULL. This is to avoid temporary
  +* overlap of blocks.
  +*/
  +   if (!ex1 && allocated > max_blocks)
  +   ex2->ee_len = cpu_to_le16(max_blocks);
  +   /* ex3: to ee_block + ee_len : uninitialised */
  +   if (allocated > max_blocks) {
  +   unsigned int newdepth;
  +   ex3 = &newex;
  +   ex3->ee_block = cpu_to_le32(iblock + max_blocks);
  +   ext4_ext_store_pblock(ex3, newblock + max_blocks);
  +   ex3->ee_len = cpu_to_le16(allocated - max_blocks);
  +   ext4_ext_mark_uninitialized(ex3);
  +   err = ext4_ext_insert_extent(handle, inode, path, ex3);
  +   if (err)
  +   goto out;
  +   /* The depth, and hence eh & ex might change
  +* as part of the insert above.
  +*/
  +   newdepth = ext_depth(inode);
  +   if (newdepth != depth)
  +   {
 
 Use
 
   if (newdepth != depth) {

Ok.
 
  +   depth=newdepth;
 
 spaces
Ok.
 
  +   path = ext4_ext_find_extent(inode, iblock, NULL);
  +   if (IS_ERR(path

Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-05-07 Thread Amit K. Arora

On Mon, May 07, 2007 at 03:40:26PM +0300, Pekka Enberg wrote:
 On 4/26/07, Amit K. Arora [EMAIL PROTECTED] wrote:
  /*
 + * ext4_ext_try_to_merge:
 + * tries to merge the ex extent to the next extent in the tree.
 + * It always tries to merge towards right. If you want to merge towards
 + * left, pass ex - 1 as argument instead of ex.
 + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
 + * 1 if they got merged.
 + */
 +int ext4_ext_try_to_merge(struct inode *inode,
 +   struct ext4_ext_path *path,
 +   struct ext4_extent *ex)
 +{
 
 Please either use proper kerneldoc format or drop
 ext4_ext_try_to_merge from the comment.

Ok, Thanks.
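
For reference, a kernel-doc style header on a simplified, user-space version of the merge loop might look like the sketch below. struct demo_extent and the demo_* functions are hypothetical stand-ins (host-endian, array based, and without the EXT_MAX_LEN limit); they are not the ext4 structures.

```c
#include <string.h>

/* Simplified, host-endian stand-in for struct ext4_extent. */
struct demo_extent {
	unsigned int block;		/* first logical block */
	unsigned int len;		/* number of blocks */
	unsigned long long pblock;	/* first physical block */
	int uninit;			/* non-zero if uninitialized */
};

/* Mergeable: same state, logically and physically contiguous. */
static int demo_can_merge(const struct demo_extent *a,
			  const struct demo_extent *b)
{
	if (!a->uninit != !b->uninit)
		return 0;
	return a->block + a->len == b->block &&
	       a->pblock + a->len == b->pblock;
}

/**
 * demo_try_to_merge() - merge an extent with its right-hand neighbours
 * @ext: array of extents, sorted by logical block
 * @count: in/out number of valid entries in @ext
 * @i: index of the extent to merge towards the right
 *
 * Return: 1 if at least one merge happened, 0 otherwise.
 */
static int demo_try_to_merge(struct demo_extent *ext, unsigned int *count,
			     unsigned int i)
{
	int merged = 0;

	while (i + 1 < *count && demo_can_merge(&ext[i], &ext[i + 1])) {
		ext[i].len += ext[i + 1].len;
		/* close the gap left by the absorbed extent */
		memmove(&ext[i + 1], &ext[i + 2],
			(*count - i - 2) * sizeof(*ext));
		(*count)--;
		merged = 1;
	}
	return merged;
}
```

To merge towards the left, the caller would pass i - 1 instead of i, mirroring the "pass ex - 1" advice in the original comment.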

--
Regards,
Amit Arora

[PATCH 4/5] ext4: fallocate support in ext4

2007-04-26 Thread Amit K. Arora

This patch has the ext4 implementation of fallocate system call.

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  201 +++-
 fs/ext4/file.c  |1 
 include/linux/ext4_fs.h |7 +
 include/linux/ext4_fs_extents.h |   13 ++
 4 files changed, 179 insertions(+), 43 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1107,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int depth, len1;
 
b1 = le32_to_cpu(newext->ee_block);
-   len1 = le16_to_cpu(newext->ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1181,9 +1193,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *ex, *fex;
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
-   int depth, len, err, next;
+   int depth, len, err, next, uninitialized = 0;
 
-   BUG_ON(newext->ee_len == 0);
+   BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);
ex = path[depth].p_ext;
BUG_ON(path[depth].p_hdr == NULL);
@@ -1191,14 +1203,23 @@ int ext4_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
ext_debug("append %d block to %d:%d (from %llu)\n",
-   le16_to_cpu(newext->ee_len),
+   ext4_ext_get_actual_len(newext),
le32_to_cpu(ex->ee_block),
-

[PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-04-26 Thread Amit K. Arora

This patch adds write support for preallocated (using fallocate system
call) blocks/extents. The preallocated extents in ext4 are marked
uninitialized, hence they need special handling especially while
writing to them. This patch takes care of that.
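
The special handling amounts to splitting the uninitialized extent around the range being written. The arithmetic can be sketched in user space as follows; struct demo_range and demo_split() are hypothetical names, and the sketch works on plain block ranges rather than on-disk extents. It assumes iblock lies inside the extent and max_blocks is non-zero.

```c
/* Logical block range [start, start + len). */
struct demo_range {
	unsigned int start;
	unsigned int len;
};

/*
 * Illustrative only: split the uninitialized extent `ext` around a write
 * that starts at `iblock` and covers up to `max_blocks` blocks.
 * left/right receive the parts that stay uninitialized (len 0 if absent),
 * mid receives the part that becomes initialized.
 * Returns the number of resulting extents (1, 2 or 3).
 */
static int demo_split(struct demo_range ext, unsigned int iblock,
		      unsigned int max_blocks, struct demo_range *left,
		      struct demo_range *mid, struct demo_range *right)
{
	/* blocks of the extent from iblock up to its end */
	unsigned int allocated = ext.len - (iblock - ext.start);
	unsigned int written = max_blocks < allocated ? max_blocks : allocated;

	left->start = ext.start;
	left->len = iblock - ext.start;
	mid->start = iblock;
	mid->len = written;
	right->start = iblock + written;
	right->len = allocated - written;

	return 1 + (left->len > 0) + (right->len > 0);
}
```

A write covering the whole extent yields one extent, a write at either end yields two, and a write in the middle yields three.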

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  228 +++-
 include/linux/ext4_fs_extents.h |1 
 2 files changed, 202 insertions(+), 27 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1141,6 +1141,51 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * ext4_ext_try_to_merge:
+ * tries to merge the ex extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass ex - 1 as argument instead of ex.
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+   struct ext4_ext_path *path,
+   struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done=0, uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+   while (ex < EXT_LAST_EXTENT(eh)) {
+   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+   break;
+   /* merge with next extent! */
+   if (ext4_ext_is_uninitialized(ex))
+   uninitialized = 1;
+   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+   + ext4_ext_get_actual_len(ex + 1));
+   if (uninitialized)
+   ext4_ext_mark_uninitialized(ex);
+
+   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+   len = (EXT_LAST_EXTENT(eh) - ex - 1)
+   * sizeof(struct ext4_extent);
+   memmove(ex + 1, ex + 2, len);
+   }
+   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
+   merge_done = 1;
+   BUG_ON(eh->eh_entries == 0);
+   }
+
+   return merge_done;
+}
+
+
+/*
  * ext4_ext_check_overlap:
  * check if a portion of the newext extent overlaps with an
  * existing extent.
@@ -1316,25 +1361,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-   while (nearex < EXT_LAST_EXTENT(eh)) {
-   if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-   break;
-   /* merge with next extent! */
-   if (ext4_ext_is_uninitialized(nearex))
-   uninitialized = 1;
-   nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-   + ext4_ext_get_actual_len(nearex + 1));
-   if (uninitialized)
-   ext4_ext_mark_uninitialized(nearex);
-
-   if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-   len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-   * sizeof(struct ext4_extent);
-   memmove(nearex + 1, nearex + 2, len);
-   }
-   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-   BUG_ON(eh->eh_entries == 0);
-   }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -1999,15 +2026,149 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * ext4_ext_convert_to_initialized:
+ * this function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (upto three). Atleast one initialized extent
+ * and atmost two uninitialized extents can result.
+ * There are three possibilities:
+ *   a> No split required: Entire extent should be initialized.
+ *   b> Split into two extents: Only one end of the extent is being written to.
+ *   c> Split into three extents: Somone is writing in middle of the extent.
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+   struct ext4_ext_path *path,
+   ext4_fsblk_t iblock,
+   unsigned long max_blocks)
+{
+   struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
+   struct ext4_extent_header *eh;
+   unsigned int allocated, ee_block, ee_len, depth;
+   ext4_fsblk_t newblock;
+   int err = 0, ret = 0;
+
+   depth = ext_depth(inode);
+   eh = path[depth].p_hdr;
+   ex = path[depth].p_ext;
+   ee_block = le32_to_cpu(ex->ee_block);
+

Re: Interface for the new fallocate() system call

2007-04-24 Thread Amit K. Arora

On Fri, Apr 20, 2007 at 10:59:18AM -0400, Jakub Jelinek wrote:
 On Fri, Apr 20, 2007 at 07:21:46PM +0530, Amit K. Arora wrote:
  Ok.
  In this case we may have to consider following things:
  
  1) Obviously, for this glibc will have to call fallocate() syscall with
  different arguments on s390, than other archs. I think this should be
  doable and should not be an issue with glibc folks (right?).
 
 glibc can cope with this easily, will just add
 sysdeps/unix/sysv/linux/s390/fallocate.c or something similar to override
 the generic Linux implementation.
 
  2) we also need to see how strace behaves in this case. With little
  knowledge that I have of strace, I don't think it should depend on
  argument ordering of a system call on different archs (since it uses
  ptrace internally and that should take care of it). But, it will be
  nice if someone can confirm this.
 
 strace would solve this with #ifdef mess, it already does that in many
 places so guess another few lines don't make it significantly worse.

I will work on the revised fallocate patchset and will post it soon.

Thanks!
--
Regards,
Amit Arora

Re: Interface for the new fallocate() system call

2007-04-20 Thread Amit K. Arora

On Wed, Apr 18, 2007 at 07:06:00AM -0600, Andreas Dilger wrote:
 On Apr 17, 2007  18:25 +0530, Amit K. Arora wrote:
  On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
   Wouldn't
   int fallocate(loff_t offset, loff_t len, int fd, int mode)
   work on both s390 and ppc/arm?  glibc will certainly wrap it and
   reorder the arguments as needed, so there is no need to keep fd first.
  
  I think more people are comfortable with this approach.
 
 Really?  I thought from the last postings that fd first, wrap on s390
 was better.
 
  Since glibc
  will wrap the system call and export the conventional interface
  (with fd first) to applications, we may not worry about keeping fd first
  in kernel code. I am personally fine with this approach.
 
 It would seem to make more sense to wrap the syscall on those architectures
 that can't handle the conventional interface (fd first).

Ok.
In this case we may have to consider following things:

1) Obviously, for this glibc will have to call fallocate() syscall with
different arguments on s390, than other archs. I think this should be
doable and should not be an issue with glibc folks (right?).

2) we also need to see how strace behaves in this case. With little
knowledge that I have of strace, I don't think it should depend on
argument ordering of a system call on different archs (since it uses
ptrace internally and that should take care of it). But, it will be
nice if someone can confirm this.

Thanks!
--
Regards,
Amit Arora

Re: Interface for the new fallocate() system call

2007-04-17 Thread Amit K. Arora

On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
 Wouldn't
 int fallocate(loff_t offset, loff_t len, int fd, int mode)
 work on both s390 and ppc/arm?  glibc will certainly wrap it and
 reorder the arguments as needed, so there is no need to keep fd first.


I think more people are comfortable with this approach. Since glibc
will wrap the system call and export the conventional interface
(with fd first) to applications, we may not worry about keeping fd first
in kernel code. I am personally fine with this approach.
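
A rough sketch of that wrapping follows. Both function names are hypothetical: demo_raw_fallocate() stands in for the kernel entry point with the (offset, len, fd, mode) order, and demo_fallocate() for the fd-first interface a C library could export. The raw function's body is a placeholder, not real allocation logic.

```c
/* Hypothetical raw entry point using the (offset, len, fd, mode) order. */
static long demo_raw_fallocate(long long offset, long long len,
			       int fd, int mode)
{
	/* Placeholder body: fail on invalid input, otherwise succeed. */
	(void)mode;
	if (fd < 0 || offset < 0 || len <= 0)
		return -1;
	return 0;
}

/* Conventional fd-first interface seen by applications. */
static long demo_fallocate(int fd, int mode, long long offset, long long len)
{
	/* The wrapper only reorders the arguments. */
	return demo_raw_fallocate(offset, len, fd, mode);
}
```

Applications would only ever see the second signature, so the kernel-side ordering stays an internal detail.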

Still, if people have major concerns, we can think of getting rid of the
mode argument itself. Anyhow we may, in future, need to have a policy
based system call (say, for providing the goal block by applications for
performance reasons). mode can then be made part of it.

Comments ?
--
Regards,
Amit Arora

Re: Interface for the new fallocate() system call

2007-04-05 Thread Amit K. Arora

On Thu, Apr 05, 2007 at 04:56:19PM +0530, Amit K. Arora wrote:

Correction below:

 asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode)
 {
 return sys_fallocate(fd, offset, len, mode);
  return sys_fallocate(fd, mode, offset, len);
 }

--
Regards,
Amit Arora

Re: Interface for the new fallocate() system call

2007-04-05 Thread Amit K. Arora

On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
 Wouldn't
 int fallocate(loff_t offset, loff_t len, int fd, int mode)
 work on both s390 and ppc/arm?  glibc will certainly wrap it and
 reorder the arguments as needed, so there is no need to keep fd first.
 
This should work on all the platforms. The only concern I can think of
here is the convention being followed till now, where the entity on
which the action has to be performed by the kernel (say fd, file/device
name, pid etc.) is the first argument of the system call. If we can live
with the small exception here, fine.

Or else, we may have to implement the 

  int fd, int mode, loff_t offset, loff_t len

as the layout of arguments here. I think only s390 will have a problem
with this, and we can think of a workaround for it (maybe similar to
what ARM did to implement the sync_file_range() system call):

asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode)
{
return sys_fallocate(fd, offset, len, mode);
}


To me both the approaches look slightly unconventional. But, we need to
compromise somewhere to make things work on all the platforms.

Any thoughts on which one of the above should we finalize on ?

Thanks!
--
Regards,
Amit Arora

[Patch 0/2] Persistent preallocation in ext4 (using fallocate inode op)

2007-03-21 Thread Amit K. Arora

On Mon, Mar 19, 2007 at 09:15:22AM -0800, Mingming Cao wrote:
 On Mon, 2007-03-19 at 10:48 -0500, Dave Kleikamp wrote:
  persistent_allocation_1_ioctl_and_unitialized_extents
 
 We could mention here that this patch is going to be replaced by a new
 patch to use the fallocate() operations.
 
  # Fixed an endian error
  persistent_allocation_2_support_for_writing_to_unitialized_extent
 
 I think Amit has an updated version of this patch in his place.

Hi Mingming,

Here are the new patches which use new fallocate inode interface. I have
made following changes to the previous patchset:

1. Removed ioctl portion of the code from the preallocation patch.
2. Added ext4_fallocate() to support the new fallocate inode operation,
which will be called from the sys_fallocate() system call code.
3. Fixed the endian error which you observed in the write support for
uninitialized extents patch.

--
Regards,
Amit Arora

Re: [Patch 1/2] preallocation patch (using fallocate inode op)

2007-03-21 Thread Amit K. Arora

This is the new preallocation patch, which implements ext4_fallocate() to do
the preallocation.

Signed-off-by: Amit K Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  201 +++-
 fs/ext4/file.c  |1 
 include/linux/ext4_fs.h |7 +
 include/linux/ext4_fs_extents.h |   13 ++
 4 files changed, 179 insertions(+), 43 deletions(-)

Index: linux-2.6.20.1/fs/ext4/extents.c
===
--- linux-2.6.20.1.orig/fs/ext4/extents.c
+++ linux-2.6.20.1/fs/ext4/extents.c
@@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1107,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int depth, len1;
 
b1 = le32_to_cpu(newext->ee_block);
-   len1 = le16_to_cpu(newext->ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1181,9 +1193,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *ex, *fex;
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
-   int depth, len, err, next;
+   int depth, len, err, next, uninitialized = 0;
 
-   BUG_ON(newext->ee_len == 0);
+   BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);
ex = path[depth].p_ext;
BUG_ON(path[depth].p_hdr == NULL);
@@ -1191,14 +1203,23 @@ int ext4_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
ext_debug("append %d block to %d:%d (from %llu)\n",
-   le16_to_cpu(newext->ee_len),
+   ext4_ext_get_actual_len(newext),
le32_to_cpu

[Patch 2/2] write support for uninitialized extents

2007-03-21 Thread Amit K. Arora

Here is the patch which supports writing to uninitialized extents. There
are no major changes to this patch, but it is being resubmitted to make sure
that it applies cleanly on top of the new preallocation patch, which has
been modified to implement the fallocate inode operation so that
preallocation can be done using the sys_fallocate() system call.
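
For readers following the hunks in this series, here is a small user-space
model of how an "uninitialized" flag can share the 16-bit length field with
the extent length, and how that affects merging. It is only a sketch: the
struct, the model_* names and the choice of bit 15 as the flag are
assumptions made for illustration (the patches call
ext4_ext_is_uninitialized(), ext4_ext_get_actual_len() and
ext4_ext_mark_uninitialized(), but their definitions are not shown here),
and the physical-contiguity check is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: bit 15 of the length field marks an uninitialized extent. */
#define MODEL_UNINIT_FLAG 0x8000u
#define MODEL_MAX_LEN     0x7fffu

struct model_extent {
    uint32_t block; /* first logical block covered */
    uint16_t len;   /* length, possibly with MODEL_UNINIT_FLAG set */
};

int model_is_uninitialized(const struct model_extent *ex)
{
    return (ex->len & MODEL_UNINIT_FLAG) != 0;
}

/* Length with the flag bit masked off. */
uint16_t model_actual_len(const struct model_extent *ex)
{
    return (uint16_t)(ex->len & MODEL_MAX_LEN);
}

void model_mark_uninitialized(struct model_extent *ex)
{
    ex->len = (uint16_t)(ex->len | MODEL_UNINIT_FLAG);
}

/*
 * Merge b into a when both are in the same state, logically adjacent,
 * and the combined length still fits below the flag bit.
 * Returns 1 if merged, 0 otherwise.
 */
int model_try_merge(struct model_extent *a, const struct model_extent *b)
{
    int uninit = model_is_uninitialized(a);
    uint32_t la = model_actual_len(a);
    uint32_t lb = model_actual_len(b);

    assert(la > 0 && lb > 0);
    if (uninit != model_is_uninitialized(b))
        return 0;
    if (a->block + la != b->block)
        return 0;
    if (la + lb > MODEL_MAX_LEN)
        return 0;

    a->len = (uint16_t)(la + lb);
    if (uninit)
        model_mark_uninitialized(a); /* restore the flag after rewriting len */
    return 1;
}
```

The point of the model is that every length comparison has to go through
the "actual length" accessor, which is why so many le16_to_cpu(ee_len)
calls are replaced in the preallocation patch.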

Signed-off-by: Amit K Arora [EMAIL PROTECTED]

---
 fs/ext4/extents.c   |  228 +++-
 include/linux/ext4_fs_extents.h |1 
 2 files changed, 202 insertions(+), 27 deletions(-)

Index: linux-2.6.20.1/fs/ext4/extents.c
===
--- linux-2.6.20.1.orig/fs/ext4/extents.c
+++ linux-2.6.20.1/fs/ext4/extents.c
@@ -1141,6 +1141,51 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * ext4_ext_try_to_merge:
+ * tries to merge the ex extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass ex - 1 as argument instead of ex.
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+   struct ext4_ext_path *path,
+   struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done=0, uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+   while (ex < EXT_LAST_EXTENT(eh)) {
+   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+   break;
+   /* merge with next extent! */
+   if (ext4_ext_is_uninitialized(ex))
+   uninitialized = 1;
+   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+   + ext4_ext_get_actual_len(ex + 1));
+   if (uninitialized)
+   ext4_ext_mark_uninitialized(ex);
+
+   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+   len = (EXT_LAST_EXTENT(eh) - ex - 1)
+   * sizeof(struct ext4_extent);
+   memmove(ex + 1, ex + 2, len);
+   }
+   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
+   merge_done = 1;
+   BUG_ON(eh->eh_entries == 0);
+   }
+
+   return merge_done;
+}
+
+
+/*
  * ext4_ext_check_overlap:
  * check if a portion of the newext extent overlaps with an
  * existing extent.
@@ -1316,25 +1361,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-   while (nearex < EXT_LAST_EXTENT(eh)) {
-   if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-   break;
-   /* merge with next extent! */
-   if (ext4_ext_is_uninitialized(nearex))
-   uninitialized = 1;
-   nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-   + ext4_ext_get_actual_len(nearex + 1));
-   if (uninitialized)
-   ext4_ext_mark_uninitialized(nearex);
-
-   if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-   len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-   * sizeof(struct ext4_extent);
-   memmove(nearex + 1, nearex + 2, len);
-   }
-   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-   BUG_ON(eh->eh_entries == 0);
-   }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -1999,15 +2026,149 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * ext4_ext_convert_to_initialized:
+ * this function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three). At least one initialized extent
+ * and at most two uninitialized extents can result.
+ * There are three possibilities:
+ *   a> No split required: Entire extent should be initialized.
+ *   b> Split into two extents: Only one end of the extent is being written to.
+ *   c> Split into three extents: Someone is writing in middle of the extent.
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+   struct ext4_ext_path *path,
+   ext4_fsblk_t iblock,
+   unsigned long max_blocks)
+{
+   struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
+   struct ext4_extent_header *eh;
+   unsigned int allocated, ee_block, ee_len, depth;
+   ext4_fsblk_t newblock;
+   int err = 0, ret = 0;
+
+   depth = ext_depth(inode);
+   eh

Re: [RFC][PATCH] sys_fallocate() system call

2007-03-21 Thread Amit K. Arora

On Sat, Mar 17, 2007 at 05:10:37AM -0600, Matthew Wilcox wrote:
 How about:
 
 asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
   u32 len_low, u32 len_high);
 
 That way we all suffer equally ...

As suggested by you and Russel, I have made this change to the patch.
Here is how it looks now. Please let me know if anyone has concerns
about passing arguments this way (breaking each loff_t into two u32s).
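
As a side note on the split-argument scheme, the two 32-bit halves have to
be widened before shifting, otherwise the high half is shifted out of a
32-bit value. The following user-space sketch shows one safe way to
rebuild the 64-bit value and apply the same "len == 0 or offset negative"
rejection as the patch. The names combine_halves() and args_are_valid()
are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(int64_t) == 8, "a 64-bit signed type is required");

/*
 * Rebuild a signed 64-bit value from its low and high 32-bit halves.
 * The high half is widened to 64 bits before the shift. The final
 * conversion to int64_t wraps for values >= 2^63 on the usual
 * two's-complement targets, which yields a negative offset.
 */
int64_t combine_halves(uint32_t low, uint32_t high)
{
    return (int64_t)(((uint64_t)high << 32) | low);
}

/* Returns 1 if the request passes the basic checks, 0 otherwise. */
int args_are_valid(uint32_t off_low, uint32_t off_high,
                   uint32_t len_low, uint32_t len_high)
{
    int64_t offset = combine_halves(off_low, off_high);
    int64_t len = combine_halves(len_low, len_high);

    return len != 0 && offset >= 0;
}
```

For example, combine_halves(0, 1) gives 4294967296, i.e. a 4GB offset
passed as (off_low = 0, off_high = 1).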

Signed-off-by: Amit K Arora [EMAIL PROTECTED]
---
 arch/i386/kernel/syscall_table.S |1 
 arch/x86_64/kernel/functionlist  |1 
 fs/open.c|   46 +++
 include/asm-i386/unistd.h|3 +-
 include/asm-powerpc/systbl.h |1 
 include/asm-powerpc/unistd.h |3 +-
 include/asm-x86_64/unistd.h  |4 ++-
 include/linux/fs.h   |7 +
 include/linux/syscalls.h |2 +
 9 files changed, 65 insertions(+), 3 deletions(-)

Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_fallocate /* 320 */
Index: linux-2.6.20.1/fs/open.c
===
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,52 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
+   u32 len_low, u32 len_high)
+{
+   struct file *file;
+   struct inode *inode;
+   loff_t offset, len;
+   long ret = -EINVAL;
+
+   offset = (off_high << 32) + off_low;
+   len = (len_high << 32) + len_low;
+
+   if (len == 0 || offset < 0)
+   goto out;
+
+   ret = -EBADF;
+   file = fget(fd);
+   if (!file)
+   goto out;
+   if (!(file->f_mode & FMODE_WRITE))
+   goto out_fput;
+
+   inode = file->f_path.dentry->d_inode;
+
+   ret = -ESPIPE;
+   if (S_ISFIFO(inode->i_mode))
+   goto out_fput;
+
+   ret = -ENODEV;
+   if (!S_ISREG(inode->i_mode))
+   goto out_fput;
+
+   ret = -EFBIG;
+   if (offset + len > inode->i_sb->s_maxbytes)
+   goto out_fput;
+
+   if (inode->i_op && inode->i_op->fallocate)
+   ret = inode->i_op->fallocate(inode, mode, offset, len);
+   else
+   ret = -ENOSYS;
+out_fput:
+   fput(file);
+out:
+   return ret;
+}
+EXPORT_SYMBOL(sys_fallocate);
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.20.1/include/asm-i386/unistd.h
===
--- linux-2.6.20.1.orig/include/asm-i386/unistd.h
+++ linux-2.6.20.1/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages317
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
+#define __NR_fallocate 320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/linux/fs.h
===
--- linux-2.6.20.1.orig/include/linux/fs.h
+++ linux-2.6.20.1/include/linux/fs.h
@@ -263,6 +263,12 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE  2
 #define SYNC_FILE_RANGE_WAIT_AFTER 4
 
+/*
+ * fallocate() modes
+ */
+#define FA_ALLOCATE0x1
+#define FA_DEALLOCATE  0x2
+
 #ifdef __KERNEL__
 
 #include <linux/linkage.h>
@@ -1124,6 +1130,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+   int (*fallocate)(struct inode *, int, loff_t, loff_t);
 };
 
 struct seq_file;
Index: linux-2.6.20.1/include/linux/syscalls.h
===
--- linux-2.6.20.1.orig/include/linux/syscalls.h
+++ linux-2.6.20.1/include/linux/syscalls.h
@@ -602,6 +602,8 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct 
getcpu_cache __user *cache);
+asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
+   u32 len_low, u32 len_high);
 
 int kernel_execve(const char *filename, char *const argv[], char *const 
envp[]);
 
Index

Re: [RFC][PATCH] sys_fallocate() system call

2007-03-19 Thread Amit K. Arora

On Sat, Mar 17, 2007 at 04:33:50PM +1100, Stephen Rothwell wrote:
 On Fri, 16 Mar 2007 20:01:01 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
 
  +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
  --- linux-2.6.20.1.orig/include/asm-powerpc/systbl.h
  +++ linux-2.6.20.1/include/asm-powerpc/systbl.h
  @@ -305,3 +305,4 @@ SYSCALL_SPU(faccessat)
   COMPAT_SYS_SPU(get_robust_list)
   COMPAT_SYS_SPU(set_robust_list)
   COMPAT_SYS(move_pages)
  +SYSCALL(fallocate)
 
 It is going to need to be a COMPAT_SYS call in powerpc because 32 bit
 powerpc will pass the two loff_t's in pairs of registers while
 64bit passes them in one register each.

Ok. Will make that change, unless it is decided to pass each loff_t
argument as two u32s. Thanks!

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC] Heads up on sys_fallocate()

2007-03-01 Thread Amit K. Arora

This is to give a heads up on a few patches that we will soon be
posting. These patches implement a new system call, sys_fallocate(), and a
new inode operation, fallocate, for persistent preallocation. The new
system call, as Andrew suggested, will look like:

  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

As we are developing and testing the required patches, we decided to
post a preliminary patch and get input from the community to give it
the right direction and shape. First, a short description of the feature.
 
Persistent preallocation is a file system feature with which an
application (say, a relational database server) can explicitly
preallocate blocks to a particular file. This feature can be used to
reserve space for a file, mainly to get the following benefits:
1> contiguity - less fragmentation and thus faster access, and
2> a guarantee of minimum space availability (depending on how many
blocks were preallocated) for the file, even if the filesystem becomes
full.
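
To make the use case concrete, here is a minimal user-space sketch of an
application reserving space up front. It uses posix_fallocate(), the
existing POSIX interface, rather than the system call proposed in this
mail, whose final form is still under discussion. The helper name
reserve_space() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve `len` bytes for the file at `path`, starting at offset 0.
 * Returns 0 on success or a positive error number on failure, which is
 * also how posix_fallocate() itself reports errors (it does not set errno).
 */
int reserve_space(const char *path, off_t len)
{
    int fd, err;

    assert(path != NULL && len > 0);

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return errno;

    err = posix_fallocate(fd, 0, len);
    close(fd);
    return err;
}
```

A database server would typically call something like this once when a
data file is created, so that later writes into the reserved range
cannot fail with ENOSPC.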

XFS already has an implementation of this, using an ioctl interface, and
ext4 is now gaining the feature as well. In time we may see a few
more file systems implementing it. Thus, it makes sense to have a more
standard interface for this, like this new system call.

Here is the initial, incomplete version of the patch, which can be
used for discussion until we come up with a more complete set of
patches.

---
 arch/i386/kernel/syscall_table.S |1 +
 fs/ext4/file.c   |1 +
 fs/open.c|   18 ++
 include/asm-i386/unistd.h|3 ++-
 include/linux/fs.h   |1 +
 include/linux/syscalls.h |1 +
 6 files changed, 24 insertions(+), 1 deletion(-)

Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_fallocate /* 320 */
Index: linux-2.6.20.1/fs/ext4/file.c
===
--- linux-2.6.20.1.orig/fs/ext4/file.c
+++ linux-2.6.20.1/fs/ext4/file.c
@@ -135,5 +135,6 @@ struct inode_operations ext4_file_inode_
.removexattr= generic_removexattr,
 #endif
.permission = ext4_permission,
+   .fallocate  = ext4_fallocate,
 };
 
Index: linux-2.6.20.1/fs/open.c
===
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,24 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+   file = fget(fd);
+   if (!file)
+   goto out;
+   inode = file->f_path.dentry->d_inode;
+   if (inode->i_op && inode->i_op->fallocate)
+   ret = inode->i_op->fallocate(inode, offset, len);
+   else
+   ret = -ENOTTY;
+   fput(file);
+out:
+return ret;
+}
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.20.1/include/asm-i386/unistd.h
===
--- linux-2.6.20.1.orig/include/asm-i386/unistd.h
+++ linux-2.6.20.1/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages317
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
+#define __NR_fallocate 320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/linux/fs.h
===
--- linux-2.6.20.1.orig/include/linux/fs.h
+++ linux-2.6.20.1/include/linux/fs.h
@@ -1124,6 +1124,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+   long (*fallocate)(struct inode *, loff_t, loff_t);
 };
 
 struct seq_file;
Index: linux-2.6.20.1/include/linux/syscalls.h
===
--- linux-2.6.20.1.orig/include/linux/syscalls.h
+++ linux-2.6.20.1/include/linux/syscalls.h
@@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct 
getcpu_cache __user *cache);
+asmlinkage long

Testing ext4 persistent preallocation patches for 64 bit features

2007-02-06 Thread Amit K. Arora

I plan to test the persistent preallocation patches on a huge sparse
device, to check whether >32 bit physical block numbers (up to 48 bit)
behave as expected. I have the following questions and would appreciate
suggestions:

a) What sparse device size should I use for testing?
Would a size of > 8TB (say, 100 TB) be enough?
The physical (backing store) device I can have is at most 70GB.

b) How do I test allocation of >32 bit physical block numbers? I can
not fill > 8TB, since the physical storage available to me is just
70GB.

c) Do I need to put some hack in the filesystem code for the above (to
allocate >32 bit physical block numbers)?

Any further ideas on how to test this will help. Thanks!

--
Regards,
Amit Arora

[Patch 1/2] ioctl and uninitialized extents

2007-01-17 Thread Amit K. Arora

This patch implements the ioctl which may be used for persistent
preallocation of blocks to an extent enabled file in ext4.

Signed-off-by: Amit Arora [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |  125 ++--
 fs/ext4/ioctl.c |   69 ++
 include/linux/ext4_fs.h |   13 
 include/linux/ext4_fs_extents.h |   13 
 4 files changed, 177 insertions(+), 43 deletions(-)

Index: linux-2.6.20-rc5/fs/ext4/extents.c
===
--- linux-2.6.20-rc5.orig/fs/ext4/extents.c
+++ linux-2.6.20-rc5/fs/ext4/extents.c
@@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+   ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1107,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int depth, len1;
 
b1 = le32_to_cpu(newext->ee_block);
-   len1 = le16_to_cpu(newext->ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1181,9 +1193,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *ex, *fex;
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
-   int depth, len, err, next;
+   int depth, len, err, next, uninitialized = 0;
 
-   BUG_ON(newext->ee_len == 0);
+   BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);
ex = path[depth].p_ext;
BUG_ON(path[depth].p_hdr == NULL);
@@ -1191,14 +1203,23 @@ int ext4_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
ext_debug("append %d block to %d:%d (from %llu)\n",
-   le16_to_cpu(newext->ee_len),
+

Re: [RFC][Patch 1/2] Persistent preallocation in ext4

2007-01-09 Thread Amit K. Arora

On Tue, Jan 02, 2007 at 04:34:09PM +0530, Amit K. Arora wrote:
 On Wed, Dec 27, 2006 at 03:30:44PM -0800, Mingming Cao wrote:
  Since the API takes the number of bytes to preallocate, at return time,
  shall we convert the blocks to bytes to the user?
 
  Here it returns the number of allocated blocks to the user.   Do we need
  to worry about the case when dealing with a range with partial hole and
  partial blocks already allocated? In that case nblocks(the new
  preallocated blocks) will less than the maxblocks (the number of blocks
  asked by application).  I am wondering what does other filesystem like
  xfs do? Maybe we should do the same thing.
 
 I think xfs just returns 0 on success, and errno on an error. Do we
 want to keep the same behavior here ? Or, should we return the number of
 bytes preallocated ?

We still need to decide what the ioctl should return. Should it
return zero on success and errno on error, the way posix_fallocate and
xfs behave? If yes, then should we undo any partial preallocation
in case of an error (say ENOSPC)?

If no, then should we return the number of bytes preallocated? In this
case we have to think about the situation Mingming mentioned above (i.e.
when the preallocation request spans partly a hole and partly a few
already allocated blocks).

--
Regards,
Amit Arora

Re: [PATCH 1/1 version2] Extent overlap bugfix in ext4

2007-01-05 Thread Amit K. Arora

On Thu, Jan 04, 2007 at 10:50:00AM -0800, Mingming Cao wrote:
 Hi, Amit,
Hi Mingming,
 
 Have you looked at ext4_ext_walk_space()? It calculates the right extent
 length to allocate, to avoid overlap, before the block allocation
 callback function is called.
Yes. More on this below...
 
 Amit K. Arora wrote:
  /*
 + * ext4_ext_check_overlap:
 + * check if a portion of the newext extent overlaps with an
 + * existing extent.
 + *
 + * If there is an overlap discovered, it returns the (logical) block
 + * number of the first block in the next extent (the existing extent
 + * which covers few of the new requested blocks)
 + * If there is no overlap found, it returns 0.
 + */
 
 What if the start logical block of the existing extent is 0 and there is
 overlap? I think that is possible. For example, the existing extent is
 (0,100) and you want to insert new extent (0,500), this will certainly
 fail to report the overlap.
As Alex mentioned, this case is taken care of by ext4_ext_get_blocks().
 
 +unsigned int ext4_ext_check_overlap(struct inode *inode,
 
 We shall be consistent with the other data type used for logical blocks,
 which right now is unsigned long. Probably replace that with ext4_fsblk_t type
 when that cleanup is introduced.
Ok, will use unsigned long.

 
 +struct ext4_extent *newext,
 +struct ext4_ext_path *path)
 +{
 +unsigned int depth, b1, len1, b2;
 +
 unsigned long type for b1 and b2.
Ok.
 
 +b1 = le32_to_cpu(newext->ee_block);
 +len1 = le16_to_cpu(newext->ee_len);
 +depth = ext_depth(inode);
 +if (!path[depth].p_ext)
 +goto out;
 +b2 = le32_to_cpu(path[depth].p_ext->ee_block);
 +
 +/* get the next allocated block if the extent in the path
 + * is before the requested block(s) */
 +if (b2 < b1) {
 +b2 = ext4_ext_next_allocated_block(path);
 +if (b2 == EXT_MAX_BLOCK)
 +goto out;
 +}
 +
 +if (b1 + len1 > b2)
 +return b2;
 +out:
 +return 0;
 +}
 +
 
 Since this overlap check function is called inside
 ext4_ext_insert_extent(), I think this function should check for all
 kinds of overlaps. Here you only check if the new extent is overlap with
 the next extent. Looking at ext4_ext_walk_space(), there are total three
 kinds of overlaps:
 1) right part of new extent overlaps with path->p_ext,
 2) left part of new extent overlaps with path->p_ext
 3) right part of new extent overlaps with next extent

I think all three conditions above are being checked. The second
condition is taken care of by ext4_ext_get_blocks(), and the other
two checks are made in ext4_ext_check_overlap().
check_overlap() first checks whether the right portion of the new extent
overlaps with path->p_ext. Only if it does not, it checks for an
overlap with the next extent.

 
 I think we are almost repeating the same logic in ext4_ext_walk_space()
 here.

I understand that some portion of the logic in ext4_ext_walk_space() is
being duplicated here in check_overlap(). But if we have to use
walk_space(), we will need to write a new helper function, which will
result in some duplicate code in get_blocks() and
ext4_wb_handle_extent() (like calling ext4_new_blocks() and then
insert_extent()) as well, unless ext4_wb_handle_extent() is modified to match
our requirement of persistent preallocation. I am not sure how
complicated that would be, or whether it is worth it.

 
 +/*
   * ext4_ext_insert_extent:
   * tries to merge requsted extent into the existing extent or
   * inserts requested extent as new one into the tree,
 @@ -1133,12 +1170,25 @@ int ext4_ext_insert_extent(handle_t *han
  struct ext4_extent *nearex; /* nearest extent */
  struct ext4_ext_path *npath = NULL;
  int depth, len, err, next;
 +unsigned int oblock;
 
 unsigned long type for oblock
Ok.
 
  BUG_ON(newext->ee_len == 0);
  depth = ext_depth(inode);
  ex = path[depth].p_ext;
  BUG_ON(path[depth].p_hdr == NULL);
 
 +/* check for overlap */
 +oblock = ext4_ext_check_overlap(inode, newext, path);
 +if (oblock) {
 +printk(KERN_ERR "ERROR: newext=%u/%u overlaps with an "
 +"existing extent, which starts with %u\n",
 +le32_to_cpu(newext->ee_block),
 +le16_to_cpu(newext->ee_len),
 +oblock);
 +ext4_ext_show_leaf(inode, path);
 +BUG();
 +}
 
 How about return true or false from ext4_ext_check_overlap()? Inside
 that function put the correct new extent logical block number and extent
 length that safe to insert? Afterall the returning oblock is used in
 ext4_ext_get_blocks() to calculate the safe extent to allocate.
Ok.
 
 +
  /* try to insert block into found extent and return */
  if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
  ext_debug("append %d block to %d:%d (from %llu)\n",
 @@ -1984,6 +2034,10

Re: [PATCH 1/1] Extent overlap bugfix in ext4

2007-01-04 Thread Amit K. Arora

On Wed, Jan 03, 2007 at 10:07:01AM -0800, Mingming Cao wrote:
 Alex Tomas wrote:
 I think that stuff that converts uninitialized blocks
 to initialized ones should be a separate codepath and
 shouldn't be done in the insert path. and an insert
 (basic tree manipulation) should BUG_ON() one tries
 to add extent with a block which is already covered
 by the tree.
 
 IMHO, get_blocks() should look like:
 
   path = find_path()
   if (found extent covers request block(s)) {
 if (extent is uninitialized) {
   convert();
 }
   }
 
 where
function convert()
   {
 /* adopt existing extent so that it
  * doesn't cover requested blocks */
 
 /* insert head or tail of existing
  * extent, if necessary */
 
 /* insert new extent of initialized blocks */
   }
 
 thanks, Alex
 
 I was thinking about the same thing. The current ext4_ext_get_blocks()
 function becomes very bulky. The code to convert uninitialized blocks to
 initialized ones is pretty self-contained, and worth the effort to put it
 into a separate function.

Ok. I will move this code to a new function.
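
To illustrate what such a separate conversion step has to compute, here
is a user-space sketch of the range arithmetic only: given an
uninitialized extent and a write range that starts inside it, produce
the one to three resulting pieces. The struct and the function name
split_for_write() are hypothetical, and no tree manipulation,
journalling or error handling is modelled.

```c
#include <assert.h>

struct piece {
    unsigned long start; /* first logical block of the piece */
    unsigned long len;   /* number of blocks */
    int uninit;          /* 1 = still uninitialized, 0 = initialized */
};

/*
 * Split the uninitialized extent [ee_block, ee_block + ee_len) for a
 * write to [iblock, iblock + count). The write must start inside the
 * extent; any part of it beyond the extent's end is ignored here.
 * Fills out[] (room for three pieces) and returns the piece count.
 */
int split_for_write(unsigned long ee_block, unsigned long ee_len,
                    unsigned long iblock, unsigned long count,
                    struct piece out[3])
{
    unsigned long ee_end = ee_block + ee_len;
    unsigned long w_end = iblock + count;
    int n = 0;

    assert(ee_len > 0 && count > 0);
    assert(iblock >= ee_block && iblock < ee_end);

    if (w_end > ee_end)
        w_end = ee_end; /* clamp the write to this extent */

    if (iblock > ee_block) /* head stays uninitialized */
        out[n++] = (struct piece){ ee_block, iblock - ee_block, 1 };

    /* the written blocks become initialized */
    out[n++] = (struct piece){ iblock, w_end - iblock, 0 };

    if (w_end < ee_end) /* tail stays uninitialized */
        out[n++] = (struct piece){ w_end, ee_end - w_end, 1 };

    return n;
}
```

Writing one block in the middle of a 100-block extent starting at block
100 (iblock = 150, count = 1) gives three pieces: 100..149
uninitialized, 150 initialized, 151..199 uninitialized.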

--
Regards,
Amit Arora

Re: [PATCH 1/1 version2] Extent overlap bugfix in ext4

2007-01-04 Thread Amit K. Arora

On Thu, Jan 04, 2007 at 01:39:24PM +0300, Alex Tomas (AT) wrote:
  Amit K Arora (AKA) writes:
 
  AKA +int ext4_ext_check_overlap(struct inode *inode,
  AKA +   struct ext4_extent *newext,
  AKA +   unsigned long *block)
  AKA +{
  AKA +   struct ext4_ext_path *path;
  AKA +   unsigned int depth, b1, len1;
  AKA +   int ret = 0;
  AKA +
  AKA +   b1 = le32_to_cpu(newext->ee_block);
  AKA +   len1 = le16_to_cpu(newext->ee_len);
  AKA +   path = ext4_ext_find_extent(inode, b1, NULL);
  AKA +   if (IS_ERR(path)) {
  AKA +   ret = PTR_ERR(path);
  AKA +   goto out;
  AKA +   }
  AKA +   depth = ext_depth(inode);
  AKA +   BUG_ON(path[depth].p_ext == NULL && depth != 0);
  AKA +
  AKA +   *block = ext4_ext_next_allocated_block(path);
  AKA +   if (*block == EXT_MAX_BLOCK)
  AKA +   goto out;
  AKA +
  AKA +   if (b1 + len1 > *block)
  AKA +   ret = 1;
  AKA +out:
  AKA +   return ret;
  AKA +}
 
 AT I'm also not sure we need ext4_ext_find_extent() here.
Do you mean ext4_ext_next_allocated_block() above ? We anyhow have to
call find_extent() to get the possible neighbouring extent.

 AT there are two possibilities:
 
 AT 1) extent in found path covers block(s) before requested ones
 ATthen ext4_ext_next_allocated_block(path) can be used
 
 AT 2) extent in found path covers block(s) after request ones
 ATthen ee_block from that extent can be used.

You are right. This will be true when the requested block(s) lie within a
hole that starts at the beginning of the file, i.e.,
find_blocks() will return the extent after the requested block(s). In all
other cases, it will return the extent before the requested block(s)
(assuming there is no existing extent which covers the start of the
requested blocks).

I will change the code accordingly to handle this corner case. Thanks for
pointing this out!
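
The two cases discussed above can be modelled in user space as follows.
This is only a sketch with plain integers instead of the extent tree:
the function name overlap_on_right(), the have_found flag and the
NO_NEXT_BLOCK constant (standing in for EXT_MAX_BLOCK) are assumptions
for illustration.

```c
#include <assert.h>

#define NO_NEXT_BLOCK 0xffffffffUL /* stand-in for EXT_MAX_BLOCK */

/*
 * b1, len1:     start and length of the new extent.
 * found_start:  first logical block of the extent the lookup returned
 *               (only meaningful when have_found is non-zero).
 * next_start:   first block of the extent after that one, or
 *               NO_NEXT_BLOCK if there is none.
 * Returns 1 and stores the first overlapping block in *first_overlap
 * if the new extent runs into an existing one, else returns 0.
 */
int overlap_on_right(unsigned long b1, unsigned long len1,
                     int have_found, unsigned long found_start,
                     unsigned long next_start,
                     unsigned long *first_overlap)
{
    unsigned long b2;

    assert(len1 > 0 && first_overlap != 0);
    if (!have_found)
        return 0;

    b2 = found_start;
    if (b2 < b1) {
        /* case 1: found extent lies before the request, use the next one */
        b2 = next_start;
        if (b2 == NO_NEXT_BLOCK)
            return 0;
    }
    /* case 2 (b2 >= b1): found extent itself is the right-hand neighbour */

    if (b1 + len1 > b2) {
        *first_overlap = b2;
        return 1;
    }
    return 0;
}
```

Returning a flag and passing the block number back through a pointer
follows the suggestion made earlier in the thread, and avoids the
ambiguity of using 0 to mean "no overlap".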

--
Regards,
Amit Arora

Re: [PATCH 1/1 version2] Extent overlap bugfix in ext4

2007-01-04 Thread Amit K. Arora

On Thu, Jan 04, 2007 at 11:47:36AM -0800, Mingming Cao (MC) wrote:
 Alex Tomas (AT) wrote:
 
 Amit K Arora (AKA) writes:
 
 
   AKA @@ -1984,6 +2034,10 @@ int ext4_ext_get_blocks(handle_t *handle
   AKA   */
   AKA  if (ee_len > EXT_MAX_LEN)
   AKA  goto out2;
   AKA +
   AKA +if (iblock < ee_block && iblock + max_blocks >= 
   ee_block)
   AKA +allocated = ee_block - iblock;
   AKA +
   AKA  /* if found extent covers block, simply return it */
   AKA  if (iblock >= ee_block && iblock < ee_block + 
   ee_len) {
   AKA  newblock = iblock - ee_block + ee_start;
  
  AT I thought existing code already does this:
 
  AT   /* if found extent covers block, simply return it */
  AT   if (iblock >= ee_block && iblock < ee_block + ee_len) {
  AT   newblock = iblock - ee_block + ee_start;
  AT   /* number of remaining blocks in the extent */
  AT   allocated = ee_len - (iblock - ee_block);
 MC That's different: the existing code addresses the case when the left part
 MC of the new extent overlaps with an existing extent, in that case I
 MC understand it just returns the allocated part of extent, and continues
 MC the block allocation in the next call of get_blocks().
Right.
 
 MC Well, Amit's new code here is trying to address the case when the right part
 MC of the new extent overlaps with an existing extent. He was trying to
 MC update the new extent length to prevent that. As I mentioned earlier we
 MC could put this code into ext4_ext_check_overlap, let it judge whether
 MC there is overlap, and if so, what's the right start block number and 
 length
Yes, this check will no longer be required with the modified
ext4_ext_check_overlap, which will check for this condition as well.

--
Regards,
Amit Arora
 
 Thanks,
 Mingming

[PATCH 1/1] Extent overlap bugfix in ext4

2007-01-02 Thread Amit K. Arora

The ext4_ext_get_blocks() and ext4_ext_insert_extent() routines do not
check for extent overlap when a new extent needs to be inserted in an
inode. An overlap is possible when the ee_block of the new extent is not
part of any existing extent, but the tail/center portion of the new
extent _is_. This is possible only when we are writing/preallocating
blocks across a hole.
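
To make the overlap condition concrete, here is a minimal userspace sketch
of the check described above. It is an illustration only, not the kernel
code: trim_to_next_extent() and NO_NEXT_EXTENT are hypothetical names, and
the caller is assumed to already know that "start" lies in a hole (so
start < next_start whenever a next extent exists).

```c
#include <stdint.h>

/* Hypothetical sentinel standing in for "no extent to the right". */
#define NO_NEXT_EXTENT UINT32_MAX

/*
 * Given a new extent of `len` blocks starting at logical block `start`,
 * and the logical block `next_start` at which the next already-allocated
 * extent begins, return how many blocks can be inserted without
 * overlapping that extent.
 */
uint32_t trim_to_next_extent(uint32_t start, uint32_t len,
                             uint32_t next_start)
{
    if (next_start == NO_NEXT_EXTENT)
        return len;                 /* nothing to the right of the hole */
    if ((uint64_t)start + len > next_start)
        return next_start - start;  /* tail would overlap: cut it off */
    return len;                     /* fits entirely inside the hole */
}
```

For example, a request for 20 blocks at block 10 with the next extent
starting at block 25 would be trimmed to 15 blocks; the remaining blocks
are then found already allocated on the next get_blocks() call.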

This problem was first seen while stress testing the persistent
preallocation patches that I posted earlier (using a modified fsx-linux
stress test). I am not able to reproduce this bug (extent overlap)
without the persistent preallocation patches, because a write through a
hole results in get_blocks() of a single block at a time. Still, I think
that it is an independent problem and should be solved with a separate
patch. Hence this patch.

Comments please. Thanks!

Signed-off-by: Amit Arora ([EMAIL PROTECTED])
---
 fs/ext4/extents.c   |   71 +---
 include/linux/ext4_fs_extents.h |1 
 2 files changed, 68 insertions(+), 4 deletions(-)

Index: linux-2.6.19.prealloc/fs/ext4/extents.c
===
--- linux-2.6.19.prealloc.orig/fs/ext4/extents.c	2007-01-02 14:21:57.0 +0530
+++ linux-2.6.19.prealloc/fs/ext4/extents.c	2007-01-02 14:22:00.0 +0530
@@ -1119,6 +1119,44 @@
 }
 
 /*
+ * ext4_ext_check_overlap:
+ * check if a portion of the newext extent overlaps with an
+ * existing extent.
+ */
+struct ext4_extent * ext4_ext_check_overlap(struct inode *inode,
+						struct ext4_extent *newext)
+{
+	struct ext4_ext_path *path;
+	struct ext4_extent *ex;
+	unsigned int depth, b1, b2, len1;
+
+	b1 = le32_to_cpu(newext->ee_block);
+	len1 = le16_to_cpu(newext->ee_len);
+	path = ext4_ext_find_extent(inode, b1, NULL);
+	if (IS_ERR(path))
+		return NULL;
+
+	depth = ext_depth(inode);
+	ex = path[depth].p_ext;
+	if (!ex)
+		return NULL;
+
+	b2 = ext4_ext_next_allocated_block(path);
+	if (b2 == EXT_MAX_BLOCK)
+		return NULL;
+	path = ext4_ext_find_extent(inode, b2, path);
+	if (IS_ERR(path))
+		return NULL;
+	BUG_ON(path[depth].p_hdr == NULL);
+	ex = path[depth].p_ext;
+
+	if (b1 + len1 > b2)
+		return ex;
+
+	return NULL;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -1129,7 +1167,7 @@
 				struct ext4_extent *newext)
 {
 	struct ext4_extent_header * eh;
-	struct ext4_extent *ex, *fex;
+	struct ext4_extent *ex, *fex, *rex;
 	struct ext4_extent *nearex; /* nearest extent */
 	struct ext4_ext_path *npath = NULL;
 	int depth, len, err, next;
@@ -1139,6 +1177,18 @@
 	ex = path[depth].p_ext;
 	BUG_ON(path[depth].p_hdr == NULL);
 
+	/* check for overlap */
+	rex = ext4_ext_check_overlap(inode, newext);
+	if (rex) {
+		printk(KERN_ERR "ERROR: ex=%u/%u overlaps newext=%u/%u\n",
+			le32_to_cpu(rex->ee_block),
+			le16_to_cpu(rex->ee_len),
+			le32_to_cpu(newext->ee_block),
+			le16_to_cpu(newext->ee_len));
+		ext4_ext_show_leaf(inode, path);
+		BUG();
+	}
+
 	/* try to insert block into found extent and return */
 	if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
 		ext_debug("append %d block to %d:%d (from %llu)\n",
@@ -1921,7 +1971,7 @@
 			int create, int extend_disksize)
 {
 	struct ext4_ext_path *path = NULL;
-	struct ext4_extent newex, *ex;
+	struct ext4_extent newex, *ex, *ex2;
 	ext4_fsblk_t goal, newblock;
 	int err = 0, depth;
 	unsigned long allocated = 0;
@@ -1984,6 +2034,10 @@
 		 */
 		if (ee_len > EXT_MAX_LEN)
 			goto out2;
+
+		if (iblock < ee_block && iblock + max_blocks >= ee_block)
+			allocated = ee_block - iblock;
+
 		/* if found extent covers block, simply return it */
 		if (iblock >= ee_block && iblock < ee_block + ee_len) {
 			newblock = iblock - ee_block + ee_start;
@@ -2016,7 +2070,17 @@
 
 	/* allocate new block */
 	goal = ext4_ext_find_goal(inode, path, iblock);
-	allocated = max_blocks;
+
+	/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+	newex.ee_block = cpu_to_le32(iblock);
+	if (!allocated) {
+		newex.ee_len = cpu_to_le16(max_blocks);
+		ex2 = ext4_ext_check_overlap(inode, &newex);
+		if (ex2)
+

Re: [PATCH 1/1] Extent overlap bugfix in ext4

2007-01-02 Thread Amit K. Arora

On Tue, Jan 02, 2007 at 12:25:21PM +0300, Alex Tomas (AT) wrote:
> Amit K Arora (AKA) writes:
>
>  AKA> The ext4_ext_get_blocks() and ext4_ext_insert_extent() routines do not
>  AKA> check for extent overlap, when a new extent needs to be inserted in an
>  AKA> inode. An overlap is possible when the new extent being inserted has
>  AKA> ee_block that is not part of any of the existing extents, but the
>  AKA> tail/center portion of this new extent _is_. This is possible only when
>  AKA> we are writing/preallocating blocks across a hole.
>
> AT> not sure I understand ... you shouldn't insert an extent that overlaps
> AT> any existing extent. when you write block(s), you first check whether
> AT> they are already allocated and insert a new extent only if they are not.

You are right. That is what this patch does.
The current ext4 code is inserting an overlapped extent in a particular
scenario (explained above). The suggested patch fixes this by having a
check in get_blocks() for _not_ inserting an extent that may overlap
with an existing one.

> AT> for preallocated block(s), you should adapt existing extent(s) so that
> AT> they don't overlap new extent you're inserting. am I missing something?

The patch makes the new extent being inserted adjust its length based on any
existing extent that may overlap, so that the overlap does not happen at
all.

> AT> also, I think that modification of existing extent(s) (not merging)
> AT> isn't safe.

The existing extent(s) are not being modified in any way here. We check
if there is an overlap between the new extent being inserted by
get_blocks(), with an existing one. If there is, we update the new extent
(being inserted) accordingly. The existing extent is not touched (unless
the insert_extent() does a merge, if possible).

Please let me know if the intentions are still not clear here. Thanks!

Regards,
Amit Arora

Re: [RFC][Patch 1/2] Persistent preallocation in ext4

2007-01-02 Thread Amit K. Arora

Hi Mingming,

On Wed, Dec 27, 2006 at 03:30:44PM -0800, Mingming Cao wrote:
> looks good to me, a few comments :)

Thanks for your comments!

> .
> > +	ret = ext4_ext_get_blocks(handle, inode, block,
> > +			max_blocks, &map_bh,
> > +			EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +	if(ret > 0 && test_bit(BH_New, &map_bh.b_state))
> > +		nblocks = nblocks + ret;
> > +	}
>
> ext4_ext_get_blocks() returns 0 when it is mapping (not allocating) a
> hole. In our case we are allocating, so here it is not possible
> for ext4_ext_get_blocks() to return 0. I think we should quit the
> loop and BUG_ON if ret == 0 here.

Okay. I will add BUG_ON(!ret); just after get_blocks, above.

 
> > +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb,
> > +			&retries))
> > +		goto retry;
> > +
> > +	if(nblocks) {
> > +		mutex_lock(&inode->i_mutex);
> > +		inode->i_size = inode->i_size + (nblocks << blkbits);
> > +		EXT4_I(inode)->i_disksize = inode->i_size;
> > +		mutex_unlock(&inode->i_mutex);
> > +	}
>
> Hmm... We should not need to worry about the inode->i_size if we are
> preallocating blocks for holes.

You are right. Will take care of this.
 
> And, looking at other places calling ext4_*_get_blocks() in the kernel,
> it seems not all of them are protected by the i_mutex lock. I think it is
> probably okay not to hold i_mutex while calling ext4_ext_get_blocks().

We are not holding the i_mutex lock during the ext4_ext_get_blocks() call.
Above, this lock is held in order to avoid a race while updating the
file size in the inode (see your comment in a previous mail: "I think we
should update i_size and i_disksize after preallocation. Oh,
to protect parallel updating i_size, we have to take i_mutex down.").
Perhaps the truncate_mutex lock should be used here, and not i_mutex.

 
> > +
> > +	ext4_mark_inode_dirty(handle, inode);
> > +	ret2 = ext4_journal_stop(handle);
> > +	if(ret > 0)
> > +		ret = ret2;
> > +
> > +	return ret > 0 ? nblocks : ret;
> > +	}
> > +
>
> Since the API takes the number of bytes to preallocate, at return time,
> shall we convert the blocks to bytes for the user?
>
> Here it returns the number of allocated blocks to the user.  Do we need
> to worry about the case of a range that is partially a hole and
> partially already allocated? In that case nblocks (the newly
> preallocated blocks) will be less than max_blocks (the number of blocks
> asked for by the application).  I am wondering what other filesystems
> like xfs do? Maybe we should do the same thing.

I think xfs just returns 0 on success, and errno on an error. Do we
want to keep the same behavior here ? Or, should we return the number of
bytes preallocated ?

Thanks!

--
Regards,
Amit Arora

Re: [PATCH 1/1] Extent overlap bugfix in ext4

2007-01-02 Thread Amit K. Arora

On Tue, Jan 02, 2007 at 05:35:28PM -0800, Mingming Cao wrote:
> > +struct ext4_extent * ext4_ext_check_overlap(struct inode *inode,
> > +						struct ext4_extent *newext)
> > +{
> > +	struct ext4_ext_path *path;
> > +	struct ext4_extent *ex;
> > +	unsigned int depth, b1, b2, len1;
> > +
> > +	b1 = le32_to_cpu(newext->ee_block);
> > +	len1 = le16_to_cpu(newext->ee_len);
> > +	path = ext4_ext_find_extent(inode, b1, NULL);
> > +	if (IS_ERR(path))
> > +		return NULL;
> > +
> > +	depth = ext_depth(inode);
> > +	ex = path[depth].p_ext;
> > +	if (!ex)
> > +		return NULL;
> > +
>
> I am confused, when we come here, isn't we confirmed that we need block
> allocation, thus there is no extent start from b1?

Yes, we are sure here that there is no extent which covers b1 block.
Since I couldn't find a direct way to get the next extent (extent on the
right from the would be position of the new extent in the tree), we
make a call to ext4_ext_find_extent() to get the extent on the left, and
then use this to call ext4_ext_next_allocated_block() to get the logical
block number (LBN) of the next extent in the tree. This LBN is
compared with the LBN of the new extent plus its length, to detect an
overlap.

 
> > +	b2 = ext4_ext_next_allocated_block(path);
> > +	if (b2 == EXT_MAX_BLOCK)
> > +		return NULL;
> > +	path = ext4_ext_find_extent(inode, b2, path);
> > +	if (IS_ERR(path))
> > +		return NULL;
> > +	BUG_ON(path[depth].p_hdr == NULL);
> > +	ex = path[depth].p_ext;
> > +
>
> How useful to have the next extent pointer? It seems only used to print
> out warning messages. I am a little concerned about the expensive
> ext4_ext_find_extent(). After all ext4_ext_next_allocated_block()
> already returns the start block of next extent, isn't it?

Ok, agreed. Will get rid of this extra code.


--
Regards,
Amit Arora

Re: [RFC][Patch 1/2] Persistent preallocation in ext4

2006-12-19 Thread Amit K. Arora

On Fri, Dec 15, 2006 at 06:05:28PM +0530, Amit K. Arora wrote:
> --- linux-2.6.19.prealloc.orig/fs/ext4/ioctl.c	2006-12-15 16:44:35.0 +0530
> +++ linux-2.6.19.prealloc/fs/ext4/ioctl.c	2006-12-15 17:47:00.0 +0530
:
:
> +	handle=ext4_journal_start(inode,
> +		EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+max_blocks);

The way buffer credits are passed to ext4_journal_start() above is not
correct. The maximum number of blocks that we might modify here should be
calculated using ext4_ext_calc_credits_for_insert().
Thus the above line should be replaced with:

	mutex_lock(&EXT4_I(inode)->truncate_mutex);
	credits = ext4_ext_calc_credits_for_insert(inode, NULL);
	mutex_unlock(&EXT4_I(inode)->truncate_mutex);
	handle=ext4_journal_start(inode, credits +
			EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 1);

Following is the revised patch with the above change.

Signed-off-by: Amit Arora ([EMAIL PROTECTED])
---
 fs/ext4/extents.c   |  116 ++--
 fs/ext4/ioctl.c |   63 +
 include/linux/ext4_fs.h |   13 
 include/linux/ext4_fs_extents.h |   13 
 4 files changed, 167 insertions(+), 38 deletions(-)

Index: linux-2.6.19.prealloc/fs/ext4/extents.c
===
--- linux-2.6.19.prealloc.orig/fs/ext4/extents.c	2006-12-19 16:09:00.0 +0530
+++ linux-2.6.19.prealloc/fs/ext4/extents.c	2006-12-19 16:23:37.0 +0530
@@ -282,7 +282,7 @@
 		} else if (path->p_ext) {
 			ext_debug("  %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug("  []");
@@ -305,7 +305,7 @@
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -425,7 +425,7 @@
 	ext_debug("  -> %d:%llu:%d ",
 			le32_to_cpu(path->p_ext->ee_block),
 			ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
 	{
@@ -686,7 +686,7 @@
 		ext_debug("move %d:%llu:%d in new leaf %llu\n",
 				le32_to_cpu(path[depth].p_ext->ee_block),
 				ext_pblock(path[depth].p_ext),
-				le16_to_cpu(path[depth].p_ext->ee_len),
+				ext4_ext_get_actual_len(path[depth].p_ext),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
 				sizeof(struct ext4_extent));
@@ -1097,7 +1097,19 @@
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+	unsigned short ext1_ee_len, ext2_ee_len;
+
+	/*
+	 * Make sure that either both extents are uninitialized, or
+	 * both are _not_.
+	 */
+	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+		return 0;
+
+	ext1_ee_len = ext4_ext_get_actual_len(ex1);
+	ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+	if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
 			le32_to_cpu(ex2->ee_block))
 		return 0;
 
@@ -1106,14 +1118,14 @@
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
 		return 0;
 #ifdef AGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
 #endif
 
-	if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+	if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
 		return 1;
 	return 0;
 }
@@ -1132,9 +1144,9 @@
 	struct ext4_extent *ex, *fex;
 	struct ext4_extent *nearex; /* nearest extent */
 	struct ext4_ext_path *npath = NULL;
-	int depth, len, err, next;
+	int depth, len, err, next, uninitialized = 0;
 
-	BUG_ON(newext->ee_len == 0);
+	BUG_ON(ext4_ext_get_actual_len(newext) == 0);
 	depth = ext_depth(inode);
ex = path[depth].p_ext

Re: [RFC][Patch 2/2] Persistent preallocation in ext4

2006-12-19 Thread Amit K. Arora

I wrote a simple tool to test these patches. The tool takes four
arguments:

* command: It may have either of the two values - prealloc or write
* filename: This is the filename with relative path
* offset: The offset within the file from where the preallocation, or
the write should start.
* length: Total number of bytes to be allocated/written from offset.

Following cases were tested :
1. * preallocation from 0 to 32MB
   * write to various parts of the preallocated space in sets
   * observed that the extents get split and also get merged

2. * preallocate with holes at various places in the file
   * write to blocks starting from a hole and ending into preallocated
  blocks and vice-versa
   * try to write to entire set of blocks (i.e. from 0 to the last
  preallocated block) which has holes in between.


I also tried some random preallocation and write operations. They seem
to work fine. There is a patch also ready for e2fsprogs utils to
recognize uninitialized extents, which I used to verify the results of
the above testcases. I will post that patch in the next mail.

Here is the code for the simple tool :


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define EXT4_IOC_FALLOCATE  0x40106609

struct ext4_falloc_input {
	unsigned long long offset;
	unsigned long long len;
};

/* Ask the filesystem to preallocate [offset, offset+len) via the ioctl. */
int do_prealloc(char *fname, struct ext4_falloc_input input)
{
	int ret, fd = open(fname, O_CREAT|O_RDWR, 0666);

	if (fd < 0) {
		printf("Error opening file %s\n", fname);
		return 1;
	}

	printf("%s : Trying to preallocate blocks (offset=%llu, len=%llu)\n",
		fname, input.offset, input.len);
	ret = ioctl(fd, EXT4_IOC_FALLOCATE, &input);

	if (ret < 0) {
		printf("IOCTL: received error %d, ret=%d\n", errno, ret);
		close(fd);
		exit(1);
	}
	printf("IOCTL succeeded !  ret=%d\n", ret);
	close(fd);
	return 0;
}

/* Write len bytes at offset, e.g. into a previously preallocated range. */
int do_write(char *fname, struct ext4_falloc_input input)
{
	int fd;
	off_t pos;
	ssize_t ret;
	char *buf;

	/* zero-filled buffer, so that the data written is well defined */
	buf = calloc(1, input.len);
	if (!buf) {
		printf("Could not allocate %llu bytes\n", input.len);
		return 1;
	}

	fd = open(fname, O_CREAT|O_RDWR, 0666);
	if (fd < 0) {
		printf("Error opening file %s\n", fname);
		free(buf);
		return 1;
	}

	printf("%s : Trying to write to file (offset=%llu, len=%llu)\n",
		fname, input.offset, input.len);

	pos = lseek(fd, input.offset, SEEK_SET);
	if (pos != (off_t)input.offset) {
		printf("lseek() failed error=%d, ret=%lld\n", errno,
			(long long)pos);
		close(fd);
		free(buf);
		return 1;
	}

	ret = write(fd, buf, input.len);
	if (ret != (ssize_t)input.len) {
		printf("write() failed error=%d, ret=%lld\n", errno,
			(long long)ret);
		close(fd);
		free(buf);
		return 1;
	}

	printf("Write succeeded ! Written %llu bytes ret=%lld\n", input.len,
		(long long)ret);
	close(fd);
	free(buf);
	return 0;
}


int main(int argc, char **argv)
{
	struct ext4_falloc_input input;
	int ret = 1;
	char *fname;

	if (argc < 5) {
		printf("%s <CMD: prealloc/write> <filename-with-path> <offset> "
			"<length>\n", argv[0]);
		exit(1);
	}

	fname = argv[2];
	input.offset = strtoull(argv[3], NULL, 10);
	input.len = strtoull(argv[4], NULL, 10);

	/* both fields are unsigned, so only a zero length can be rejected */
	if (input.len == 0) {
		printf("%s: Invalid arguments.\n", argv[0]);
		exit(1);
	}

	if (!strcmp(argv[1], "prealloc"))
		ret = do_prealloc(fname, input);
	else if (!strcmp(argv[1], "write"))
		ret = do_write(fname, input);
	else
		printf("%s: Invalid arguments.\n", argv[0]);

	return ret;
}


Re: [RFC][Patch 1/2] Persistent preallocation in ext4

2006-12-19 Thread Amit K. Arora

On Tue, Dec 19, 2006 at 02:12:06PM -0700, Andreas Dilger wrote:
> Minor edits (not worth a resubmit by itself):

Thanks, Andreas ! I will take care of these comments in the next
submission.

Regards,
Amit Arora
 
> On Dec 19, 2006  16:35 +0530, Amit K. Arora wrote:
> > +	/* ext4_can_extents_be_merged should have checked that either
> > +	 * both extents are uninitialized, or both aren't. Thus we
> > +	 * need to check any of them here.
>
> s/any/only one/
>
> > +	case EXT4_IOC_PREALLOCATE: {
> > +		if (IS_RDONLY(inode))
> > +			return -EROFS;
> > +
> > +		if (copy_from_user(&input,
> > +		(struct ext4_falloc_input __user *) arg, sizeof(input)))
> > +			return -EFAULT;
> > +
> > +		if (input.len == 0)
> > +			return -EINVAL;
> > +
> > +		if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +			return -ENOTTY;
>
> May as well put this check before copy_from_user(), since it doesn't need
> the user data and is much faster to check first.
>
> > +retry:
> > +		ret = 0;
> > +		while(ret>=0 && ret<max_blocks)
> > +		{
>
> Opening brace always on same line, like "while() {"
>
> > +		if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb,
> > +			&retries))
>
> "&retries" should be aligned with start of "(inode->i_sb", on previous line.
>
> > +		if(nblocks) {
>
> Space between "if (" everywhere.
>
> Cheers, Andreas
 --
 Andreas Dilger
 Principal Software Engineer
 Cluster File Systems, Inc.

Re: [RFC][Patch 2/2] Persistent preallocation in ext4

2006-12-15 Thread Amit K. Arora

On Fri, Dec 15, 2006 at 04:02:25PM -0700, Andreas Dilger wrote:
> On Dec 15, 2006  18:09 +0530, Amit K. Arora wrote:
> > This patch makes writing to the uninitialized extent possible. A write
> > operation on an uninitialized extent *may* (depending on the relative block
> > location in the extent and number of blocks being written) result in
> > splitting the extent. There are three possibilities:
> > 1. The extent does not split : This will happen when the entire extent is
> > being written to. In this case the extent will be marked initialized and
> > merged (if possible) with the neighbouring extents in the tree.
>
> This should also be true if the write is at the beginning or the end of the
> uninitialized extent and the disk allocation matches the previous or next
> extent.  The newly-written part is merged with the adjacent extent, and the
> uninitialized extent is shrunk appropriately.

You are right, and the current patch takes care of that. If the write is
at the beginning of the uninitialized extent, the first extent (from the
split) will be initialized (ex2 in this case), and we do call
try_to_merge() to merge this with the previous extent, if possible. This
scenario can be seen as: ex1 == NULL && ex2 == ex && ex3 != NULL

(Please note that "ex" is the uninitialized extent, and "ex2" is
_always_ the initialized extent being created, whether it is on the left,
right or middle of the parent uninitialized extent.)

If the initialized extent is the second one in the split (i.e. the write is
happening on the later part of the uninitialized extent), it will result
in shrinking the existing uninitialized extent and inserting the new
initialized extent. insert_extent() will be called in this case, which
also tries to merge the extent with the neighbouring extents (both
towards the left and the right side).
The following condition will hold true in this case:
ex1 != NULL && ex2 != ex && ex3 == NULL

 
> Doing this as a special case of #2 may result in extra tree rebalancing as
> the extra extent is added and removed repeatedly (consider the case of a
> large hole being overwritten in smaller chunks that is just at the limit
> of the number of extents in the parent block).

Yes, as I mentioned, the case #2 already handles this. I guess, I should
have been explicit about it in the description...

 
> > 2. The extent splits in two portions : This will happen when someone is
> > writing to any one end of the extent (i.e. not in the middle, and not to
> > the entire extent). This will result in breaking the extent in two
> > portions, an initialized extent (the set of blocks being written to) and an
> > uninitialized extent (rest of the blocks in the parent extent).
> > 3. The extent is split in three parts: This occurs when someone writes in
> > the middle of the extent. It will result in three extents, two
> > uninitialized (at both ends) and one initialized (in the middle).
> >
> > Since the extent merge logic was getting redundant, it has been put into a
> > new function ext4_ext_try_to_merge(). This gets called from
> > ext4_ext_insert_extent() and ext4_ext_get_blocks(), when required.
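
The three cases above can be sketched as a small classifier. This is a
userspace illustration under stated assumptions, not the patch code:
count_split_pieces() is a hypothetical name, and the written range is
assumed to lie entirely inside the uninitialized extent. The ex1/ex2/ex3
names in the comments follow the convention used in this thread (ex2 is
the initialized part being written).

```c
#include <stdint.h>

/*
 * The uninitialized extent covers [ee_block, ee_block + ee_len) and the
 * write covers [wr_block, wr_block + wr_len). Returns how many extents
 * exist after the write: 1 (no split), 2 or 3.
 */
int count_split_pieces(uint32_t ee_block, uint32_t ee_len,
                       uint32_t wr_block, uint32_t wr_len)
{
    int pieces = 1;                 /* ex2: the initialized part */

    if (wr_block > ee_block)
        pieces++;                   /* ex1: uninitialized head remains */
    if (wr_block + wr_len < ee_block + ee_len)
        pieces++;                   /* ex3: uninitialized tail remains */
    return pieces;
}
```

A write covering the whole extent gives 1, a write at either end gives 2,
and a write strictly in the middle gives 3. Whether the resulting pieces
are then merged with neighbouring extents is a separate step and is not
modelled here.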
 
 Cheers, Andreas
 --
 Andreas Dilger
 Principal Software Engineer
 Cluster File Systems, Inc.


Regards,

Amit Arora ([EMAIL PROTECTED])
Linux Technology Center
IBM India


Re: [RFC][Patch 1/1] Persistent preallocation in ext4

2006-12-13 Thread Amit K. Arora

On Tue, Dec 12, 2006 at 04:20:38PM -0800, Mingming Cao wrote:
> On Tue, 2006-12-12 at 11:53 +0530, Amit K. Arora wrote:
> > +
> > +	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +		return -ENOTTY;
>
> Supporting preallocation for extent based files seems fairly
> straightforward.  I agree we should look at this first.  After we get this
> done, it is probably worth reconsidering whether to support preallocation
> for non-extent based files on ext4. I could imagine users upgrading from
> ext3 to ext4 and expecting to use preallocation on those existing files.

I gave this some thought initially. But I was not sure how we should
implement preallocation in a non-extent based file. Using extents we can
mark a set of blocks as uninitialized, but how will we do this for
non-extent based files? If we do not have a way to mark blocks
uninitialized, then when someone tries to read from a preallocated block,
it will return junk/stale data instead of zeroes.

But if we can think of a solution here, then it will be as simple as
removing the check above and replacing ext4_ext_get_blocks() with
ext4_get_blocks_wrap() in the while() loop.

 
> > +
> > +	block = EXT4_BLOCK_ALIGN(input.offset, blkbits) >> blkbits;
> > +	max_blocks = EXT4_BLOCK_ALIGN(input.len, blkbits) >> blkbits;

I was wondering if I should change the above lines to this:

+	block = input.offset >> blkbits;
+	max_blocks = (EXT4_BLOCK_ALIGN(input.len+input.offset,
			blkbits) >> blkbits) - block;

The reason is that the block which contains the offset should also be
preallocated, and max_blocks should be calculated accordingly.
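
As a worked example of the revised calculation, here is a small userspace
sketch. It assumes EXT4_BLOCK_ALIGN() rounds a byte count up to the next
multiple of the block size; bytes_to_blocks() and struct block_range are
hypothetical names used only for this illustration.

```c
#include <stdint.h>

struct block_range {
    uint64_t first;   /* first logical block to preallocate */
    uint64_t count;   /* number of blocks (max_blocks) */
};

/* Convert a byte range [offset, offset + len) into a block range. */
struct block_range bytes_to_blocks(uint64_t offset, uint64_t len,
                                   unsigned int blkbits)
{
    uint64_t blksize = (uint64_t)1 << blkbits;
    struct block_range r;
    uint64_t end;

    /* the block that contains `offset` (round down) */
    r.first = offset >> blkbits;
    /* one past the last block touched by the range (round up) */
    end = (offset + len + blksize - 1) >> blkbits;
    r.count = end - r.first;
    return r;
}
```

With 4K blocks (blkbits = 12), offset = 5000 and len = 3000 give first = 1
and count = 1. The earlier formula would have rounded the offset up and
started at block 2, skipping the block that actually holds the range.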

> > +	while(ret>=0 && ret<max_blocks)
> > +	{
> > +		block = block + ret;
> > +		max_blocks = max_blocks - ret;
> > +		ret = ext4_ext_get_blocks(handle, inode, block,
> > +				max_blocks, &map_bh,
> > +				EXT4_CREATE_UNINITIALIZED_EXT, 1);
>
> Since the interface takes offset and number of blocks to allocate, I am
> assuming we are going to handle holes in preallocation; thus, we cannot
> set the extend_size flag to 1 when calling ext4_ext_get_blocks.
>
> I think we should update i_size and i_disksize after preallocation. Oh,
> to protect parallel updating i_size, we have to take i_mutex down.

Okay. So, is this what you want to be done here :

+retry:
+	ret = 0;
+	while(ret>=0 && ret<max_blocks)
+	{
+		block = block + ret;
+		max_blocks = max_blocks - ret;
+		ret = ext4_ext_get_blocks(handle, inode, block,
+				max_blocks, &map_bh,
+				EXT4_CREATE_UNINITIALIZED_EXT, 0);
+		if(ret > 0 && test_bit(BH_New, &map_bh.b_state))
+			nblocks = nblocks + ret;
+	}
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb,
+			&retries))
+		goto retry;
+
+	if(nblocks) {
+		mutex_lock(&inode->i_mutex);
+		inode->i_size = inode->i_size + (nblocks << blkbits);
+		EXT4_I(inode)->i_disksize = inode->i_size;
+		mutex_unlock(&inode->i_mutex);
+	}
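
To show how the loop above advances through the range, here is a
userspace simulation. Everything in it is an assumption made for
illustration: get_blocks_fn and demo_get_blocks() are hypothetical
stand-ins for ext4_ext_get_blocks(), and the guard against a zero return
is my addition, since a zero return would otherwise make the loop spin.

```c
#include <stdint.h>

/* Maps up to max_blocks at `block`; returns blocks mapped or a negative
 * error, and sets *is_new when the blocks were newly allocated. */
typedef long (*get_blocks_fn)(uint64_t block, unsigned long max_blocks,
                              int *is_new);

/* Demo stub: blocks 0..15 already exist, at most 8 blocks per call. */
long demo_get_blocks(uint64_t block, unsigned long max_blocks, int *is_new)
{
    *is_new = (block >= 16);
    return (long)(max_blocks < 8 ? max_blocks : 8);
}

/* Same control flow as the loop above; *nblocks counts only new blocks. */
long prealloc_range(uint64_t block, unsigned long max_blocks,
                    get_blocks_fn get_blocks, unsigned long *nblocks)
{
    long ret = 0;

    *nblocks = 0;
    while (ret >= 0 && (unsigned long)ret < max_blocks) {
        int is_new = 0;

        block += (uint64_t)ret;            /* skip what was just mapped */
        max_blocks -= (unsigned long)ret;  /* and shrink the request */
        ret = get_blocks(block, max_blocks, &is_new);
        if (ret == 0)
            return -1;                     /* assumed: no progress is an error */
        if (ret > 0 && is_new)
            *nblocks += (unsigned long)ret;
    }
    return ret;
}
```

With the demo stub, a request for 20 blocks starting at block 0 walks
through two already-allocated chunks and then allocates 4 new blocks, so
*nblocks ends up as 4 rather than 20.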

 
> > +	}
> > +	ext4_mark_inode_dirty(handle, inode);
> > +	ext4_journal_stop(handle);
> > +
>
> The error code returned by ext4_journal_stop() is being ignored here, is
> this right?
> Well, there are other places in ext34/ioctl.c which ignore the value
> returned by ext4_journal_stop(); maybe we should fix this in a separate
> patch.

Agreed. I think following should take care of it:

+	ext4_mark_inode_dirty(handle, inode);
+	ret2 = ext4_journal_stop(handle);
+	if(ret > 0)
+		ret = ret2;
+	return ret > 0 ? nblocks : ret;

> > +	return ret>0?0:ret;
  +   }
 
 
> Oh, what if we failed to allocate the full amount of blocks? i.e., the
> ext4_ext_get_blocks() returns -ENOSPC error and exits the loop early. Are
> we going to return an error, or try something like
>
> if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
>	goto retry
>
> I wonder whether it might be useful to return the actual number of blocks
> preallocated back to the application.

Ok. Yes, makes sense. We can return the number of new blocks like
this:
+	return ret > 0 ? nblocks : ret;



Please let me know if you agree with the above set of changes, and any
further comments you have. I will then update and test the new patch and
post it again. Thanks!

Regards,
Amit Arora

Re: [RFC][Patch 1/1] Persistent preallocation in ext4

2006-12-11 Thread Amit K. Arora

Hi Mingming,

On Mon, Dec 11, 2006 at 05:28:15PM -0800, Mingming Cao wrote:
> On Wed, 2006-12-06 at 11:28 +0530, Amit K. Arora wrote:
>
> > @@ -1142,13 +1155,22 @@
> >  	/* try to insert block into found extent and return */
> >  	if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
> >  		ext_debug("append %d block to %d:%d (from %llu)\n",
> > -				le16_to_cpu(newext->ee_len),
> > +				ext4_ext_get_actual_len(newext),
> >  				le32_to_cpu(ex->ee_block),
> > -				le16_to_cpu(ex->ee_len), ext_pblock(ex));
> > +				ext4_ext_get_actual_len(ex), ext_pblock(ex));
> >  		if ((err = ext4_ext_get_access(handle, inode, path + depth)))
> >  			return err;
> > -		ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
> > -				+ le16_to_cpu(newext->ee_len));
> > +
> > +		/* ext4_can_extents_be_merged should have checked that either
> > +		 * both extents are uninitialized, or both aren't. Thus we
> > +		 * need to check any of them here.
> > +		 */
> > +		if (ext4_ext_is_uninitialized(ex))
> > +			uninitialized = 1;
> > +		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
> > +				+ ext4_ext_get_actual_len(newext));

The line above will remove the uninitialized bit from ex, if it was set.
We call ext4_ext_get_actual_len() to get the actual lengths of the two
extents, which masks out the MSB of ee_len (the MSB of ee_len is used to
mark an extent uninitialized). We do this because if the lengths of two
uninitialized extents are added as they are (i.e. without masking out
the MSB in each length), the sum overflows and the MSB in ee_len is lost.
For example, 0x8002 + 0x8003 = 0x10005 = 0x5 (since ee_len is 16 bits).

That is why, just before this line, we save the state of this extent
(whether it was uninitialized or not), and we restore this state below.
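
The arithmetic can be checked with a small userspace sketch. It only
models the encoding as described in this mail (MSB of the 16-bit length
marks an uninitialized extent); UNINIT_BIT, actual_len(), naive_merge_len()
and merge_len() are hypothetical names, not the kernel helpers.

```c
#include <stdint.h>

#define UNINIT_BIT 0x8000u   /* MSB of the 16-bit length field */

/* Length with the uninitialized marker masked out. */
uint16_t actual_len(uint16_t ee_len)
{
    return (uint16_t)(ee_len & 0x7fffu);
}

int is_uninit(uint16_t ee_len)
{
    return (ee_len & UNINIT_BIT) != 0;
}

/* Adding the raw fields: the two markers overflow out of 16 bits. */
uint16_t naive_merge_len(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

/* Save the state, add the actual lengths, then restore the marker. */
uint16_t merge_len(uint16_t a, uint16_t b)
{
    int uninit = is_uninit(a);
    uint16_t sum = (uint16_t)(actual_len(a) + actual_len(b));

    if (uninit)
        sum = (uint16_t)(sum | UNINIT_BIT);
    return sum;
}
```

Here naive_merge_len(0x8002, 0x8003) yields 0x0005, which looks like an
initialized extent of 5 blocks, while merge_len() yields 0x8005 and keeps
the merged extent marked uninitialized.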

> > +		if(uninitialized)
> > +			ext4_mark_uninitialized_ext(ex);
> >  		eh = path[depth].p_hdr;
> >  		nearex = ex;
> >  		goto merge;
>
> Hmm, I missed the point of re-marking an uninitialized extent here. If ex
> is an uninitialized extent, the mark (the first bit of ee_len) should
> still be there after the update, shouldn't it?  We already make sure that
> two large uninitialized extents can't get merged if the resulting length
> would take the first bit, which is used as the mark of an uninitialized
> extent.

Please get back if you do not agree with the explanation above and if I
am missing something here. Thanks!

Regards,
Amit Arora
