Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-13 Thread Kalpak Shah
On Fri, 2007-07-13 at 09:59 +0530, Aneesh Kumar K.V wrote:
 
 Kalpak Shah wrote:
  On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
  On Sun, 01 Jul 2007 03:36:56 -0400
  Mingming Cao [EMAIL PROTECTED] wrote:
 
  This patch is a spinoff of the old nanosecond patches.
  I don't know what the old nanosecond patches are.  A link to a suitable
  changelog for those patches would do in a pinch.  Preferable would be to
  write a proper changelog for this patch.
  
  The incremental patch contains a proper changelog describing the patch.
  
 
 
 Instead of putting incremental patches it would be nice if we can have
 replacement patches for the already existing patches with the comments
 addressed. For example, if we have a review comment on the patch message
 (commit log) then adding an incremental patch doesn't help.

I think that it would be easier to review just the changes that have
been made to the patches instead of having people go through the entire
patch again. I was hoping that someone with write access to ext4-git
would update the commit logs.

If replacement patches are preferred, then I will send them again.

Thanks,
Kalpak.

 
 
 -aneesh
 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-13 Thread Christoph Hellwig
On Fri, Jul 13, 2007 at 06:17:55PM +0530, Amit K. Arora wrote:
  /*
 + * sys_fallocate - preallocate blocks or free preallocated blocks
 + * @fd: the file descriptor
 + * @mode: mode specifies the behavior of allocation.
 + * @offset: The offset within file, from where allocation is being
 + *   requested. It should not have a negative value.
 + * @len: The amount of space in bytes to be allocated, from the offset.
 + *This can not be zero or a negative value.

kerneldoc comments are for in-kernel APIs, which syscalls aren't.  I'd say
just remove this comment; the manpage is much better documentation anyway.

 + * TBD Generic fallocate to be added for file systems that do not
 + *support fallocate.

Please remove the comment, adding a generic fallback in kernelspace is a
very dumb idea as we already discussed long time ago.

 --- linux-2.6.22.orig/include/linux/fs.h
 +++ linux-2.6.22/include/linux/fs.h
 @@ -266,6 +266,21 @@ extern int dir_notify_enable;
  #define SYNC_FILE_RANGE_WRITE        2
  #define SYNC_FILE_RANGE_WAIT_AFTER   4
  
 +/*
 + * sys_fallocate modes
 + * Currently sys_fallocate supports two modes:
 + * FALLOC_ALLOCATE : This is the preallocate mode, using which an application
 + *   may request reservation of space for a particular file.
 + *   The file size will be changed if the allocation is
 + *   beyond EOF.
 + * FALLOC_RESV_SPACE :   This is same as the above mode, with only one difference
 + *   that the file size will not be modified.
 + */
 +#define FALLOC_FL_KEEP_SIZE    0x01 /* default is extend/shrink size */
 +
 +#define FALLOC_ALLOCATE        0
 +#define FALLOC_RESV_SPACE      FALLOC_FL_KEEP_SIZE

Just remove FALLOC_ALLOCATE, 0 flags should be the default.  I'm also
not sure there is any point in having two namespaces now that we have a
flags-based ABI.

Also please don't add this to fs.h.  fs.h is a complete mess and the
falloc flags are a new user ABI.  Add a linux/falloc.h instead which can
be added to headers-y so the ABI constant can be exported to userspace.



[PATCH 6/6][TAKE7] ext4: change for better extent-to-group alignment

2007-07-13 Thread Amit K. Arora
From: Amit Arora [EMAIL PROTECTED]

Change on-disk format for extent to represent uninitialized/initialized extents

This change was suggested by Andreas Dilger. 
This patch changes the EXT_MAX_LEN value and extent code which marks/checks
uninitialized extents. With this change it will be possible to have
initialized extents with 2^15 blocks (earlier the max blocks we could have
was 2^15 - 1). This way we can have better extent-to-group alignment.
Now, maximum number of blocks we can have in an initialized extent is 2^15
and in an uninitialized extent is 2^15 - 1.

This patch takes care of Andreas's suggestion of using EXT_INIT_MAX_LEN
instead of 0x8000 at some places.
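
The encoding described above can be modelled in userspace as follows. This is only a sketch of the rule stated in the header comment of this patch (a value up to and including 0x8000 is an initialized length, anything larger is an uninitialized extent of length ee_len - 0x8000). The names MODEL_INIT_MAX_LEN, model_is_uninitialized(), model_actual_len() and model_mark_uninitialized() are made up for illustration and are not the kernel helpers; endianness conversion is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants mirroring the limits described in the patch. */
#define MODEL_INIT_MAX_LEN   (1UL << 15)                /* 32768 */
#define MODEL_UNINIT_MAX_LEN (MODEL_INIT_MAX_LEN - 1)   /* 32767 */

/* A raw value of exactly 0x8000 is treated as initialized. */
static int model_is_uninitialized(uint16_t ee_len)
{
	return ee_len > MODEL_INIT_MAX_LEN;
}

/* Strip the marker bit, except for the special 0x8000 case. */
static unsigned int model_actual_len(uint16_t ee_len)
{
	if (ee_len <= MODEL_INIT_MAX_LEN)
		return ee_len;
	return (unsigned int)(ee_len - MODEL_INIT_MAX_LEN);
}

/* Set the MSB; only lengths 1..32767 can be marked uninitialized. */
static uint16_t model_mark_uninitialized(uint16_t len)
{
	assert(len >= 1 && len <= MODEL_UNINIT_MAX_LEN);
	return (uint16_t)(len | MODEL_INIT_MAX_LEN);
}
```

The zero-length check in model_mark_uninitialized() corresponds to the "uninitialized && num" test in the ext4_ext_rm_leaf() hunk: marking a zero-length extent would produce 0x8000, which reads back as a full-size initialized extent.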

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1106,7 +1106,7 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   unsigned short ext1_ee_len, ext2_ee_len;
+   unsigned short ext1_ee_len, ext2_ee_len, max_len;
 
/*
 * Make sure that either both extents are uninitialized, or
@@ -1115,6 +1115,11 @@ ext4_can_extents_be_merged(struct inode 
if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
return 0;
 
+   if (ext4_ext_is_uninitialized(ex1))
+   max_len = EXT_UNINIT_MAX_LEN;
+   else
+   max_len = EXT_INIT_MAX_LEN;
+
ext1_ee_len = ext4_ext_get_actual_len(ex1);
ext2_ee_len = ext4_ext_get_actual_len(ex2);
 
@@ -1127,7 +1132,7 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > max_len)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1814,7 +1819,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
ex->ee_block = cpu_to_le32(block);
ex->ee_len = cpu_to_le16(num);
-   if (uninitialized)
+   /*
+* Do not mark uninitialized if all the blocks in the
+* extent have been removed.
+*/
+   if (uninitialized && num)
ext4_ext_mark_uninitialized(ex);
 
err = ext4_ext_dirty(handle, inode, path + depth);
@@ -2307,6 +2316,19 @@ int ext4_ext_get_blocks(handle_t *handle
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
 
+   /*
+* See if request is beyond maximum number of blocks we can have in
+* a single extent. For an initialized extent this limit is
+* EXT_INIT_MAX_LEN and for an uninitialized extent this limit is
+* EXT_UNINIT_MAX_LEN.
+*/
+   if (max_blocks > EXT_INIT_MAX_LEN &&
+   create != EXT4_CREATE_UNINITIALIZED_EXT)
+   max_blocks = EXT_INIT_MAX_LEN;
+   else if (max_blocks > EXT_UNINIT_MAX_LEN &&
+create == EXT4_CREATE_UNINITIALIZED_EXT)
+   max_blocks = EXT_UNINIT_MAX_LEN;
+
/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
newex.ee_block = cpu_to_le32(iblock);
newex.ee_len = cpu_to_le16(max_blocks);
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru
 
 #define EXT_MAX_BLOCK  0x
 
-#define EXT_MAX_LEN    ((1UL << 15) - 1)
+/*
+ * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
+ * initialized extent. This is 2^15 and not (2^16 - 1), since we use the
+ * MSB of ee_len field in the extent datastructure to signify if this
+ * particular extent is an initialized extent or an uninitialized (i.e.
+ * preallocated).
+ * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an
+ * uninitialized extent.
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an
+ * uninitialized one. In other words, if MSB of ee_len is set, it is an
+ * uninitialized extent with only one special scenario when ee_len = 0x8000.
+ * In this case we can not have an uninitialized extent of zero length and
+ * thus we make it as a special case of initialized extent with 0x8000 length.
+ * This way we get better extent-to-group alignment for initialized extents.
+ * Hence, the maximum number of blocks we can have in an *initialized*
+ * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1 (32767).
+ */
+#define EXT_INIT_MAX_LEN   (1UL << 15)
+#define EXT_UNINIT_MAX_LEN (EXT_INIT_MAX_LEN - 1)
 
 
 

[PATCH 3/6][TAKE7] revalidate write permissions for fallocate

2007-07-13 Thread Amit K. Arora
From: David P. Quigley [EMAIL PROTECTED]

Revalidate the write permissions for fallocate(2), in case security policy has
changed since the files were opened.

Acked-by: James Morris [EMAIL PROTECTED]
Signed-off-by: David P. Quigley [EMAIL PROTECTED]

---
 fs/open.c |3 +++
 1 files changed, 3 insertions(+)

Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -407,6 +407,9 @@ asmlinkage long sys_fallocate(int fd, in
goto out;
if (!(file->f_mode & FMODE_WRITE))
goto out_fput;
+   ret = security_file_permission(file, MAY_WRITE);
+   if (ret)
+   goto out_fput;
 
inode = file->f_path.dentry->d_inode;
 


[PATCH 4/6][TAKE7] ext4: fallocate support in ext4

2007-07-13 Thread Amit K. Arora
From: Amit Arora [EMAIL PROTECTED]

fallocate support in ext4

This patch implements the ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use the fallocate() system
call for persistent preallocation. The current implementation only supports
preallocation for regular files (directories are not supported as of date)
with extent maps. This patch does not support block-mapped files currently.
Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of
now.


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int ret = 0;
 
b1 = le32_to_cpu(newext->ee_block);
-   len1 = le16_to_cpu(newext->ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
int depth, len, err, next;
+   unsigned uninitialized = 0;
 
-   BUG_ON(newext->ee_len == 0);
+   BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);
ex = path[depth].p_ext;
BUG_ON(path[depth].p_hdr == NULL);
@@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
ext_debug("append %d block to %d:%d (from %llu)\n",
-   le16_to_cpu(newext->ee_len),
+   

[PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-13 Thread Amit K. Arora
From: Amit Arora [EMAIL PROTECTED]

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called ->fallocate().
Applications can use this feature to avoid fragmentation to a certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if later
the system becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Though this has the advantage of working
on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and, in case the kernel/filesystem does not implement it, fall
back to the current implementation of writing zeroes to the new blocks.
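
The "try the system call first, then fall back" behaviour expected of posix_fallocate() could look roughly like the sketch below. It is not glibc's implementation. It assumes a present-day glibc that exposes a fallocate() wrapper under _GNU_SOURCE (this did not exist when this patch was posted), and prealloc_with_fallback() is a made-up name. The slow path is deliberately simplified: it overwrites the whole range with zeroes, so it is only safe on a range that holds no data yet.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 on success or an errno value on failure.
 * Fast path: ask the kernel/filesystem to preallocate.
 * Slow path: write zeroes, as described for the current posix_fallocate().
 */
static int prealloc_with_fallback(int fd, off_t offset, off_t len)
{
	char zeroes[4096];
	off_t done = 0;

	if (offset < 0 || len <= 0)
		return EINVAL;

	if (fallocate(fd, 0, offset, len) == 0)
		return 0;
	/* Only fall back when the call or the filesystem lacks support. */
	if (errno != ENOSYS && errno != EOPNOTSUPP)
		return errno;

	/* Simplification: this clobbers existing data in the range. */
	memset(zeroes, 0, sizeof(zeroes));
	while (done < len) {
		size_t chunk = sizeof(zeroes);
		ssize_t n;

		if ((off_t)chunk > len - done)
			chunk = (size_t)(len - done);
		n = pwrite(fd, zeroes, chunk, offset + done);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return errno;
		}
		done += n;
	}
	return 0;
}
```

Either path leaves the file at least offset + len bytes long, which matches the default (size-extending) mode described in this series.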
ToDos:
1. Implementation on other architectures (other than i386, x86_64,
   and ppc). Patches for s390(x) and ia64 are already available from
   previous posts, but it was decided that they should be added later
   once fallocate is in the mainline. Hence not including those patches
   in this take.
2. A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3. Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+   .long sys_fallocate
Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
===
--- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+u32 lenhi, u32 lenlo)
+{
+   return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 unsigned long low)
 {
Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
.quad compat_sys_signalfd
.quad compat_sys_timerfd
.quad sys_eventfd
+   .quad sys32_fallocate
 ia32_syscall_end:
Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies the behavior of allocation.
+ * @offset: The offset within file, from where allocation is being
+ * requested. It should not have a negative value.
+ * @len: The amount of space in bytes to be allocated, from the offset.
+ *  This can not be zero or a negative value.
+ *
+ * This system call preallocates space for a file. The range of blocks
+ * allocated depends on the value of offset and len arguments provided
+ * by the user/application. With FALLOC_ALLOCATE or FALLOC_RESV_SPACE
+ * modes, if the system call succeeds, subsequent writes to the file in
+ * the given range (specified by offset & len) should not fail - even if
+ * the file system later becomes full. Hence the preallocation done is
+ * persistent (valid even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by
+ * the individual file systems will update the file size and/or
+ * ctime/mtime depending on the mode and also on the success of the
+ * operation.
+ *
+ * Note: In case the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ * 0   : On 

[PATCH 0/6][TAKE7] fallocate system call

2007-07-13 Thread Amit K. Arora
This is the latest fallocate patchset and is based on 2.6.22.

* Following are the changes from TAKE6:
1) We now just have two modes (and no deallocation modes).
2) Updated the man page
3) Added a new patch submitted by David P. Quigley  (Patch 3/6).
4) Used EXT_INIT_MAX_LEN instead of 0x8000 in Patch 6/6.
5) Included below in the end is a small testcase to test fallocate.

* Following are the changes from TAKE5 to TAKE6:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can
   add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
   alignment in ext4 (Patch 6/6). Please refer following post:
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg02445.html
5) Renamed mode flags and values from FA_ to FALLOC_
6) Added manpage (updated version of the one initially submitted by
   David Chinner).
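
For readers wondering what the compat wrapper in item 2 has to do: a 32-bit caller passes each 64-bit argument as two 32-bit halves, and the wrapper joins them before calling the real system call. A minimal userspace sketch of that joining step follows; join_halves() is a name made up for this example, and the fixed-width stdint types stand in for the kernel's u32 and loff_t.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine the high and low 32-bit halves of a 64-bit offset or length.
 * The widening cast must come before the shift: shifting a 32-bit value
 * by 32 bits is undefined behaviour in C.
 */
static int64_t join_halves(uint32_t hi, uint32_t lo)
{
	return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

For example, join_halves(1, 0) is 4294967296 (4 GiB), the first offset that no longer fits in the low half alone.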


Todos:
-
1) Implementation on other architectures (other than i386, x86_64,
   and ppc64). s390(x) and ia64 patches are ready and will be pushed
   by platform maintainers when fallocate is in mainline.
2) A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3) Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4) Patch to e2fsprogs to recognize and display uninitialized extents.


Following patches follow:
Patch 1/6 : manpage for fallocate
Patch 2/6 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/6 : revalidate write permissions for fallocate
Patch 4/6 : ext4: fallocate support in ext4
Patch 5/6 : ext4: write support for preallocated blocks
Patch 6/6 : ext4: change for better extent-to-group alignment

Note: Attached below is a small testcase to test fallocate. The __NR_fallocate
will need to be changed depending on the system call number in the kernel (it
may get changed due to merge) and also depending on the architecture.

--
Regards,
Amit Arora



#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

#include <linux/unistd.h>
#include <sys/vfs.h>
#include <sys/stat.h>

#define VERBOSE 0

#define __NR_fallocate  324

#define FALLOC_FL_KEEP_SIZE 0x01
#define FALLOC_ALLOCATE 0x0
#define FALLOC_RESV_SPACE   FALLOC_FL_KEEP_SIZE


int do_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
  int ret;

  if (VERBOSE)
    printf("Trying to preallocate blocks (offset=%llu, len=%llu)\n",
           offset, len);
  ret = syscall(__NR_fallocate, fd, mode, offset, len);

  if (ret < 0) {
    printf("SYSCALL: received error %d, ret=%d\n", errno, ret);
    close(fd);
    return(1);
  }

  if (VERBOSE)
    printf("fallocate system call succeeded !  ret=%d\n", ret);

  return ret;
}

int test_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
  int ret, blocks;
  struct stat statbuf1, statbuf2;

  fstat(fd, &statbuf1);

  ret = do_fallocate(fd, mode, offset, len);

  fstat(fd, &statbuf2);

  /* check file size after preallocation */
  if (mode == FALLOC_ALLOCATE) {
    if (!ret && statbuf1.st_size < (offset + len) &&
        statbuf2.st_size != (offset + len)) {
      printf("Error: fallocate succeeded, but the file size did not change, where it should have!\n");
      ret = 1;
    }
  } else if (statbuf1.st_size != statbuf2.st_size) {
    printf("Error : File size changed, when it should not have!\n");
    ret = 1;
  }

  blocks = ((statbuf2.st_blocks - statbuf1.st_blocks) * 512) /
           statbuf2.st_blksize;

  /* Print report */
  printf("# FALLOCATE TEST REPORT #\n");
  printf("\tNew blocks preallocated = %d.\n", blocks);
  printf("\tNumber of bytes preallocated = %d\n", blocks * statbuf2.st_blksize);
  printf("\tOld file size = %d, New file size %d.\n",
         statbuf1.st_size, statbuf2.st_size);
  printf("\tOld num blocks = %d, New num blocks %d.\n",
         (statbuf1.st_blocks * 512)/1024, (statbuf2.st_blocks * 512)/1024);

  return ret;
}


int do_write(int fd, loff_t offset, loff_t len)
{
  int ret;
  char *buf;

  buf = (char *)malloc(len);
  if (!buf) {
    printf("error: malloc failed.\n");
    return(-1);
  }

  if (VERBOSE)
    printf("Trying to write to file (offset=%llu, len=%llu)\n",
           offset, len);

  ret = lseek(fd, offset, SEEK_SET);
  if (ret != offset) {
    printf("lseek() failed error=%d, ret=%d\n", errno, ret);
    close(fd);
    return(-1);
  }

  ret = write(fd, buf, len);
  if (ret != len) {
    printf("write() failed error=%d, ret=%d\n", errno, ret);
    close(fd);
    return(-1);
  }

  if (VERBOSE)
    printf("Write succeeded ! Written %llu bytes ret=%d\n", len, ret);

  return ret;
}


int test_write(int fd, loff_t offset, loff_t len)
{
  int ret;

  ret = do_write(fd, offset, len);
  

Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.

2007-07-13 Thread Kalpak Shah
The updated patch is attached. comments inline...

On Tue, 2007-07-10 at 22:40 -0700, Andrew Morton wrote:
  If we exceed 65000 subdirectories in an htree directory it sets the
  inode link count to 1 and no longer counts subdirectories.  The
  directory link count is not actually used when determining if a
  directory is empty, as that only counts subdirectories and not regular
  files that might be in there. 
  
  A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if
  the subdir count for any directory crosses 65000.
  
 
 Would I be correct in assuming that a later fsck will clear
 EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any 65000 subdir
 directories?
 
 If so, that is worth a mention in the changelog, perhaps?

The changelog has been updated to include this.

   
  +static inline void ext4_inc_count(handle_t *handle, struct inode *inode)
  +{
  +   inc_nlink(inode);
  +   if (is_dx(inode) && inode->i_nlink > 1) {
  +   /* limit is 16-bit i_links_count */
  +   if (inode->i_nlink >= EXT4_LINK_MAX || inode->i_nlink == 2) {
  +   inode->i_nlink = 1;
  +   EXT4_SET_RO_COMPAT_FEATURE(inode->i_sb,
  + EXT4_FEATURE_RO_COMPAT_DIR_NLINK);
  +   }
  +   }
  +}
 
 Looks too big to be inlined.
 
 Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2?

I have added a comment for this. (since it indicates that nlinks==1
previously).

 
  +static inline void ext4_dec_count(handle_t *handle, struct inode *inode)
  +{
  +   drop_nlink(inode);
  +   if (S_ISDIR(inode->i_mode) && inode->i_nlink == 0)
  +   inc_nlink(inode);
  +}
 
 Probably too big to inline.

Removed the inline.

   
  -   if (inode->i_nlink >= EXT4_LINK_MAX)
  +   if (EXT4_DIR_LINK_MAX(inode))
  return -EMLINK;
 
 argh.  WHY_IS_EXT4_FULL_OF_UPPER_CASE_MACROS_WHICH_COULD_BE_IMPLEMENTED
 as_lower_case_inlines?  Sigh.  It's all the old-timers, I guess.
 
 EXT4_DIR_LINK_MAX() is buggy: it evaluates its arg twice.

#define EXT4_DIR_LINK_MAX(dir) (!is_dx(dir) && (dir)->i_nlink >= EXT4_LINK_MAX)

This just checks if directory has hash indexing in which case we need not worry 
about EXT4_LINK_MAX subdir limit. If directory is not hash indexed then we will 
need to enforce a max subdir limit. 

Sorry, I didn't understand what is the problem with this macro?
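
For reference, the double-evaluation concern raised about this macro can be reproduced outside the kernel. The sketch below uses made-up names throughout (struct model_dir, MODEL_LINK_MAX, next_dir(), model_dir_link_max()) and is not ext4 code; it only shows that a macro which mentions its argument twice runs any side effect in that argument twice, while an inline function runs it once.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_LINK_MAX 32000u  /* arbitrary illustrative limit */

/* Made-up stand-in for the inode fields the macro looks at. */
struct model_dir {
	bool indexed;        /* plays the role of is_dx(dir) */
	unsigned int nlink;  /* plays the role of dir->i_nlink */
};

/* Same shape as the macro under discussion: 'dir' appears twice. */
#define MODEL_DIR_LINK_MAX(dir) \
	(!(dir)->indexed && (dir)->nlink >= MODEL_LINK_MAX)

/* Inline-function equivalent: the argument is evaluated exactly once. */
static inline bool model_dir_link_max(const struct model_dir *dir)
{
	return !dir->indexed && dir->nlink >= MODEL_LINK_MAX;
}

/* An argument expression with a visible side effect. */
static int calls;
static struct model_dir *next_dir(struct model_dir *d)
{
	calls++;
	return d;
}
```

With a non-indexed directory, MODEL_DIR_LINK_MAX(next_dir(&d)) calls next_dir() twice and model_dir_link_max(next_dir(&d)) calls it once. The current callers pass a plain pointer, so nothing misbehaves today; the hazard only appears if a caller ever passes an expression with side effects.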

Thanks,
Kalpak.
This patch adds support to ext4 for allowing more than 65000
subdirectories. Currently the maximum number of subdirectories is capped
at 32000.

If we exceed 65000 subdirectories in an htree directory it sets the
inode link count to 1 and no longer counts subdirectories.  The
directory link count is not actually used when determining if a
directory is empty, as that only counts subdirectories and not regular
files that might be in there. 

An EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if
the subdir count for any directory crosses 65000. A later fsck will clear
EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any directories
with 65000 subdirs.
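
The link-count behaviour described above can be tried out with a small userspace model. This is a sketch only: struct model_dir and the model_* functions are invented for illustration, and MODEL_LINK_MAX is assumed to be 65000 based on the number quoted in this changelog (the real definition is in ext4_fs.h).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed from the "65000" in the changelog; not taken from ext4_fs.h. */
#define MODEL_LINK_MAX 65000u

/* Invented stand-in for the inode and superblock state involved. */
struct model_dir {
	bool indexed;            /* plays the role of is_dx(inode) */
	unsigned int nlink;      /* plays the role of inode->i_nlink */
	bool dir_nlink_feature;  /* plays the role of the RO_COMPAT flag */
};

/* Mirrors the logic of ext4_inc_count() in this patch. */
static void model_inc_count(struct model_dir *d)
{
	d->nlink++;
	if (d->indexed && d->nlink > 1) {
		/* Overflow, or nlink was already pinned at 1 before. */
		if (d->nlink >= MODEL_LINK_MAX || d->nlink == 2) {
			d->nlink = 1;
			d->dir_nlink_feature = true;
		}
	}
}

/* Mirrors ext4_dec_count(); the model is always a directory. */
static void model_dec_count(struct model_dir *d)
{
	assert(d->nlink > 0);
	d->nlink--;
	if (d->nlink == 0)
		d->nlink = 1;  /* keep the "too many subdirs" marker */
}
```

Once an indexed directory reaches the limit its count is pinned at 1: further increments and decrements leave it at 1, which is why the count can no longer be used to tell how many subdirectories exist.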

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Kalpak Shah [EMAIL PROTECTED]


---
 fs/ext4/namei.c |   52 +++-
 include/linux/ext4_fs.h |4 ++-
 2 files changed, 41 insertions(+), 15 deletions(-)

Index: linux-2.6.22/fs/ext4/namei.c
===
--- linux-2.6.22.orig/fs/ext4/namei.c
+++ linux-2.6.22/fs/ext4/namei.c
@@ -1617,6 +1617,35 @@ static int ext4_delete_entry (handle_t *
 	return -ENOENT;
 }
 
+/*
+ * DIR_NLINK feature is set if 1) nlinks > EXT4_LINK_MAX or 2) nlinks == 2,
+ * since this indicates that nlinks count was previously 1.
+ */
+static void ext4_inc_count(handle_t *handle, struct inode *inode)
+{
+	inc_nlink(inode);
+	if (is_dx(inode) && inode->i_nlink > 1) {
+		/* limit is 16-bit i_links_count */
+		if (inode->i_nlink >= EXT4_LINK_MAX || inode->i_nlink == 2) {
+			inode->i_nlink = 1;
+			EXT4_SET_RO_COMPAT_FEATURE(inode->i_sb,
+	  EXT4_FEATURE_RO_COMPAT_DIR_NLINK);
+		}
+	}
+}
+
+/*
+ * If a directory had nlink == 1, then we should let it be 1. This indicates
+ * directory has EXT4_LINK_MAX subdirs.
+ */
+static void ext4_dec_count(handle_t *handle, struct inode *inode)
+{
+	drop_nlink(inode);
+	if (S_ISDIR(inode->i_mode) && inode->i_nlink == 0)
+		inc_nlink(inode);
+}
+
+
 static int ext4_add_nondir(handle_t *handle,
 		struct dentry *dentry, struct inode *inode)
 {
@@ -1713,7 +1742,7 @@ static int ext4_mkdir(struct inode * dir
 	struct ext4_dir_entry_2 * de;
 	int err, retries = 0;
 
-	if (dir->i_nlink >= EXT4_LINK_MAX)
+	if (EXT4_DIR_LINK_MAX(dir))
 		return -EMLINK;
 
 retry:
@@ -1736,7 +1765,7 @@ retry:
 	inode->i_size = EXT4_I(inode)->i_disksize = inode->i_sb->s_blocksize;
 	dir_block = ext4_bread (handle, inode, 0, 1, &err);
 	if (!dir_block) {
-		drop_nlink(inode); /* is this nlink == 0? */
+		ext4_dec_count(handle, inode); /* is this nlink == 0? */

Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-13 Thread Amit K. Arora
On Fri, Jul 13, 2007 at 02:21:19PM +0100, Christoph Hellwig wrote:
 On Fri, Jul 13, 2007 at 06:17:55PM +0530, Amit K. Arora wrote:
   /*
  + * sys_fallocate - preallocate blocks or free preallocated blocks
  + * @fd: the file descriptor
  + * @mode: mode specifies the behavior of allocation.
  + * @offset: The offset within file, from where allocation is being
  + * requested. It should not have a negative value.
  + * @len: The amount of space in bytes to be allocated, from the offset.
  + *  This can not be zero or a negative value.
 
 kerneldoc comments are for in-kernel APIs, which syscalls aren't.  I'd say
 just remove this comment; the manpage is much better documentation anyway.

Ok. I will remove this entire comment.
 
  + * TBD Generic fallocate to be added for file systems that do not
  + *  support fallocate.
 
 Please remove the comment, adding a generic fallback in kernelspace is a
 very dumb idea as we already discussed long time ago.

  --- linux-2.6.22.orig/include/linux/fs.h
  +++ linux-2.6.22/include/linux/fs.h
  @@ -266,6 +266,21 @@ extern int dir_notify_enable;
   #define SYNC_FILE_RANGE_WRITE  2
   #define SYNC_FILE_RANGE_WAIT_AFTER 4
   
  +/*
  + * sys_fallocate modes
  + * Currently sys_fallocate supports two modes:
  + * FALLOC_ALLOCATE :   This is the preallocate mode, using which an application
  + * may request reservation of space for a particular file.
  + * The file size will be changed if the allocation is
  + * beyond EOF.
  + * FALLOC_RESV_SPACE : This is same as the above mode, with only one difference
  + * that the file size will not be modified.
  + */
  +#define FALLOC_FL_KEEP_SIZE    0x01 /* default is extend/shrink size */
  +
  +#define FALLOC_ALLOCATE        0
  +#define FALLOC_RESV_SPACE      FALLOC_FL_KEEP_SIZE
 
 Just remove FALLOC_ALLOCATE, 0 flags should be the default.  I'm also
 not sure there is any point in having two namespaces now that we have a
 flags-based ABI.

Ok. Since we have only one flag (FALLOC_FL_KEEP_SIZE) and we do not want
to declare the default mode (FALLOC_ALLOCATE), we can _just_ have this
flag and remove the other mode too (FALLOC_RESV_SPACE).
Is this what you are suggesting ?

 Also please don't add this to fs.h.  fs.h is a complete mess and the
 falloc flags are a new user ABI.  Add a linux/falloc.h instead which can
 be added to headers-y so the ABI constant can be exported to userspace.

Do we need a header file just to declare one flag - i.e.
FALLOC_FL_KEEP_SIZE (since now there is no point in declaring the two
modes)? If linux/fs.h is not a good place, will asm-generic/fcntl.h
be a sane place for this flag?

Thanks!
--
Regards,
Amit Arora


Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-13 Thread Peter Zijlstra
On Fri, 2007-07-13 at 02:05 -0700, Andrew Morton wrote:

 Except lockdep doesn't know about journal_start(), which has ranking
 requirements similar to a semaphore.  

Something like so?

Or can journal_stop() be done by a different task than the one that did
journal_start()? - in which case nothing much can be done :-/

This seems to boot... albeit I did not push it hard.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/jbd/transaction.c |9 +
 include/linux/jbd.h  |5 +
 2 files changed, 14 insertions(+)

Index: linux-2.6/fs/jbd/transaction.c
===
--- linux-2.6.orig/fs/jbd/transaction.c
+++ linux-2.6/fs/jbd/transaction.c
@@ -233,6 +233,8 @@ out:
return ret;
 }
 
+static struct lock_class_key jbd_handle_key;
+
 /* Allocate a new handle.  This should probably be in a slab... */
 static handle_t *new_handle(int nblocks)
 {
@@ -243,6 +245,8 @@ static handle_t *new_handle(int nblocks)
handle->h_buffer_credits = nblocks;
handle->h_ref = 1;
 
+   lockdep_init_map(&handle->h_lockdep_map, "jbd_handle", &jbd_handle_key, 0);
+
return handle;
 }
 
@@ -286,6 +290,9 @@ handle_t *journal_start(journal_t *journ
current->journal_info = NULL;
handle = ERR_PTR(err);
}
+
+   lock_acquire(&handle->h_lockdep_map, 0, 0, 0, 2, _THIS_IP_);
+
return handle;
 }
 
@@ -1411,6 +1418,8 @@ int journal_stop(handle_t *handle)
spin_unlock(&journal->j_state_lock);
}
 
+   lock_release(&handle->h_lockdep_map, 1, _THIS_IP_);
+
jbd_free_handle(handle);
return err;
 }
Index: linux-2.6/include/linux/jbd.h
===
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -30,6 +30,7 @@
 #include <linux/bit_spinlock.h>
 #include <linux/mutex.h>
 #include <linux/timer.h>
+#include <linux/lockdep.h>
 
 #include <asm/semaphore.h>
 #endif
@@ -405,6 +406,10 @@ struct handle_s
     unsigned int    h_sync:     1;  /* sync-on-close */
     unsigned int    h_jdata:    1;  /* force data journaling */
     unsigned int    h_aborted:  1;  /* fatal error on handle */
+
+#ifdef CONFIG_LOCKDEP
+   struct lockdep_map  h_lockdep_map;
+#endif
 };
 
 




Re: [PATCH 3/6][TAKE7] revalidate write permissions for fallocate

2007-07-13 Thread Amit K. Arora
On Fri, Jul 13, 2007 at 02:21:37PM +0100, Christoph Hellwig wrote:
> On Fri, Jul 13, 2007 at 06:18:47PM +0530, Amit K. Arora wrote:
> > From: David P. Quigley [EMAIL PROTECTED]
> >
> > Revalidate the write permissions for fallocate(2), in case security policy has
> > changed since the files were opened.
> >
> > Acked-by: James Morris [EMAIL PROTECTED]
> > Signed-off-by: David P. Quigley [EMAIL PROTECTED]
>
> This should be merged into the main falloc patch.

Ok. Will merge it...

--
Regards,
Amit Arora


-ENOSPC return from xattr functions

2007-07-13 Thread Andreas Dilger
Hello Andreas,
I noticed in ext3_xattr_block_set() that if i->value_len > sb->s_blocksize
it returns -ENOSPC.  However, in ext3_xattr_set_handle() it returns -ERANGE
when the name length is > 255.

It seems a bit misleading to return -ENOSPC when the filesystem isn't
actually out of space.  I think it would probably make more sense to
return -ERANGE or -EOVERFLOW in this case.
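
A minimal userspace sketch of the error mapping suggested above, for illustration only. The helper name xattr_check_lengths() and the constant XATTR_NAME_LEN_LIMIT are hypothetical and are not ext3 identifiers; the 255-byte name limit and the block-size comparison are the ones mentioned in this message.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical constant: the name-length limit quoted in the message. */
#define XATTR_NAME_LEN_LIMIT 255

/*
 * Hypothetical helper: reject oversized names and values with -ERANGE,
 * keeping -ENOSPC for the case where the filesystem is really full.
 */
int xattr_check_lengths(size_t name_len, size_t value_len, size_t blocksize)
{
    if (name_len > XATTR_NAME_LEN_LIMIT)
        return -ERANGE;
    if (value_len > blocksize)
        return -ERANGE;   /* the proposal: not -ENOSPC */
    return 0;
}
```

With this mapping a caller can tell "this attribute can never fit" apart from "the disk is full right now".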


Also, I don't know if you noticed, in "[EXT4 set 5][PATCH 1/1] expand inode
i_extra_isize to support features in larger inode", the discussion about
GFP_KERNEL allocations under xattr_sem.  It seems there is a risk of deadlock
in this case because we are inside a journal handle and might get blocked
waiting on a new journal_start() trying to flush memory.

Should these allocations be GFP_NOFS instead?  They shouldn't be a big
source of memory contention because the buffer is freed immediately at
the end of the function.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.

2007-07-13 Thread Andrew Morton
On Fri, 13 Jul 2007 16:00:48 +0530 Kalpak Shah [EMAIL PROTECTED] wrote:


> > > - if (inode->i_nlink >= EXT4_LINK_MAX)
> > > + if (EXT4_DIR_LINK_MAX(inode))
> > >         return -EMLINK;
> >
> > argh.  WHY_IS_EXT4_FULL_OF_UPPER_CASE_MACROS_WHICH_COULD_BE_IMPLEMENTED
> > as_lower_case_inlines?  Sigh.  It's all the old-timers, I guess.
> >
> > EXT4_DIR_LINK_MAX() is buggy: it evaluates its arg twice.
>
> #define EXT4_DIR_LINK_MAX(dir) (!is_dx(dir) && (dir)->i_nlink >= EXT4_LINK_MAX)
>
> This just checks if directory has hash indexing in which case we need not
> worry about EXT4_LINK_MAX subdir limit. If directory is not hash indexed then
> we will need to enforce a max subdir limit.
>
> Sorry, I didn't understand what is the problem with this macro?

Macros should never evaluate their argument more than once, because if they
do they will misbehave when someone passes them an
expression-with-side-effects:

struct inode *p = q;

EXT4_DIR_LINK_MAX(p++);

one expects `p' to have the value q+1 here.  But it might be q+2.

and

EXT4_DIR_LINK_MAX(some_function());

might cause some_function() to be called twice.


This is one of the many problems which gets fixed when we write code in C
rather than in cpp.
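
A small self-contained sketch of the effect described above, assuming a simplified stand-in for the inode: struct dir, LINK_MAX_SKETCH, DIR_LINK_MAX_MACRO() and dir_link_max() are hypothetical names made up for this illustration, not ext4 code. The macro mentions its argument twice, so an argument with side effects can run twice; the inline function evaluates it exactly once.

```c
#include <stdbool.h>

#define LINK_MAX_SKETCH 32000           /* hypothetical limit */

struct dir {                            /* hypothetical stand-in for an inode */
    bool indexed;
    unsigned int nlink;
};

/* Macro in the style under discussion: 'd' appears twice in the body. */
#define DIR_LINK_MAX_MACRO(d) \
    (!(d)->indexed && (d)->nlink >= LINK_MAX_SKETCH)

/* Inline replacement: the argument is evaluated once, at the call site. */
static inline bool dir_link_max(const struct dir *d)
{
    return !d->indexed && d->nlink >= LINK_MAX_SKETCH;
}

static int calls;                       /* counts evaluations of the argument */

static struct dir *next_dir(struct dir *d)
{
    calls++;                            /* the side effect */
    return d;
}

int count_macro_evaluations(struct dir *d)
{
    calls = 0;
    (void)DIR_LINK_MAX_MACRO(next_dir(d));
    return calls;
}

int count_inline_evaluations(struct dir *d)
{
    calls = 0;
    (void)dir_link_max(next_dir(d));
    return calls;
}
```

Because of `&&` short-circuiting, the macro calls next_dir() twice only when the first test passes, which is why the result "might be q+2" rather than always being q+2.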


[PATCH 0/5][TAKE8] fallocate system call

2007-07-13 Thread Amit K. Arora
This is the latest fallocate patchset and is based on 2.6.22.

* Following are the changes from TAKE7:
1) Updated the man page.
2) Merged revalidate write permissions patch with the main falloc patch.
3) Added linux/falloc.h and moved FALLOC_FL_KEEP_SIZE flag to it.
   Also removed the two modes (FALLOC_ALLOCATE and FALLOC_RESV_SPACE).
4) Removed comment above sys_fallocate definition.
5) Updated the testcase below to use FALLOC_FL_KEEP_SIZE flag instead
   of previous two modes.

* Following are the changes from TAKE6:
1) We now just have two modes (and no deallocation modes).
2) Updated the man page
3) Added a new patch submitted by David P. Quigley  (Patch 3/6).
4) Used EXT_INIT_MAX_LEN instead of 0x8000 in Patch 6/6.
5) Included at the end below is a small testcase to test fallocate.


* Following are the changes from TAKE5 to TAKE6:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can
   add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
   alignment in ext4 (Patch 6/6). Please refer to the following post:
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg02445.html
5) Renamed mode flags and values from FA_ to FALLOC_
6) Added manpage (updated version of the one initially submitted by
   David Chinner).


Todos:
------
1) Implementation on other architectures (other than i386, x86_64,
   and ppc64). s390(x) and ia64 patches are ready and will be pushed
   by platform maintainers when fallocate is in mainline.
2) A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3) Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4) Patch to e2fsprogs to recognize and display uninitialized extents.


The following patches are part of this set:
Patch 1/5 : manpage for fallocate
Patch 2/5 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/5 : ext4: fallocate support in ext4
Patch 4/5 : ext4: write support for preallocated blocks
Patch 5/5 : ext4: change for better extent-to-group alignment

**
Attached below is a small testcase to test fallocate. The __NR_fallocate will
need to be changed depending on the system call number in the kernel (it may
get changed due to merge) and also depending on the architecture.

--
Regards,
Amit Arora



#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

#include <linux/unistd.h>
#include <sys/vfs.h>
#include <sys/stat.h>

#define VERBOSE 0

/* Syscall number; must match the kernel and architecture under test. */
#define __NR_fallocate 324

#define FALLOC_FL_KEEP_SIZE 0x01

int do_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
  int ret;

  if (VERBOSE)
    printf("Trying to preallocate blocks (offset=%llu, len=%llu)\n",
           (unsigned long long)offset, (unsigned long long)len);
  ret = syscall(__NR_fallocate, fd, mode, offset, len);

  if (ret < 0) {
    printf("SYSCALL: received error %d, ret=%d\n", errno, ret);
    close(fd);
    return(1);
  }

  if (VERBOSE)
    printf("fallocate system call succeeded!  ret=%d\n", ret);

  return ret;
}

int test_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
  int ret, blocks;
  struct stat statbuf1, statbuf2;

  fstat(fd, &statbuf1);

  ret = do_fallocate(fd, mode, offset, len);

  fstat(fd, &statbuf2);

  /* check file size after preallocation */
  if (!mode) {
    /* default mode: the file must grow to offset + len if it was smaller */
    if (!ret && statbuf1.st_size < (offset + len) &&
        statbuf2.st_size != (offset + len)) {
      printf("Error: fallocate succeeded, but the file size did not "
             "change, where it should have!\n");
      ret = 1;
    }
  } else if (statbuf1.st_size != statbuf2.st_size) {
    /* FALLOC_FL_KEEP_SIZE: the file size must stay the same */
    printf("Error : File size changed, when it should not have!\n");
    ret = 1;
  }

  /* st_blocks is counted in 512-byte units */
  blocks = ((statbuf2.st_blocks - statbuf1.st_blocks) * 512) /
           statbuf2.st_blksize;

  /* Print report */
  printf("# FALLOCATE TEST REPORT #\n");
  printf("\tNew blocks preallocated = %d.\n", blocks);
  printf("\tNumber of bytes preallocated = %ld\n",
         (long)(blocks * statbuf2.st_blksize));
  printf("\tOld file size = %lld, New file size %lld.\n",
         (long long)statbuf1.st_size, (long long)statbuf2.st_size);
  printf("\tOld num blocks = %lld, New num blocks %lld.\n",
         (long long)(statbuf1.st_blocks * 512) / 1024,
         (long long)(statbuf2.st_blocks * 512) / 1024);

  return ret;
}


int do_write(int fd, loff_t offset, loff_t len)
{
  int ret;
  char *buf;

  buf = (char *)malloc(len);
  if (!buf) {
    printf("error: malloc failed.\n");
    return(-1);
  }

  if (VERBOSE)
    printf("Trying to write to file (offset=%llu, len=%llu)\n",
           (unsigned long long)offset, (unsigned long long)len);

  ret = lseek(fd, offset, SEEK_SET);
  if (ret != offset) {
    printf("lseek() failed error=%d, ret=%d\n", errno, ret);
    close(fd);
    return(-1);
  }

  ret = write(fd, buf, len);
  if (ret != len) {
    printf("write() failed error=%d, ret=%d\n", errno, ret);

[PATCH 1/5][TAKE8] manpage for fallocate

2007-07-13 Thread Amit K. Arora
Following is the modified version of the manpage originally submitted by
David Chinner. Please use `nroff -man fallocate.2 | less` to view.

Following changed from TAKE7:
* Removed FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes.
* Described only single flag for mode, i.e. FALLOC_FL_KEEP_SIZE.
* s/zero blocks/zeroed blocks/ as suggested by Dave.
* Included linux/falloc.h instead of fcntl.h.

Following changed from TAKE6 to TAKE7:
Included changes suggested by Heikki Orsila and Barry Naujok.


.TH fallocate 2
.SH NAME
fallocate \- manipulate file space
.SH SYNOPSIS
.nf
.B #include <linux/falloc.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t " len ");"
.SH DESCRIPTION
The
.B fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there is only one flag supported for the mode argument.
.TP
.B FALLOC_FL_KEEP_SIZE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space.  Even if the size of the file is less than
.IR offset + len ,
the file size is not changed. This allows allocation of zeroed blocks beyond
the end of file and is useful for optimising append workloads.
.PP
If
.B FALLOC_FL_KEEP_SIZE
flag is not specified in the mode argument, the default behavior of this system
call is almost the same as when this flag is passed. The only difference is that
on success, the file size will be changed if the
.IR offset + len
is greater than the file size. This default behavior closely resembles
.BR posix_fallocate (3)
and is intended as a method of optimally implementing this function.
.PP
.B fallocate
may allocate a larger range than was specified.
.SH RETURN VALUE
.B fallocate
returns zero on success, or an error number on failure.
Note that
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal to 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd .
.TP
.B ESPIPE
.I fd
refers to a pipe or FIFO.
.TP
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.TP
.B EINTR
A signal was caught during execution.
.TP
.B EIO
An I/O error occurred while reading from or writing to a file system.
.TP
.B EOPNOTSUPP
The mode is not supported on the file descriptor.
.SH AVAILABILITY
The
.B fallocate
system call is available since 2.6.XX
.SH SEE ALSO
.BR posix_fallocate (3),
.BR posix_fadvise (3),
.BR ftruncate (2).


[PATCH 2/5][TAKE8] fallocate() implementation in i386, x86_64 and powerpc

2007-07-13 Thread Amit K. Arora
From: Amit Arora [EMAIL PROTECTED]

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow
applications to preallocate space for any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called ->fallocate().
Applications can use this feature to avoid fragmentation to a certain
degree and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s), even if the
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Although it has the advantage of working
on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). File systems
can do this more efficiently within the kernel by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and, in case the kernel/filesystem does not implement it, fall
back to the current implementation of writing zeroes to the new blocks.
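
The fallback described above could look roughly like the following userspace sketch. It is not glibc's implementation. The names preallocate(), zero_fill_fallback() and preallocate_demo() are hypothetical, the raw syscall() invocation assumes a 64-bit Linux ABI where off_t fits in one register, and the fallback is simplified to write zeroes only beyond the current end of file so that existing data is never overwritten.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Slow path: extend the file by writing zeroes past the current EOF. */
static int zero_fill_fallback(int fd, off_t offset, off_t len)
{
    static const char zeros[4096];
    struct stat st;
    off_t end = offset + len;
    off_t pos;

    if (fstat(fd, &st) < 0)
        return -errno;
    /* Simplification: never touch bytes that already exist. */
    pos = st.st_size > offset ? st.st_size : offset;

    while (pos < end) {
        size_t chunk = sizeof(zeros);
        ssize_t n;

        if ((off_t)chunk > end - pos)
            chunk = (size_t)(end - pos);
        n = pwrite(fd, zeros, chunk, pos);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -errno;
        }
        if (n == 0)
            return -EIO;
        pos += n;
    }
    return 0;
}

/* Try the system call first; fall back only if it is not implemented. */
int preallocate(int fd, off_t offset, off_t len)
{
    if (offset < 0 || len <= 0)
        return -EINVAL;
#ifdef SYS_fallocate
    if (syscall(SYS_fallocate, fd, 0, offset, len) == 0)
        return 0;
    if (errno != ENOSYS && errno != EOPNOTSUPP)
        return -errno;      /* real failure, e.g. ENOSPC: do not retry */
#endif
    return zero_fill_fallback(fd, offset, len);
}

/* Demo helper: preallocate 'len' bytes in a temp file, return its size. */
long preallocate_demo(long len)
{
    char path[] = "/tmp/falloc-sketch-XXXXXX";
    struct stat st;
    long ret;
    int fd = mkstemp(path);

    if (fd < 0)
        return -errno;
    unlink(path);
    ret = preallocate(fd, 0, len);
    if (ret == 0)
        ret = fstat(fd, &st) == 0 ? (long)st.st_size : -errno;
    close(fd);
    return ret;
}
```

Falling back only on ENOSYS or EOPNOTSUPP matters: an error such as ENOSPC means the request really failed, and writing zeroes would only fail more slowly.
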
ToDos:
1. Implementation on other architectures (other than i386, x86_64,
   and ppc). Patches for s390(x) and ia64 are already available from
   previous posts, but it was decided that they should be added later
   once fallocate is in the mainline. Hence not including those patches
   in this take.
2. Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

CHANGELOG:
-
Following changed from TAKE7:
1. Added linux/falloc.h and moved FALLOC_FL_KEEP_SIZE flag to it.
2. Removed the two modes (FALLOC_ALLOCATE and FALLOC_RESV_SPACE).
3. Merged revalidate write permissions patch from David P. Quigley
   to this patch.
4. Deleted comment above sys_fallocate definition, as suggested by Christoph.


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+   .long sys_fallocate
Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
===
--- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
     return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+u32 lenhi, u32 lenlo)
+{
+    return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+                         ((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
                                       unsigned long low)
 {
Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
.quad compat_sys_signalfd
.quad compat_sys_timerfd
.quad sys_eventfd
+   .quad sys32_fallocate
 ia32_syscall_end:
Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -26,6 +26,7 @@
 #include <linux/syscalls.h>
 #include <linux/rcupdate.h>
 #include <linux/audit.h>
+#include <linux/falloc.h>
 
 int vfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
@@ -352,6 +353,64 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+    if (offset < 0 || len <= 0)
+        goto out;
+
+    /* Return error if mode is not supported */
+    ret = -EOPNOTSUPP;
+    if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
+        goto out;
+
+    ret = -EBADF;
+    file = fget(fd);
+    if (!file)
+        goto out;
+    if (!(file->f_mode & FMODE_WRITE))
+        goto out_fput;
+    /*
+     * Revalidate the write permissions, in case security policy has
+     * changed since the files were opened.
+     */
+    ret = security_file_permission(file, MAY_WRITE);
+    if (ret)
+        goto out_fput;
+
+    inode = file->f_path.dentry->d_inode;
+
+    ret = -ESPIPE;
+    if (S_ISFIFO(inode->i_mode))
+        goto out_fput;
+
+   ret = -ENODEV;
+   /*
+* Let individual file system 

[PATCH 3/5][TAKE8] ext4: fallocate support in ext4

2007-07-13 Thread Amit K. Arora
From: Amit Arora [EMAIL PROTECTED]

fallocate support in ext4

This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation. Current implementation only supports
preallocation for regular files (directories not supported as of date)
with extent maps. This patch does not support block-mapped files currently.
Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of
now.

CHANGELOG:
-
Following changed from TAKE7:
1. Removed usage of FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes and
   used FALLOC_FL_KEEP_SIZE mode flag instead.
2. Included  linux/falloc.h new header file, which defines above flag.


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -39,6 +39,7 @@
 #include <linux/quotaops.h>
 #include <linux/string.h>
 #include <linux/slab.h>
+#include <linux/falloc.h>
 #include <linux/ext4_fs_extents.h>
 #include <asm/uaccess.h>
 
@@ -282,7 +283,7 @@ static void ext4_ext_show_path(struct in
         } else if (path->p_ext) {
             ext_debug("  %d:%d:%llu ",
                       le32_to_cpu(path->p_ext->ee_block),
-                      le16_to_cpu(path->p_ext->ee_len),
+                      ext4_ext_get_actual_len(path->p_ext),
                       ext_pblock(path->p_ext));
         } else
             ext_debug("  []");
@@ -305,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
     for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
         ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-                  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+                  ext4_ext_get_actual_len(ex), ext_pblock(ex));
     }
     ext_debug("\n");
 }
@@ -425,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
     ext_debug("  -> %d:%llu:%d ",
             le32_to_cpu(path->p_ext->ee_block),
             ext_pblock(path->p_ext),
-            le16_to_cpu(path->p_ext->ee_len));
+            ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -686,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
     ext_debug("move %d:%llu:%d in new leaf %llu\n",
             le32_to_cpu(path[depth].p_ext->ee_block),
             ext_pblock(path[depth].p_ext),
-            le16_to_cpu(path[depth].p_ext->ee_len),
+            ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-    if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+    if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
             le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1115,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-    if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+    if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
         return 0;
 #ifdef AGGRESSIVE_TEST
     if (le16_to_cpu(ex1->ee_len) >= 4)
         return 0;
 #endif
 
-    if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1144,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int ret = 0;
 
     b1 = le32_to_cpu(newext->ee_block);
-    len1 = le16_to_cpu(newext->ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1191,8 +1204,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
int depth, len, err, next;
+   unsigned uninitialized = 0;
 
-    BUG_ON(newext->ee_len == 0);
+   BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);

[PATCH 4/5][TAKE8] ext4: write support for preallocated blocks

2007-07-13 Thread Amit K. Arora
From:  Amit Arora [EMAIL PROTECTED]

write support for preallocated blocks

This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.
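
The splitting step can be pictured with the following sketch, which only computes the block ranges of the resulting pieces. It is an illustration under stated assumptions, not the code from this patch: struct piece and split_uninit_extent() are hypothetical names, the write is assumed to start inside the extent, and the actual tree updates and merging are left out.

```c
enum piece_state { PIECE_UNINIT, PIECE_INIT };

struct piece {
    unsigned int start;         /* first logical block of the piece */
    unsigned int len;           /* number of blocks in the piece */
    enum piece_state state;
};

/*
 * Split the uninitialized extent [ee_block, ee_block + ee_len) around a
 * write of 'count' blocks starting at 'iblock'. Fills 'out' in block
 * order and returns the number of pieces (1, 2 or 3), or -1 if the write
 * does not start inside the extent.
 */
int split_uninit_extent(unsigned int ee_block, unsigned int ee_len,
                        unsigned int iblock, unsigned int count,
                        struct piece out[3])
{
    unsigned int ee_end = ee_block + ee_len;
    unsigned int w_end;
    int n = 0;

    if (ee_len == 0 || count == 0 || iblock < ee_block || iblock >= ee_end)
        return -1;

    /* Clamp the write to the end of this extent. */
    w_end = count > ee_end - iblock ? ee_end : iblock + count;

    if (iblock > ee_block)      /* untouched head stays uninitialized */
        out[n++] = (struct piece){ ee_block, iblock - ee_block, PIECE_UNINIT };

    /* the written range becomes initialized */
    out[n++] = (struct piece){ iblock, w_end - iblock, PIECE_INIT };

    if (w_end < ee_end)         /* untouched tail stays uninitialized */
        out[n++] = (struct piece){ w_end, ee_end - w_end, PIECE_UNINIT };

    return n;
}
```

One piece means the whole extent was written, two pieces mean the write touched one end of the extent, and three pieces mean the write landed in the middle.
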

CHANGELOG:
-
This patch did not change from TAKE7 (besides offsets ;).


Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1141,6 +1141,53 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the ex extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass ex - 1 as argument instead of ex.
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done = 0;
+   int uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+    while (ex < EXT_LAST_EXTENT(eh)) {
+        if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+            break;
+        /* merge with next extent! */
+        if (ext4_ext_is_uninitialized(ex))
+            uninitialized = 1;
+        ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+                + ext4_ext_get_actual_len(ex + 1));
+        if (uninitialized)
+            ext4_ext_mark_uninitialized(ex);
+
+        if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+            len = (EXT_LAST_EXTENT(eh) - ex - 1)
+                * sizeof(struct ext4_extent);
+            memmove(ex + 1, ex + 2, len);
+        }
+        eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+        merge_done = 1;
+        WARN_ON(eh->eh_entries == 0);
+        if (!eh->eh_entries)
+            ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+               "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+    }
+
+   return merge_done;
+}
+
+/*
  * check if a portion of the newext extent overlaps with an
  * existing extent.
  *
@@ -1328,25 +1375,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-    while (nearex < EXT_LAST_EXTENT(eh)) {
-        if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-            break;
-        /* merge with next extent! */
-        if (ext4_ext_is_uninitialized(nearex))
-            uninitialized = 1;
-        nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-                + ext4_ext_get_actual_len(nearex + 1));
-        if (uninitialized)
-            ext4_ext_mark_uninitialized(nearex);
-
-        if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-            len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-                * sizeof(struct ext4_extent);
-            memmove(nearex + 1, nearex + 2, len);
-        }
-        eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-        BUG_ON(eh->eh_entries == 0);
-    }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -2012,15 +2041,158 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (upto three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Somone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+   struct ext4_ext_path *path,
+   ext4_fsblk_t iblock,
+   unsigned long max_blocks)
+{
+   struct ext4_extent *ex, newex;
+   struct ext4_extent *ex1 = NULL;
+   struct ext4_extent *ex2 = NULL;
+   struct ext4_extent *ex3 = NULL;
+   struct ext4_extent_header *eh;
+   unsigned int allocated, ee_block, ee_len, depth;
+   ext4_fsblk_t newblock;
+   int err = 0;
+   int ret = 0;
+
+ 

[PATCH 5/5][TAKE8] ext4: change for better extent-to-group alignment

2007-07-13 Thread Amit K. Arora
From: Amit Arora [EMAIL PROTECTED]

Change on-disk format for extent to represent uninitialized/initialized extents

This change was suggested by Andreas Dilger. 
This patch changes the EXT_MAX_LEN value and extent code which marks/checks
uninitialized extents. With this change it will be possible to have
initialized extents with 2^15 blocks (earlier the max blocks we could have
was 2^15 - 1). This way we can have better extent-to-group alignment.
Now, maximum number of blocks we can have in an initialized extent is 2^15
and in an uninitialized extent is 2^15 - 1.
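
The encoding described above can be sketched as follows. The two constants are the ones this patch introduces; the helper names are hypothetical and work on a host-endian value, whereas the on-disk ee_len field is little-endian and is accessed through le16_to_cpu()/cpu_to_le16() in the real code.

```c
#include <stdint.h>

#define EXT_INIT_MAX_LEN    (1U << 15)               /* 32768 blocks */
#define EXT_UNINIT_MAX_LEN  (EXT_INIT_MAX_LEN - 1)   /* 32767 blocks */

/* An extent is uninitialized only if ee_len is strictly above 0x8000. */
int ee_len_is_uninitialized(uint16_t ee_len)
{
    return ee_len > EXT_INIT_MAX_LEN;
}

/* Number of blocks covered, whatever the initialized state. */
unsigned int ee_len_actual(uint16_t ee_len)
{
    return ee_len <= EXT_INIT_MAX_LEN ? ee_len
                                      : ee_len - EXT_INIT_MAX_LEN;
}

/* Set the top bit to flag a length of 1..32767 blocks as uninitialized. */
uint16_t ee_len_mark_uninitialized(uint16_t len)
{
    return (uint16_t)(len | EXT_INIT_MAX_LEN);
}
```

Marking a zero-length extent yields exactly 0x8000, which reads back as an initialized extent of 32768 blocks. That is the special case the patch guards against by not marking an extent uninitialized once all of its blocks have been removed.
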


CHANGELOG:
-
This patch did not change from TAKE7 (besides offsets ;).

Following changed from TAKE6 to TAKE7:
1. Taken care of Andreas's suggestion of using EXT_INIT_MAX_LEN instead of
   0x8000 at some places.

Signed-off-by: Amit Arora [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1107,7 +1107,7 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   unsigned short ext1_ee_len, ext2_ee_len;
+   unsigned short ext1_ee_len, ext2_ee_len, max_len;
 
/*
 * Make sure that either both extents are uninitialized, or
@@ -1116,6 +1116,11 @@ ext4_can_extents_be_merged(struct inode 
if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
return 0;
 
+   if (ext4_ext_is_uninitialized(ex1))
+   max_len = EXT_UNINIT_MAX_LEN;
+   else
+   max_len = EXT_INIT_MAX_LEN;
+
ext1_ee_len = ext4_ext_get_actual_len(ex1);
ext2_ee_len = ext4_ext_get_actual_len(ex2);
 
@@ -1128,7 +1133,7 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-    if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+    if (ext1_ee_len + ext2_ee_len > max_len)
         return 0;
 #ifdef AGGRESSIVE_TEST
     if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1815,7 +1820,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
         ex->ee_block = cpu_to_le32(block);
         ex->ee_len = cpu_to_le16(num);
-        if (uninitialized)
+        /*
+         * Do not mark uninitialized if all the blocks in the
+         * extent have been removed.
+         */
+        if (uninitialized && num)
             ext4_ext_mark_uninitialized(ex);
 
err = ext4_ext_dirty(handle, inode, path + depth);
@@ -2308,6 +2317,19 @@ int ext4_ext_get_blocks(handle_t *handle
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
 
+   /*
+* See if request is beyond maximum number of blocks we can have in
+* a single extent. For an initialized extent this limit is
+* EXT_INIT_MAX_LEN and for an uninitialized extent this limit is
+* EXT_UNINIT_MAX_LEN.
+*/
+    if (max_blocks > EXT_INIT_MAX_LEN &&
+        create != EXT4_CREATE_UNINITIALIZED_EXT)
+        max_blocks = EXT_INIT_MAX_LEN;
+    else if (max_blocks > EXT_UNINIT_MAX_LEN &&
+         create == EXT4_CREATE_UNINITIALIZED_EXT)
+        max_blocks = EXT_UNINIT_MAX_LEN;
+
/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
newex.ee_block = cpu_to_le32(iblock);
newex.ee_len = cpu_to_le16(max_blocks);
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru
 
 #define EXT_MAX_BLOCK  0x
 
-#define EXT_MAX_LEN    ((1UL << 15) - 1)
+/*
+ * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
+ * initialized extent. This is 2^15 and not (2^16 - 1), since we use the
+ * MSB of ee_len field in the extent datastructure to signify if this
+ * particular extent is an initialized extent or an uninitialized (i.e.
+ * preallocated).
+ * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an
+ * uninitialized extent.
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an
+ * uninitialized one. In other words, if MSB of ee_len is set, it is an
+ * uninitialized extent with only one special scenario when ee_len = 0x8000.
+ * In this case we can not have an uninitialized extent of zero length and
+ * thus we make it as a special case of initialized extent with 0x8000 length.
+ * This way we get better extent-to-group alignment for initialized extents.
+ * Hence, the maximum number of blocks we can have in an *initialized*
+ * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1 

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-13 Thread Zach Brown
> I fear the consequences of this change :(

I love it.  In the past I've lost time by working with patches which
didn't quite realize that ext3 holds a transaction open during
->direct_IO.

> Oh well, please keep it alive, maybe beat on it a bit, resend it
> later on?

I can test the patch to make sure that it catches mistakes I've made in
the past.  Peter, do you have any interest in seeing how far we can get
at tracking lock_page()?  I'm not holding my breath, but any little bit
would probably help.

- z