Re: [EXT4 set 2][PATCH 2/5] cleanups: Add extent sanity checks

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:36:22 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> with the patch all headers are checked. the code should become
> more resistant to on-disk corruptions. needless BUG_ON() have
> been removed. please, review for inclusion.
> 
> ...

> @@ -269,6 +239,70 @@
>   return size;
>  }
>  
> +static inline int
> +ext4_ext_max_entries(struct inode *inode, int depth)

Please remove the `inline'.

`inline' is usually wrong.  It is wrong here.  If this function has a
single callsite (it has) then the compiler will inline it.  If the function
later gains more callsites then the compiler will know (correctly) not to
inline it.

We hope.  If we find the compiler still inlines a function as large as this
one then we'd need to use noinline and complain at the gcc developers.

> +{
> + int max;
> +
> + if (depth == ext_depth(inode)) {
> + if (depth == 0)
> + max = ext4_ext_space_root(inode);
> + else
> + max = ext4_ext_space_root_idx(inode);
> + } else {
> + if (depth == 0)
> + max = ext4_ext_space_block(inode);
> + else
> + max = ext4_ext_space_block_idx(inode);
> + }
> +
> + return max;
> +}
> +
> +static int __ext4_ext_check_header(const char *function, struct inode *inode,
> + struct ext4_extent_header *eh,
> + int depth)
> +{
> + const char *error_msg = NULL;

Unneeded initialisation.

> + int max = 0;
> +
> + if (unlikely(eh->eh_magic != EXT4_EXT_MAGIC)) {
> + error_msg = "invalid magic";
> + goto corrupted;
> + }
> + if (unlikely(le16_to_cpu(eh->eh_depth) != depth)) {
> + error_msg = "unexpected eh_depth";
> + goto corrupted;
> + }
> + if (unlikely(eh->eh_max == 0)) {
> + error_msg = "invalid eh_max";
> + goto corrupted;
> + }
> + max = ext4_ext_max_entries(inode, depth);
> + if (unlikely(le16_to_cpu(eh->eh_max) > max)) {
> + error_msg = "too large eh_max";
> + goto corrupted;
> + }
> + if (unlikely(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max))) {
> + error_msg = "invalid eh_entries";
> + goto corrupted;
> + }
> + return 0;
> +
> +corrupted:
> + ext4_error(inode->i_sb, function,
> + "bad header in inode #%lu: %s - magic %x, "
> + "entries %u, max %u(%u), depth %u(%u)",
> + inode->i_ino, error_msg, le16_to_cpu(eh->eh_magic),
> + le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max),
> + max, le16_to_cpu(eh->eh_depth), depth);
> +
> + return -EIO;
> +}
> +
>
> ...
>
> + i = depth = ext_depth(inode);
>

Mulitple assignments like this are somewhat unpopular from the coding-style
POV.  

We have a local variable called `i' which is not used as a simple counter
and which has no explanatory comment.  This is plain bad.  Please find a
better name for this variable.  Review the code for other such lack of
clarity - I'm sure there's more.


> - BUG_ON(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max));
> - BUG_ON(eh->eh_magic != EXT4_EXT_MAGIC);

Yeah, this patch improves things a lot.

How well-tested is it?  Have any of these errors actually been triggered?

>   path[i].p_hdr = ext_block_hdr(path[i].p_bh);
> - if (ext4_ext_check_header(__FUNCTION__, inode,
> - path[i].p_hdr)) {
> - err = -EIO;
> - goto out;
> - }
>   }
>  
> - BUG_ON(le16_to_cpu(path[i].p_hdr->eh_entries)
> -> le16_to_cpu(path[i].p_hdr->eh_max));
> - BUG_ON(path[i].p_hdr->eh_magic != EXT4_EXT_MAGIC);
> -
>   if (!path[i].p_idx) {
>   /* this level hasn't been touched yet */
>   path[i].p_idx = EXT_LAST_INDEX(path[i].p_hdr);
> @@ -1873,17 +1890,24 @@
>   i, EXT_FIRST_INDEX(path[i].p_hdr),
>   path[i].p_idx);
>   if (ext4_ext_more_to_rm(path + i)) {
> + struct buffer_head *bh;
>   /* go to the next level */
>   ext_debug("move to level %d (block %llu)\n",
> i + 1, idx_pblock(path[i].p_idx));
>   memset(path + i + 1, 0, sizeof(*path));
> - path[i+1].p_bh =
> - sb_bread(sb, idx_pblock(path[i].p_idx));
> - if (!path[i+1].p_bh) {
> + bh = sb_bread(sb, idx_pblock(path[i].p_idx));
> + if (!bh) {
>   /* should we reset i_size? */
>   

Re: [EXT4 set 4][PATCH 2/5] i_version: Add hi 32 bit inode version on ext4 on-disk inode

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:37:16 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> This patch adds a 32-bit i_version_hi field to ext4_inode, which can be used 
> for 64-bit inode versions. This field will store the higher 32 bits of the 
> version, while Jean Noel's patch has added support to store the lower 32-bits 
> in osd1.linux1.l_i_version.

Please wordwrap this changelog entry to less than 80 columns.  Well, less
than 258, anyway ;)

> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]>
> ---
> Index: linux-2.6.21/include/linux/ext4_fs.h
> ===
> --- linux-2.6.21.orig/include/linux/ext4_fs.h
> +++ linux-2.6.21/include/linux/ext4_fs.h
> @@ -342,6 +342,7 @@ struct ext4_inode {
>   __le32  i_atime_extra;  /* extra Access time  (nsec << 2 | epoch) */
>   __le32  i_crtime;   /* File Creation time */
>   __le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
> + __le32  i_version_hi;   /* high 32 bits for 64-bit version */
>  };

Aren't there forward- backward-compatibility issues here?  How does the
filesystem driver work out whether this field is present and valid?

The changelog should describe this design issue.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:36:56 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> This patch is a spinoff of the old nanosecond patches.

I don't know what the "old nanosecond patches" are.  A link to a suitable
changlog for those patches would do in a pinch.  Preferable would be to
write a proper changelog for this patch.

> It includes some cleanups and addition of a creation timestamp. The
> EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with
> s_{min, want}_extra_isize fields in struct ext3_super_block.
> 
> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]>
> Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
> Signed-off-by: Dave Kleikamp <[EMAIL PROTECTED]>
> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> 
> Index: linux-2.6.22-rc4/fs/ext4/ialloc.c

Please include diffstat output when preparing patches.

> +
> +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field)\
> + ((offsetof(typeof(*ext4_inode), field) +\
> +   sizeof((ext4_inode)->field))  \
> + <= (EXT4_GOOD_OLD_INODE_SIZE +  \
> + (einode)->i_extra_isize))   \

Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers
under what circumstances something will not fit in an inode and what the
consequences of this are.

> +static inline __le32 ext4_encode_extra_time(struct timespec *time)
> +{
> +   return cpu_to_le32((sizeof(time->tv_sec) > 4 ?
> +time->tv_sec >> 32 : 0) |
> +((time->tv_nsec << 2) & EXT4_NSEC_MASK));
> +}
> +
> +static inline void ext4_decode_extra_time(struct timespec *time, __le32 
> extra)
> +{
> +   if (sizeof(time->tv_sec) > 4)
> +time->tv_sec |= (__u64)(le32_to_cpu(extra) & EXT4_EPOCH_MASK)
> +<< 32;
> +   time->tv_nsec = (le32_to_cpu(extra) & EXT4_NSEC_MASK) >> 2;
> +}

Consider uninlining these functions.

> +#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode)
>\
> +do {\
> + (raw_inode)->xtime = cpu_to_le32((inode)->xtime.tv_sec);   \
> + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
> + (raw_inode)->xtime ## _extra = \
> + ext4_encode_extra_time(&(inode)->xtime);   \
> +} while (0)
> +
> +#define EXT4_EINODE_SET_XTIME(xtime, einode, raw_inode)  
>\
> +do {\
> + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))  \
> + (raw_inode)->xtime = cpu_to_le32((einode)->xtime.tv_sec);  \
> + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
> + (raw_inode)->xtime ## _extra = \
> + ext4_encode_extra_time(&(einode)->xtime);  \
> +} while (0)
> +
> +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)
>\
> +do {\
> + (inode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);   \
> + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
> + ext4_decode_extra_time(&(inode)->xtime,\
> +raw_inode->xtime ## _extra);\
> +} while (0)
> +
> +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode)  
>\
> +do {\
> + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))  \
> + (einode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);  \
> + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
> + ext4_decode_extra_time(&(einode)->xtime,   \
> +raw_inode->xtime ## _extra);\
> +} while (0)

Ugly.  I expect these could be implemented as plain old C functions. 
Caller could pass in the address of the ext4_inode field which the function
is to operate upon.

>  #if defined(__KERNEL__) || defined(__linux__)

(What's the __linux__ for?)

>  #define i_reserved1  osd1.linux1.l_i_reserved1
>  #define i_frag   osd2.linux2.l_i_frag
> @@ -539,6 +603,13 @@
>   return container_of(inode, struct ext4_inode_info, vfs_inode);
>  }
>  
> +static inline struct timespec ext4_current_time(struct inode *inode)
> +{
> + return (inode->i_sb->s_time_gran < NSEC_PER_SEC) ?
> + current_fs_time(inode->i_sb) : CURRENT_TIME_SEC;
> +}

Now, I've forgotten how this works.  Remind me, please.  Can ext4
filesystems ever have one-second timestamp granularity?  If so, how 

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:37:04 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> This patch converts the 32-bit i_version in the generic inode to a 64-bit
> i_version field.
> 

That's obvious from the patch.  But what was the reason for making this
(unrelated to ext4) change?

Please update the changelog for this.

> Index: linux-2.6.21/include/linux/fs.h
> ===
> --- linux-2.6.21.orig/include/linux/fs.h
> +++ linux-2.6.21/include/linux/fs.h
> @@ -549,7 +549,7 @@ struct inode {
>   uid_t   i_uid;
>   gid_t   i_gid;
>   dev_t   i_rdev;
> - unsigned long   i_version;
> + u64 i_version;
>   loff_t  i_size;
>  #ifdef __NEED_I_SIZE_ORDERED
>   seqcount_t  i_size_seqcount;

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:36:48 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> > On Jun 07, 2007  23:45 -0500, Jose R. Santos wrote:
> > > The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
> > > create_proc_entry() does not do lookups on file names with more that one
> > > directory deep.  This causes the entry creation to fail and hence, no proc
> > > file is created.  This patch moves the file to /proc/jbd2-degug.
> > > 
> > > The file could be move to /proc/fs/jbd2/jbd2-debug, but it would require
> > > some minor alterations to the jbd-stats patch.
> > 
> > I don't think we really want to be adding top-level files in /proc.
> > What about using the "debugfs" filesystem (not to be confused with
> > the e2fsprogs 'debugfs' command)?
> 
> How about this then?  Moved the file to use debugfs as well as having
> the nice effect of removing more lines than what it adds.
> 
> Signed-off-by: Jose R. Santos <[EMAIL PROTECTED]>

Please clean up the changelog.

The changelog should include information about the location and the content
of these debugfs files.  it should provide any instructions which users
will need to be able to create and use those files.

Alternatively (and preferably) do this via an update to
Documentation/filesystems/ext4.txt.

>  fs/jbd2/journal.c|   6220 +42 -0 !
>  include/linux/jbd2.h |21 + 1 - 0 !
>  2 files changed, 21 insertions(+), 43 deletions(-)

Again, this patch isn't in Ted's kernel.org directory and hasn't been in -mm.

Apart from the lack of testing and review which this causes, it means I
can't just do `pushpatch name-of-this-patch' and look at it in tkdiff.  So
I squint at the diff, but that's harder when the diff wasn't prepared with
`diff -p'.  Oh well.


> Index: linux-2.6.22-rc4/fs/jbd2/journal.c
> ===
> --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c   2007-06-11 16:16:18.0 
> -0700
> +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 
> -0700
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -1954,60 +1955,37 @@
>   * /proc tunables
>   */
>  #if defined(CONFIG_JBD2_DEBUG)
> -int jbd2_journal_enable_debug;
> +u16 jbd2_journal_enable_debug;
>  EXPORT_SYMBOL(jbd2_journal_enable_debug);
>  #endif
>  
> -#if defined(CONFIG_JBD2_DEBUG) && defined(CONFIG_PROC_FS)
> +#if defined(CONFIG_JBD2_DEBUG) && defined(CONFIG_DEBUG_FS)

Has this been compile-tested with CONFIG_DEBUGFS=n?

>  
> -#define create_jbd_proc_entry() do {} while (0)
> -#define jbd2_remove_jbd_proc_entry() do {} while (0)
> +#define jbd2_create_debugfs_entry() do {} while (0)
> +#define jbd2_remove_debugfs_entry() do {} while (0)

I suggest that these be converted to (preferable) inline functions while
you're there.

>  #endif
>  
> @@ -2067,7 +2045,7 @@
>   ret = journal_init_caches();
>   if (ret != 0)
>   jbd2_journal_destroy_caches();
> - create_jbd_proc_entry();
> + jbd2_create_debugfs_entry();
>   return ret;
>  }
>  
> @@ -2078,7 +2056,7 @@
>   if (n)
>   printk(KERN_EMERG "JBD: leaked %d journal_heads!\n", n);
>  #endif
> - jbd2_remove_jbd_proc_entry();
> + jbd2_remove_debugfs_entry();
>   jbd2_journal_destroy_caches();
>  }
>  
> Index: linux-2.6.22-rc4/include/linux/jbd2.h
> ===
> --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 
> 16:16:18.0 -0700
> +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 
> -0700
> @@ -57,7 +57,7 @@
>   * CONFIG_JBD2_DEBUG is on.
>   */
>  #define JBD_EXPENSIVE_CHECKING

JBD2?

> -extern int jbd2_journal_enable_debug;
> +extern u16 jbd2_journal_enable_debug;

Why was this made 16-bit?  To save 2 bytes?  Could have saved 3 if we're
going to do that.


Shoudln't all this debug info be a per-superblock thing rather than
kernel-wide?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for >16TB ext4 fs

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:36:32 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more
> than 32bit block sizes during mount time.  This ensure proper record
> lenth when writing to the journal.

This patch isn't in Ted's kernel.org directory and hasn't been in -mm. 
Where did it come from?  Is something amiss with ext4 patch management?

> Signed-off-by: Jose R. Santos <[EMAIL PROTECTED]>
> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> Signed-off-by: Laurent Vivier <[EMAIL PROTECTED]>
> ---
>  fs/ext4/super.c |   11 +++
>  1 file changed, 11 insertions(+)
> 
> Index: linux-2.6.22-rc4/fs/ext4/super.c
> ===
> --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 16:15:54.0 
> -0700
> +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-11 16:16:10.0 -0700
> @@ -1804,6 +1804,13 @@

Please prepare patches using `diff -p'

>   goto failed_mount3;
>   }
>  
> + if (ext4_blocks_count(es) > 0xULL &&
> + !jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0,
> +JBD2_FEATURE_INCOMPAT_64BIT)) {
> + printk(KERN_ERR "ext4: Failed to set 64-bit journal feature\n");
> + goto failed_mount4;
> + }

It is not appropriate for the text "ext4" to appear in a JBD2 message.

>   /* We have now updated the journal if required, so we can
>* validate the data journaling mode. */
>   switch (test_opt(sb, DATA_FLAGS)) {


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 2][PATCH 1/5] cleanups: Propagate some i_flags to disk

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:36:12 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into
> ext4-specific i_flags. Hence, when someone sets these flags via a different
> interface than ioctl, they are stored correctly.
> 

This changelog is inadequate.  I had to hunt down the equivalent ext3
patch's changelog to understand the reasons for this change.  Please update
this patch's changelog using the below: 

ext3: copy i_flags to inode flags on write

A patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc.  from
i_flags to EXT3_I(inode)->i_flags when inode is written to disk.  The same
thing is done on GETFLAGS ioctl.

Quota code changes these flags on quota files (to make it harder for
sysadmin to screw himself) and these changes were not correctly propagated
into the filesystem (especially, lsattr did not show them and users were
wondering...).

Propagate flags such as S_APPEND, S_IMMUTABLE, etc.  from i_flags into
ext3-specific i_flags.  Hence, when someone sets these flags via a
different interface than ioctl, they are stored correctly.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:36:01 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> Turn on extents feature by default in ext4 filesystem. User could use
> -o noextents to turn it off.
> 

Oh, there you go.

> 
> Index: linux-2.6.22-rc4/fs/ext4/super.c
> ===
> --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:22.0 
> -0700
> +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-11 17:03:09.0 -0700
> @@ -1546,6 +1546,12 @@
>  
>   set_opt(sbi->s_mount_opt, RESERVATION);
>  
> + /*
> +  * turn on extents feature by default in ext4 filesystem
> +  * User -o noextents to turn it off
> +  */
> + set_opt (sbi->s_mount_opt, EXTENTS);
> +

Broken coding style.

Please feed all the ext4 patches through scripts/checkpatch.pl (preferably
version 0.07 - see Andy's patch on lkml) and then consider addressing the
(quite large) number of mistakes which are detected.


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 1][PATCH 1/2] Add noextents mount option

2007-07-10 Thread Andrew Morton
On Sun, 01 Jul 2007 03:35:48 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> Add a mount option to turn off extents.

Please update the changelog to describe the reason for making this change.


> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> ---
> Index: linux-2.6.22-rc4/fs/ext4/super.c
> ===
> --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:18.0 
> -0700
> +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-11 17:02:22.0 -0700
> @@ -725,7 +725,7 @@
>   Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
>   Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
>   Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
> - Opt_grpquota, Opt_extents,
> + Opt_grpquota, Opt_extents, Opt_noextents,
>  };
>  
>  static match_table_t tokens = {
> @@ -776,6 +776,7 @@
>   {Opt_usrquota, "usrquota"},
>   {Opt_barrier, "barrier=%u"},
>   {Opt_extents, "extents"},
> + {Opt_noextents, "noextents"},
>   {Opt_err, NULL},
>   {Opt_resize, "resize"},
>  };
> @@ -,6 +1112,9 @@
>   case Opt_extents:
>   set_opt (sbi->s_mount_opt, EXTENTS);
>   break;
> + case Opt_noextents:
> + clear_opt (sbi->s_mount_opt, EXTENTS);
> + break;
>   default:
>   printk (KERN_ERR
>   "EXT4-fs: Unrecognized mount option \"%s\" "
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix error handling in ext3_create_journal

2007-07-03 Thread Andrew Morton
On Mon, 2 Jul 2007 00:11:11 +0200
Borislav Petkov <[EMAIL PROTECTED]> wrote:

> 
> ---
> From: Borislav Petkov <[EMAIL PROTECTED]>
> 
> Fix error handling in ext3_create_journal according to kernel conventions.
> 
> Signed-off-by: Borislav Petkov <[EMAIL PROTECTED]>
> --
> 
> Index: linux-2.6.22-rc6/fs/ext3/super.c
> ===
> --- linux-2.6.22-rc6/fs/ext3/super.c.orig 2007-07-01 21:12:51.0 
> +0200
> +++ linux-2.6.22-rc6/fs/ext3/super.c  2007-07-01 21:14:32.0 +0200
> @@ -2075,6 +2075,7 @@
>  unsigned int journal_inum)
>  {
>   journal_t *journal;
> + int err;
>  
>   if (sb->s_flags & MS_RDONLY) {
>   printk(KERN_ERR "EXT3-fs: readonly filesystem when trying to "
> @@ -2082,13 +2083,15 @@
>   return -EROFS;
>   }
>  
> - if (!(journal = ext3_get_journal(sb, journal_inum)))
> + journal = ext3_get_journal(sb, journal_inum);
> + if (!journal)
>   return -EINVAL;
>  
>   printk(KERN_INFO "EXT3-fs: creating new journal on inode %u\n",
>  journal_inum);
>  
> - if (journal_create(journal)) {
> + err = journal_create(journal);
> + if (err) {
>   printk(KERN_ERR "EXT3-fs: error creating journal.\n");
>   journal_destroy(journal);
>   return -EIO;

Please prepare the equivalent patch for ext4.  Without that, it'd probably
be better to avoid applying the ext3 patch: there are advantages to keeping
the two in sync where possible.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Teach do_mpage_readpage() about unwritten buffers

2007-07-02 Thread Andrew Morton
On Tue, 3 Jul 2007 11:10:19 +1000 David Chinner <[EMAIL PROTECTED]> wrote:

> Teach do_mpage_readpage() about unwritten extents so we can
> always map them in get_blocks rather than they are are holes on
> read. Allows setup_swap_extents() to use preallocated files on XFS
> filesystems for swap files without ever needing to convert them.
> 
> Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]>
> 
> ---
>  fs/mpage.c  |5 +++--
>  fs/xfs/linux-2.6/xfs_aops.c |   13 +++--
>  2 files changed, 6 insertions(+), 12 deletions(-)
> 
> Index: 2.6.x-xfs-new/fs/mpage.c
> ===
> --- 2.6.x-xfs-new.orig/fs/mpage.c 2007-05-29 16:17:59.0 +1000
> +++ 2.6.x-xfs-new/fs/mpage.c  2007-06-27 22:39:35.568852270 +1000
> @@ -207,7 +207,8 @@ do_mpage_readpage(struct bio *bio, struc
>* Map blocks using the result from the previous get_blocks call first.
>*/
>   nblocks = map_bh->b_size >> blkbits;
> - if (buffer_mapped(map_bh) && block_in_file > *first_logical_block &&
> + if (buffer_mapped(map_bh) && !buffer_unwritten(map_bh) &&
> + block_in_file > *first_logical_block &&
>   block_in_file < (*first_logical_block + nblocks)) {
>   unsigned map_offset = block_in_file - *first_logical_block;
>   unsigned last = nblocks - map_offset;
> @@ -242,7 +243,7 @@ do_mpage_readpage(struct bio *bio, struc
>   *first_logical_block = block_in_file;
>   }
>  
> - if (!buffer_mapped(map_bh)) {
> + if (!buffer_mapped(map_bh) || buffer_unwritten(map_bh)) {
>   fully_mapped = 0;
>   if (first_hole == blocks_per_page)
>   first_hole = page_block;
> Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c
> ===
> --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c2007-06-05 
> 22:14:39.0 +1000
> +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-06-27 22:39:29.545636749 
> +1000
> @@ -1340,16 +1340,9 @@ __xfs_get_blocks(
>   return 0;
>  
>   if (iomap.iomap_bn != IOMAP_DADDR_NULL) {
> - /*
> -  * For unwritten extents do not report a disk address on
> -  * the read case (treat as if we're reading into a hole).
> -  */
> - if (create || !(iomap.iomap_flags & IOMAP_UNWRITTEN)) {
> - xfs_map_buffer(bh_result, &iomap, offset,
> -inode->i_blkbits);
> - }
> - if (create && (iomap.iomap_flags & IOMAP_UNWRITTEN)) {
> - if (direct)
> + xfs_map_buffer(bh_result, &iomap, offset, inode->i_blkbits);
> + if (iomap.iomap_flags & IOMAP_UNWRITTEN) {
> + if (create && direct)
>   bh_result->b_private = inode;
>   set_buffer_unwritten(bh_result);
>   }

Seems sane, although one does wonder whether it's a worthy tradeoff.  We
add additional overhead to readpage[s]() just to avoid some IO during
mkswap?

Also, I wonder what ext4 does in this situation?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: fallocate support for bitmap-based files

2007-06-29 Thread Andrew Morton
On Fri, 29 Jun 2007 16:55:25 -0400
Theodore Tso <[EMAIL PROTECTED]> wrote:

> On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote:
> > 
> > Guys, Mike and Sreenivasa at google are looking into implementing
> > fallocate() on ext2.  Of course, any such implementation could and should
> > also be portable to ext3 and ext4 bitmapped files.
> 
> What's the eventual goal of this work?  Would it be for mainline use,
> or just something that would be used internally at Google?

Mainline, preferably.

>  I'm not
> particularly ennthused about supporting two ways of doing fallocate();
> one for ext4 and one for bitmap-based files in ext2/3/4.  Is the
> benefit reallyworth it?

umm, it's worth it if you don't want to wear the overhead of journalling,
and/or if you don't want to wait on the, err, rather slow progress of ext4.

> What I would suggest, which would make much easier, is to make this be
> an incompatible extensions (which you as you point out is needed for
> security reasons anyway) and then steal the high bit from the block
> number field to indicate whether or not the block has been initialized
> or not.  That way you don't end up having to seek to a potentially
> distant part of the disk to check out the bitmap.  Also, you don't
> have to worry about how to recover if the "block initialized bitmap"
> inode gets smashed.  
> 
> The downside is that it reduces the maximum size of the filesystem
> supported by ext2 by a factor of two.  But, there are at least two
> patch series floating about that promise to allow filesystem block
> sizes > than PAGE_SIZE which would allow you to recover the maximum
> size supported by the filesytem.
> 
> Furthermore, I suspect (especially after listening to a very fasting
> Usenix Invited Talk by Jeffery Dean, a fellow from Google two weeks
> ago) that for many of Google's workloads, using a filesystem blocksize
> of 16K or 32K might not be a bad thing in any case.
> 
> It would be a lot simpler
> 

Hadn't thought of that.

Also, it's unclear to me why google is going this way rather than using
(perhaps suitably-tweaked) ext2 reservations code.

Because the stock ext2 block allcoator sucks big-time.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-29 Thread Andrew Morton
On Fri, 29 Jun 2007 11:50:04 -0400
Mingming Caoc <[EMAIL PROTECTED]> wrote:

> I think the ext4 patch queue is in good shape now.

Which ext4 patches are you intending to merge into 2.6.23?

Please send all those out to lkml for review?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


fallocate support for bitmap-based files

2007-06-29 Thread Andrew Morton

Guys, Mike and Sreenivasa at google are looking into implementing
fallocate() on ext2.  Of course, any such implementation could and should
also be portable to ext3 and ext4 bitmapped files.

I believe that Sreenivasa will mainly be doing the implementation work.


The basic plan is as follows:

- Create (with tune2fs and mke2fs) a hidden file using one of the
  reserved inode numbers.  That file will be sized to have one bit for each
  block in the partition.  Let's call this the "unwritten block file".

  The unwritten block file will be initialised with all-zeroes

- at fallocate()-time, allocate the blocks to the user's file (in some
  yet-to-be-determined fashion) and, for each one which is uninitialised,
  set its bit in the unwritten block file.  The set bit means "this block
  is uninitialised and needs to be zeroed out on read".

- truncate() would need to clear out set-bits in the unwritten blocks file.

- When the fs comes to read a block from disk, it will need to consult
  the unwritten blocks file to see if that block should be zeroed by the
  CPU.

- When the unwritten-block is written to, its bit in the unwritten blocks
  file gets zeroed.

- An obvious efficiency concern: if a user file has no unwritten blocks
  in it, we don't need to consult the unwritten blocks file.

  Need to work out how to do this.  An obvious solution would be to have
  a number-of-unwritten-blocks counter in the inode.  But do we have space
  for that?

  (I expect google and others would prefer that the on-disk format be
  compatible with legacy ext2!)

- One concern is the following scenario:

  - Mount fs with "new" kernel, fallocate() some blocks to a file.

  - Now, mount the fs under "old" kernel (which doesn't understand the
unwritten blocks file).

- This kernel will be able to read uninitialised data from that
  fallocated-to file, which is a security concern.

  - Now, the "old" kernel writes some data to a fallocated block.  But
this kernel doesn't know that it needs to clear that block's flag in
the unwritten blocks file!

  - Now mount that fs under the "new" kernel and try to read that file.
 The flag for the block is set, so this kernel will still zero out the
data on a read, thus corrupting the user's data

  So how to fix this?  Perhaps with a per-inode flag indicating "this
  inode has unwritten blocks".  But to fix this problem, we'd require that
  the "old" kernel clear out that flag.

  Can anyone propose a solution to this?

  Ah, I can!  Use the compatibility flags in such a way as to prevent the
  "old" kernel from mounting this filesystem at all.  To mount this fs
  under an "old" kernel the user will need to run some tool which will

  - read the unwritten blocks file

  - for each set-bit in the unwritten blocks file, zero out the
corresponding block

  - zero out the unwritten blocks file

  - rewrite the superblock to indicate that this fs may now be mounted
by an "old" kernel.

  Sound sane?

- I'm assuming that there are more reserved inodes available, and that
  the changes to tune2fs and mke2fs will be basically a copy-n-paste job
  from the `tune2fs -j' code.  Correct?

- I haven't thought about what fsck changes would be needed.

  Presumably quite a few.  For example, fsck should check that set-bits
  in the unwriten blobks file do not correspond to freed blocks.  If they
  do, that should be fixed up.

  And fsck can check each inodes number-of-unwritten-blocks counters
  against the unwritten blocks file (if we implement the per-inode
  number-of-unwritten-blocks counter)

  What else should fsck do?

- I haven't thought about the implications of porting this into ext3/4. 
  Probably the commit to the unwritten blocks file will need to be atomic
  with the commit to the user's file's metadata, so the unwritten-blocks
  file will effectively need to be in journalled-data mode.

  Or, more likely, we access the unwritten blocks file via the blockdev
  pagecache (ie: use bmap, like the journal file) and then we're just
  talking direct to the disk's blocks and it becomes just more fs metadata.

- I guess resize2fs will need to be taught about the unwritten blocks
  file: to shrink and grow it appropriately.


That's all I can think of for now - I probably missed something. 

Suggestions and thought are sought, please.


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-28 Thread Andrew Morton
On Thu, 28 Jun 2007 23:27:57 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

> > Please drop the non-ext4 patches from the ext4 tree and send incremental
> > patches against the (non-ext4) fallocate patches in -mm.
> 
> Please let us know what you think of Mingming's suggestion of posting
> all the fallocate patches including the ext4 ones as incremental ones
> against the -mm.

I think Mingming was asking that Ted move the current quilt tree into git,
presumably because she's working off git.

I'm not sure what to do, really.  The core kernel patches need to be in
Ted's tree for testing but that'll create a mess for me.

ug.

Options might be

a) I drop the fallocate patches from -mm and from the ext4 tree, hack up
   any needed build fixes, then just wait for it all to mature and then
   think about it again

b) We do what we normally don't do and reserve the syscall slots in mainline.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-28 Thread Andrew Morton
On Mon, 25 Jun 2007 18:58:10 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

> N O T E: 
> ---
> 1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
>of ext4 patch queue git tree hosted by Ted.

Why the heck are replacements for these things being sent out again when
they're already in -mm and they're already in Ted's queue (from which I
need to diligently drop them each time I remerge)?

Are we all supposed to re-review the entire patchset (or at least #4 and
#7) again?

The core kernel changes are not appropriate to the ext4 tree.

For a start, the syscall numbers in Ted's queue are wrong (other new
syscalls are pending).

Patches which add syscalls are an utter PITA to carry due to all the patch
conflicts and to the relatively frequent syscall renumbering (they don't
get numbered in time-of-arrival order due to differing rates at which patches
mature).

Please drop the non-ext4 patches from the ext4 tree and send incremental
patches against the (non-ext4) fallocate patches in -mm.

And try to get the code finished?  Time is pressing.

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] zero_user_page conversion

2007-06-23 Thread Andrew Morton
> On Wed, 20 Jun 2007 17:08:24 -0500 Eric Sandeen <[EMAIL PROTECTED]> wrote:
> Use zero_user_page() in cifs, ocfs2, ext4, and gfs2 where possible.

One patch, splattered across four maintainers, each of whom maintain separate
trees.

Sigh.  Please, don't.  _someone_ has to split this up, and it might as well
not be me.



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fw: [Bugme-new] [Bug 8643] New: Bug with negative timestamps on 64bit machines

2007-06-16 Thread Andrew Morton

someone ort to fix this


Begin forwarded message:

Date: Sat, 16 Jun 2007 17:50:35 -0700 (PDT)
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: [Bugme-new] [Bug 8643] New: Bug with negative timestamps on 64bit 
machines


http://bugzilla.kernel.org/show_bug.cgi?id=8643

   Summary: Bug with negative timestamps on 64bit machines
   Product: File System
   Version: 2.5
 KernelVersion: 2.6.21-1.3228.fc7
  Platform: All
OS/Version: Linux
  Tree: Fedora
Status: NEW
  Severity: normal
  Priority: P1
 Component: ext3
AssignedTo: [EMAIL PROTECTED]
ReportedBy: [EMAIL PROTECTED]


Most recent kernel where this bug did not occur:
Distribution: Fedora Core 7
Hardware Environment: AMD Athlon 64
Software Environment:
Problem Description:

Negative timestamps (i.e. before 1970) are incorrectly handled due to some
obvious 64bit issues.  They show up normally as long as the cache has not been
purged.

Steps to reproduce:

Touch a file in an ext3-file system with a negative timestamp, e.g.:

  # touch -t 19690101 /opt/foo

Unless you have been writing a lot of data to this partition after the last
command, the next one should display the following:

  #  ls -l /opt/foo
  -rw-r--r-- 1 root root 0 1969-01-01 00:00 /opt/foo

This is still correct.  But now we purge the file system cache by unmounting
and mounting the partition again:

  # umount /opt/foo
  # mount /opt/foo

Now the timestamp will be corrupted:

  # ls -l /opt/foo 
  -rw-r--r-- 1 root root 0 2105-02-07 06:28 /opt/foo

It seems obvious that the upper 32 bits of the 64 bits representing the time
stamp get lost, which explains the above observation.

On 32bit architectures there is no such problem.  I haven't managed to
reproduce this problem with other file systems (e.g. XFS).


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug, or are watching someone who is.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] fs/buffer.c:1821 in 2.6.22-rc4-mm2

2007-06-11 Thread Andrew Morton
On Sun, 10 Jun 2007 17:57:14 +0200 Eric Sesterhenn / Snakebyte <[EMAIL 
PROTECTED]> wrote:

> hi,
> 
> i got the following BUG while running the syscalls.sh
> from ltp-full-20070531 on an ext3 partition, it is easily reproducible
> for me
> 
> [  476.338068] [ cut here ]
> [  476.338223] kernel BUG at fs/buffer.c:1821!
> [  476.338324] invalid opcode:  [#1]
> [  476.338423] PREEMPT 
> [  476.338665] Modules linked in:
> [  476.338833] CPU:0
> [  476.338836] EIP:0060:[]Not tainted VLI
> [  476.338840] EFLAGS: 00010202   (2.6.22-rc4-mm2 #1)
> [  476.339206] EIP is at __block_prepare_write+0x64/0x410
> [  476.339311] eax: 0001   ebx: c136fbb8   ecx: c07faf28   edx:
> 0001
> [  476.339417] esi: c1dc9040   edi: c32d2dfc   ebp: c3733db8   esp:
> c3733d50
> [  476.339584] ds: 007b   es: 007b   fs:   gs: 0033  ss: 0068
> [  476.339690] Process vmsplice01 (pid: 7680, ti=c3733000 task=c351ed60
> task.ti=c3733000)
> [  476.339796] Stack: c3733d70 c0143e76 c1a0eab0 0046 
> c2509d64 0cd8 c136fbb8 
> [  476.340675]c32d2dfc 0296 c02313b6 c1086088 0050
> c02313b6 c1dc9040 c2509d50 
> [  476.341491]c1dc9054 c3733dc4 c02313e9 c3733dbc c015728d
> c32d2f0c  c136fbb8 
> [  476.342371] Call Trace:
> [  476.342565]  [] block_write_begin+0x83/0xf0
> [  476.342804]  [] ext3_write_begin+0xc8/0x1c0
> [  476.342987]  [] pagecache_write_begin+0x4f/0x150
> [  476.343243]  [] pipe_to_file+0x9b/0x170
> [  476.343418]  [] __splice_from_pipe+0x70/0x260
> [  476.343654]  [] splice_from_pipe+0x48/0x70
> [  476.343828]  [] generic_file_splice_write+0x88/0x130
> [  476.344066]  [] do_splice_from+0xb7/0xc0
> [  476.344240]  [] sys_splice+0x1a1/0x230
> [  476.344474]  [] sysenter_past_esp+0x5f/0x99
> [  476.344656]  [] 0xe410
> [  476.344882]  ===
> [  476.344984] INFO: lockdep is turned off.
> [  476.345084] Code: 00 0f 97 c2 e8 ee 2f 22 00 85 c0 74 04 0f 0b eb fe
> 31 d2 b8 28 af 7f c0 81 7d 08 00 10 00 00 0f 97 c2 e8 d0 2f 22 00 85 c0
> 74 04 <0f> 0b eb fe 8b 55 08 39 55 b0 0f 97 c0 0f b6 d0 b8 0c af 7f c0 
> [  476.350365] EIP: [] __block_prepare_write+0x64/0x410 SS:ESP
> 0068:c3733d50

Yep, vmsplice01 is not supported on -mm kernels ;)

Nick has a protofix but I don't think it's been tested yet.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH ext3/ext4] orphan list corruption due bad inode

2007-06-04 Thread Andrew Morton
On Mon, 04 Jun 2007 09:19:10 +0400 Vasily Averin <[EMAIL PROTECTED]> wrote:

> After ext3 orphan list check has been added into ext3_destroy_inode() (please 
> see my previous patch) the following situation has been detected:
>  EXT3-fs warning (device sda6): ext3_unlink: Deleting nonexistent file 
> (37901290), 0
>  Inode 0101a15b7840: orphan list check failed!
>  0773 6f665f00 74616d72 0573 65725f00 06737270 6600 616d726f
> ...
>  Call Trace: [] ext3_destroy_inode+0x79/0x90
>   [] sys_unlink+0x126/0x1a0
>   [] error_exit+0x0/0x81
>   [] system_call+0x7e/0x83
> 
> First messages said that unlinked inode has i_nlink=0, then ext3_unlink() 
> adds this inode into orphan list.
> 
> Second message means that this inode has not been removed from orphan list. 
> Inode dump has showed that i_fop = &bad_file_ops and it can be set in 
> make_bad_inode() only. Then I've found that ext3_read_inode() can call 
> make_bad_inode() without any error/warning messages, for example in the 
> following case:
> ...
> if (inode->i_nlink == 0) {
> if (inode->i_mode == 0 ||
> !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) {
> /* this inode is deleted */
> brelse (bh);
> goto bad_inode;
> ...
> 
> Bad inode can live some time, ext3_unlink can add it to orphan list, but
> ext3_delete_inode() do not deleted this inode from orphan list. As
> result we can have orphan list corruption detected in ext3_destroy_inode().
> 
> However it is not clear for me how to fix this issue correctly.
> 
> As far as i see is_bad_inode() is called after iget() in all places excluding 
> ext3_lookup() and ext3_get_parent(). I believe it makes sense to add bad 
> inode check to these functions too and call iput if bad inode detected.

Please avoid the 500-column paragraphs?

> Signed-off-by:Vasily Averin <[EMAIL PROTECTED]>
> 
> diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
> index 9bb046d..e3ac8c3 100644
> --- a/fs/ext3/namei.c
> +++ b/fs/ext3/namei.c
> @@ -1019,6 +1019,11 @@ static struct dentry *ext3_lookup(struct inode * dir, 
> struct dentry *dentry, str
>  
>   if (!inode)
>   return ERR_PTR(-EACCES);
> +
> + if (is_bad_inode(inode)) {
> + iput(inode);
> + return ERR_PTR(-ENOENT);
> + }
>   }
>   return d_splice_alias(inode, dentry);
>  }
> @@ -1054,6 +1059,11 @@ struct dentry *ext3_get_parent(struct dentry *child)
>   if (!inode)
>   return ERR_PTR(-EACCES);
>  
> + if (is_bad_inode(inode)) {
> + iput(inode);
> + return ERR_PTR(-ENOENT);
> + }
> +
>   parent = d_alloc_anon(inode);
>   if (!parent) {
>   iput(inode);

Seems reasonable.  So this prevents the bad inodes from getting onto the
orphan list in the first place?

What caused those inodes to be bad, anyway?  Memory allocation failures?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH ext3/ext4] orphan list check on destroy_inode

2007-06-04 Thread Andrew Morton
On Mon, 04 Jun 2007 09:18:55 +0400 Vasily Averin <[EMAIL PROTECTED]> wrote:

> Customers claims to ext3-related errors, investigation showed that ext3 
> orphan list has been corrupted and have the reference to non-ext3 inode. The 
> following debug helps to understand the reasons of this issue.
> 
> Signed-off-by:Vasily Averin <[EMAIL PROTECTED]>
> 
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 6e30629..46e2fa6 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -459,6 +459,21 @@ static struct inode *ext3_alloc_inode(struct super_block 
> *sb)
>  
>  static void ext3_destroy_inode(struct inode *inode)
>  {
> + if (!list_empty(&(EXT3_I(inode)->i_orphan))) {
> + int i, imax;
> + unsigned int *p;
> +
> + p = (unsigned int *)EXT3_I(inode);
> + imax = sizeof(struct ext3_inode_info) / sizeof(unsigned int);
> + printk("Inode %p: orphan list check failed!\n", EXT3_I(inode));
> + for (i = 0; i < imax; i++) {
> + if (i && ((i % 8) == 0))
> + printk("\n");
> + printk("%08x ", *p++);
> + }
> + printk("\n");
> + dump_stack();

umm, OK, but please use the new lib/hexdump.c for this.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fw: ext3 dir_index causes an error

2007-05-31 Thread Andrew Morton

Ted is dir_index maintainer ;)

That's a nice-looking bug report, btw.  Thanks.


Begin forwarded message:

Date: Fri, 01 Jun 2007 13:01:07 +0900
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: ext3 dir_index causes an error



Hello,

First of all, I really appricate your great works.
Now I've found a problem around dir_index feature.
Here is a report following linux/REPORTING-BUGS.


[1.] One line summary of the problem:
ext3 dir_index causes an error

[2.] Full description of the problem/report:
This is my local test program to reproduce this problem. The
readdir1.c calls creat(2), opendir(3) and readdir(3). And the shell
script execute it repeatedly with a brand-new ext3fs image on a
loopback device.
When the script adds '-O dir_index' to mkfs, some errors appear.

On a system with linux-2.6.21.3, ext3fs produces these error message,
and the filesystem seems to be corrupted.
--
kjournald starting.  Commit interval 5 seconds
EXT3 FS on loop0, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
:::
EXT3-fs: mounted filesystem with ordered data mode.
EXT3-fs error (device loop0): htree_dirblock_to_tree: bad entry in directory 
#2: rec_len is too small for name_len - offset=6924, inode=26, rec_len=244, 
name_len=249
EXT3-fs error (device loop0): htree_dirblock_to_tree: bad entry in directory 
#2: rec_len is too small for name_len - offset=6924, inode=26, rec_len=244, 
name_len=249
EXT3-fs error (device loop0): htree_dirblock_to_tree: bad entry in directory 
#2: rec_len is too small for name_len - offset=6924, inode=26, rec_len=244, 
name_len=249
kjournald starting.  Commit interval 5 seconds
:::
--

On the other system with linux-2.6.18 (debian etch kernel), the same
error appears.
When the script adds '-O ^dir_index' to mkfs, the problem never appears.

It is not everytime that these errors appear. So the shell script
executes the readdir1 test program repeatedly.
Recently I upgraded my debian system from version 3.1 'sarge' to 4.0
'etch'. The debian etch sets the dir_index feature by default. So I
found this problem.

[3.] Keywords (i.e., modules, networking, kernel):
ext3 dir_index

[4.] Kernel information
[4.1.] Kernel version (from /proc/version):
[4.2.] Kernel .config file:
[5.] Most recent kernel version which did not have the bug:
[6.] Output of Oops.. message (if applicable) with symbolic information
 resolved (see Documentation/oops-tracing.txt)
[7.] A small shell script or example program which triggers the
 problem (if possible)

(readdir1.c)

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

void fin(char *s)
{
perror(s);
exit(1);
}

void msg(int found, char *fname)
{
printf("%s%s found\n", fname, found?"":" not");
}

int
main(int argc, char *argv[])
{
DIR *dp;
struct dirent *de;
int err, found, i;
char a[250];

err = chdir(argv[1]);
if (err)
fin("chdir");

memset(a, 'a', sizeof(a)-1);
a[sizeof(a)-1] = 0;
for (i = 0; i < 16+1; i++) {
a[0]++;
err = creat(a, 0644);
if (err < 0)
fin("creat");

err = creat(argv[2], 0644);
if (err < 0)
fin("creat");
}

#if 0
err = unlink(argv[2]);
if (err && errno != ENOENT)
fin("unlink");
#endif

dp = opendir(".");
if (!dp)
fin("opendir");

de = readdir(dp);
if (!de)
fin("1st readdir");
assert(strcmp(argv[2], de->d_name));

#if 0
argv[2][0]++;
err = creat(argv[2], 0644);
if (err < 0)
fin("creat");

argv[2][0]--;
#endif
err = creat(argv[2], 0644);
if (err < 0)
fin("creat");

#if 0
err = unlink(argv[2]);
if (err && errno != ENOENT)
fin("unlink");
#endif

found = 0;
while ((de = readdir(dp)) && !found)
found = !strcmp(argv[2], de->d_name);
msg(found, argv[2]);

found = 0;
rewinddir(dp);
while ((de = readdir(dp)) && !found)
found = !strcmp(argv[2], de->d_name);
msg(found, argv[2]);

closedir(dp);
dp = opendir(".");
if (!dp)
fin("opendir");

found = 0;
while ((de = readdir(dp)) && !found)
found = !strcmp(argv[2], de->d_name);
msg(found, argv[2]);

return 0;
}
--
#!/bin/sh

img=rw.img
dir=rw
set -e
make /tmp/readdir1

cd /dev/shm
dd if=/dev/zero of=$img bs=1k count=4k 2> /dev/null
mkdir -p $dir
ulimit -c un

Re: [PATCH 1/9] readahead: introduce PG_readahead

2007-05-19 Thread Andrew Morton
On Sat, 19 May 2007 20:30:31 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Fri, May 18, 2007 at 11:28:24PM -0700, Andrew Morton wrote:
> > On Thu, 17 May 2007 06:47:53 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > 
> > > Introduce a new page flag: PG_readahead.
> > 
> > Is there any way in which we can avoid adding a new page flag?
> > 
> > We have the advantage that if the kernel very occasionally gets the wrong
> > result for PageReadahead(page), nothing particularly bad will happen, so we
> > can do racy things.
> > 
> > >From a quick peek, it appears that PG_readahead is only ever set against
> > non-uptodate pages.  If true we could perhaps exploit that: say,
> > PageReadahead(page) == PG_referenced && !PG_uptodate?
> 
> PG_uptodate will flip to 1 before the reader touches the page :(

OK.

> However, it may be possible to share the same bit with PG_reclaim or 
> PG_booked.
> Which one would be preferred?

I'd like PG_booked to go away too - I don't think we've put that under the
microscope yet.  If it remains then its scope will be "defined by the
filesystem", so readahead shouldn't use it.  PG_reclaim belongs to core VFS
so that's much better.

Let's not do anything ugly, slow or silly in there, but please do take a
look, see if there is an opportunity here.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6][TAKE4] fallocate system call

2007-05-18 Thread Andrew Morton
On Thu, 17 May 2007 19:41:15 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

> fallocate() is a new system call being proposed here which will allow
> applications to preallocate space to any file(s) in a file system.

I merged the first three patches into -mm, thanks.

All the system call numbers got changed due to recent additions.  They
may change in the future, too - nothing is stable until the code lands
in mainline.

I didn't merge any of the ext4 changes as they appear to be in Ted's
devel tree.  Although I didn't check that they are 100% the same in 
that tree.

What's the plan to get some ext4 updates into mainline, btw?  Things
seem to be rather gradual.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] resolve duplicate flag no for PG_lazyfree

2007-05-14 Thread Andrew Morton
On Mon, 14 May 2007 14:06:19 -0400
Theodore Tso <[EMAIL PROTECTED]> wrote:

> On Sun, May 13, 2007 at 10:46:30PM -0700, Andrew Morton wrote:
> > otoh, the intersection between pages which are PageBooked() and pages which
> > are PageLazyFree() should be zreo, so it'd be good to actually formalise
> > this reuse within the ext4 patches.
> > 
> > otoh2, PageLazyFree() could have reused PG_owner_priv_1.
> > 
> > Rik, Ted: any thoughts?  We do need to scrimp on page flags: when we
> > finally run out, we're screwed.
> 
> It makes sense to me.  PG_lazyfree is currently only in -mm, right?

Ah, yes, I got confused, sorry.

>  I
> don't see it in my git tree.  It would probably would be a good idea
> to make sure that we check to add some sanity checking code if it
> isn't there already that PG_lazyfree isn't already set when try to set
> PG_lazyfree (just in case there is a bug in the future which causes
> the should-never-happen case of trying lazy free a PageBooked page).
> 

Actually, I think the current status of
lazy-freeing-of-memory-through-madv_free.patch is "might not be needed".  I
_think_ we've determined that 0a27a14a62921b438bb6f33772690d345a089be6
sufficiently fixed the perfomance problems we had in there?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] resolve duplicate flag no for PG_lazyfree

2007-05-13 Thread Andrew Morton
On Mon, 14 May 2007 10:37:18 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:

> PG_lazyfree and PG_booked shares the same bit.
> 
> Either it is a bug that shall fixed by the following patch, or
> the situation should be explicitly documented?
> 
> Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
> ---
>  include/linux/page-flags.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-2.6.21-mm2.orig/include/linux/page-flags.h
> +++ linux-2.6.21-mm2/include/linux/page-flags.h
> @@ -91,7 +91,7 @@
>  #define PG_buddy 19  /* Page is free, on buddy lists */
>  #define PG_booked20  /* Has blocks reserved on-disk */
>  
> -#define PG_lazyfree  20  /* MADV_FREE potential throwaway */
> +#define PG_lazyfree  21  /* MADV_FREE potential throwaway */
>  
>  /* PG_owner_priv_1 users should have descriptive aliases */
>  #define PG_checked   PG_owner_priv_1 /* Used by some filesystems */

That's an accident: PG_lazyfree got added but the out-of-tree ext4 patches
didn't get updated.

otoh, the intersection between pages which are PageBooked() and pages which
are PageLazyFree() should be zreo, so it'd be good to actually formalise
this reuse within the ext4 patches.

otoh2, PageLazyFree() could have reused PG_owner_priv_1.

Rik, Ted: any thoughts?  We do need to scrimp on page flags: when we
finally run out, we're screwed.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 07 May 2007 17:00:24 -0700
Mingming Cao <[EMAIL PROTECTED]> wrote:

> > +   while (ret >= 0 && ret < max_blocks) {
> > +   block = block + ret;
> > +   max_blocks = max_blocks - ret;
> > +   ret = ext4_ext_get_blocks(handle, inode, block,
> > + max_blocks, &map_bh,
> > + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +   BUG_ON(!ret);
> > +   if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> > +   && ((block + ret) > (i_size_read(inode) << 
> > blkbits)))
> > +   nblocks = nblocks + ret;
> > +   }
> > +
> > +   if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, 
> > &retries))
> > +   goto retry;
> > +
> > Now the interesting question is: what do we do if we get halfway through
> > this loop and then run out of space?  We could leave the disk all filled up
> > and then return failure to the caller, but that's pretty poor behaviour,
> > IMO.
> > 
> The current code handles earlier ENOSPC by three times retries. After
> that if we still run out of space, then it's propably right to notify
> the caller there isn't much space left.
> 
> We could extend the block reservation window size before the while loop
> so we could get a lower chance to get more fragmented.

yes, but my point is that the proposed behaviour is really quite bad.

We will attempt to allocate the disk space and then we will return failure,
having consumed all the disk space and having partially and uselessly
populated an unknown amount of the file.

Userspace could presumably repair the mess in most situations by truncating
the file back again.  The kernel cannot do that because there might be live
data in amongst there.

So we'd need to either keep track of which blocks were newly-allocated and
then free them all again on the error path (doesn't work right across
commit+crash+recovery) or we could later use the space-reservation scheme which
delayed allocation will need to introduce.

Or we could decide to live with the above IMO-crappy behaviour.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 19:14:42 -0400
Theodore Tso <[EMAIL PROTECTED]> wrote:

> On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > > Actually, this is a non-issue.  The reason that it is handled for 
> > > extent-only
> > > is that this is the only way to allocate space in the filesystem without
> > > doing the explicit zeroing.  For other filesystems (including ext3 and
> > > ext4 with block-mapped files) the filesystem should return an error (e.g.
> > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in 
> > > userspace.
> > 
> > It can be a bit suboptimal from the layout POV.  The reservations code will
> > largely save us here, but kernel support might make it a bit better.
> 
> Actually, the reservations code won't matter, since glibc will fall
> back to its current behavior, which is it will do the preallocation by
> explicitly writing zeros to the file.

No!  Reservations code is *critical* here.  Without reservations, we get
disastrously-bad layout if two processes were running a large fallocate()
at the same time.  (This is an SMP-only problem, btw: on UP the timeslice
lengths save us).

My point is that even though reservations save us, we could do even-better
in-kernel.

But then, a smart application would bypass the glibc() fallocate()
implementation and would tune the reservation window size and would use
direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).

> This wlil result in the same
> layout as if we had done the persistent preallocation, but of course
> it will mean the posix_fallocate() could potentially take a long time
> if you're a PVR and you're reserving a gig or two for a two hour movie
> at high quality.  That seems suboptimal, granted, and ideally the
> application should be warned about this before it calls
> posix_fallocate().  On the other hand, it's what happens today, all
> the time, so applications won't be too badly surprised.

A PVR implementor would take all this over and would do it themselves, for
sure.

> If we think applications programmers badly need to know in advance if
> posix_fallocate() will be fast or slow, probably the right thing is to
> define a new fpathconf() configuration option so they can query to see
> whether a particular file will support a fast posix_fallocate().  I'm
> not 100% convinced such complexity is really needed, but I'm willing
> to be convinced  what do folks think?
> 

An application could do sys_fallocate(one-byte) to work out whether it's
supported in-kernel, I guess.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 15:21:04 -0700
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> On May 07, 2007  13:58 -0700, Andrew Morton wrote:
> > Final point: it's fairly disappointing that the present implementation is
> > ext4-only, and extent-only.  I do think we should be aiming at an ext4
> > bitmap-based implementation and an ext3 implementation.
> 
> Actually, this is a non-issue.  The reason that it is handled for extent-only
> is that this is the only way to allocate space in the filesystem without
> doing the explicit zeroing.  For other filesystems (including ext3 and
> ext4 with block-mapped files) the filesystem should return an error (e.g.
> -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

hrm, spose so.

It can be a bit suboptimal from the layout POV.  The reservations code will
largely save us here, but kernel support might make it a bit better.

Totally blowing pagecache could be a problem.  Fixable in userspace by
using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andrew Morton
On Mon, 7 May 2007 05:37:54 -0600
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> > > + block = offset >> blkbits;
> > > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > > +  - block;
> > > + mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > > + credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > > + mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> > 
> > Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> > space, and that this disk space will require an arbitrary amount of
> > metadata, how can we work out how much journal space we'll be needing
> > without at least looking at `len'?
> 
> Good question.
> 
> The uninitialized extent can cover up to 128MB with a single entry.
> If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> function returns the maximum number of extents needed to insert a leaf,
> including splitting all of the index blocks.  That would allow up to 43GB
> (340 extents/block * 128MB) to be preallocated, but it still needs to take
> the size of the preallocation into account (adding 3 blocks per 43GB - a
> leaf block, a bitmap block and a group descriptor).

I think the use of ext4_journal_extend() (as Amit has proposed) will help
here, but it is not sufficient.

Because under some circumstances, a journal_extend() failure could mean
that we fail to allocate all the required disk space.  If it is infrequent
enough, that is acceptable when the caller is using fallocate() for
performance reasons.

But it is very much not acceptable if the caller is using fallocate() for
space-reservation reasons.  If you used fallocate to reserve 1GB of disk
and fallocate() "succeeded" and you later get ENOSPC then you'd have a
right to get a bit upset.

So I think the ext3/4 fallocate() implementation will need to be
implemented as a loop: 

while (len) {
journal_start();
len -= do_fallocate(len, ...);
journal_stop();
}


Now the interesting question is: what do we do if we get halfway through
this loop and then run out of space?  We could leave the disk all filled up
and then return failure to the caller, but that's pretty poor behaviour,
IMO.



Does the proposed implementation handle quotas correctly, btw?  Has that
been tested?


Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only.  I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: JBD: ext2online wants too many credits (744 > 256)

2007-05-06 Thread Andrew Morton
On Mon, 7 May 2007 00:26:26 +0200 Frank van Maarseveen <[EMAIL PROTECTED]> 
wrote:

> 2.6.20.6, FC4:
> 
> I created a 91248k ext3 fs with 4k blocksize:
> 
> | mke2fs -j -b 4096 /dev/vol1/project 
> | mke2fs 1.38 (30-Jun-2005)
> | Filesystem label=
> | OS type: Linux
> | Block size=4096 (log=2)
> | Fragment size=4096 (log=2)
> | 23552 inodes, 23552 blocks
> | 1177 blocks (5.00%) reserved for the super user
> | First data block=0
> | Maximum filesystem blocks=25165824
> | 1 block group
> | 32768 blocks per group, 32768 fragments per group
> | 23552 inodes per group
> 
> Writing inode tables: done
> Creating journal (1024 blocks): done
> Writing superblocks and filesystem accounting information: done
> 
> Next, I tried to resize it to about 3G using ext2online while mounted:
> 
> | # ext2online /dev/vol1/project 
> | ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
> | ext2online: ext2_ioctl: No space left on device
> |
> | ext2online: unable to resize /dev/mapper/vol1-project
> 
> At that time the kernel said:
> 
> |JBD: ext2online wants too many credits (744 > 256)
> 
> What is the limitation I should be aware of? Has it something to do with
> the journal log size?
> 
> The size actually did increase a bit, to 128112k.
> 
> 
> Steps to reproduce:
> Create a 3G partition, say /dev/vol1/project
> mke2fs -j -b 4096 /dev/vol1/project 22812
> mount it
> ext2online /dev/vol1/project said:
> 
> | ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
> | ext2online: ext2_ioctl: No space left on device
> | 
> | ext2online: unable to resize /dev/mapper/vol1-project
> 
> kernel said:
> 
> | JBD: ext2online wants too many credits (721 > 256)
> 

(added linux-ext4 to cc)
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] ext3: remove inode constructor

2007-05-05 Thread Andrew Morton
On Sat, 5 May 2007 11:58:45 +0300 (EEST) Pekka J Enberg <[EMAIL PROTECTED]> 
wrote:

> On Fri, 4 May 2007, Andrew Morton wrote:
> > I got 100% rejects against this because Christoph has already had
> > his paws all over the slab constructor code everywhere.
> > 
> > Was going to fix it up but then decided that we ought to make changes
> > like this to ext4 as well.  Ideally beforehand, but simultaneously is
> > OK as long as it's simple enough.
> 
> I'll send you proper patches for them (and will convert other filesystems 
> too).

May as well.

> On Fri, 4 May 2007, Andrew Morton wrote:
> > btw, for a benchmark I'd suggest just a silly create-1-files
> > tight loop rather than something more complex like postmark.
> 
> Do you want me to redo the benchmarks or are you happy enough with the 
> postmark numbers?

I doubt if this is measurable, really.  It'll be something like the
difference between an L1 hit and an L2 hit in amongst all the other stuff
we do on a per-inode basis.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] ext3: remove inode constructor

2007-05-04 Thread Andrew Morton
On Fri, 4 May 2007 13:14:35 +0300 (EEST)
Pekka J Enberg <[EMAIL PROTECTED]> wrote:

> As explained by Christoph Lameter, ext3_alloc_inode() touches the same
> cache line as init_once() so we gain nothing from using slab
> constructors.  The SLUB allocator will be more effective without it
> (free pointer can be placed inside the free'd object), so move inode
> initialization to ext3_alloc_inode completely.

I got 100% rejects against this because Christoph has already had
his paws all over the slab constructor code everywhere.

Was going to fix it up but then decided that we ought to make changes
like this to ext4 as well.  Ideally beforehand, but simultaneously is
OK as long as it's simple enough.

btw, for a benchmark I'd suggest just a silly create-1-files
tight loop rather than something more complex like postmark.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.21-git4 Scheduler, NOHZ, VFS bugs

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 18:20:51 +0200 Michal Piotrowski <[EMAIL PROTECTED]> wrote:

> I ran this script tree times,
> 
> #! /bin/sh
> 
> for i in `find /sys/ -type f`
> do
> echo "wyświetlam $i"
> sudo cat $i > /dev/null
> done
> 
> First run - scheduler bug
> Second run - NOHZ bug
> Third - VFS bug
> 
> H...
> 
> [93298.252601] BUG: at /mnt/md0/devel/linux-git/kernel/sched.c:3241 
> add_preempt_count()
> [93298.260334]  [] show_trace_log_lvl+0x1a/0x2f
> [93298.265507]  [] show_trace+0x12/0x14
> [93298.269974]  [] dump_stack+0x16/0x18
> [93298.274434]  [] add_preempt_count+0x89/0x8b
> [93298.279501]  [] irq_enter+0xd/0x2e
> [93298.283788]  [] smp_apic_timer_interrupt+0x2a/0x84
> [93298.289458]  [] apic_timer_interrupt+0x33/0x38
> [93298.294783]  [] show_uevent+0x58/0xcd
> [93298.299329]  ===
> [93390.468056] NOHZ: local_softirq_pending 22
> [93447.105850] NOHZ: local_softirq_pending 22
> [93450.332884] BUG: unable to handle kernel paging request at virtual address 
> 3e343c0c
> [93450.340626]  printing eip:
> [93450.34] c018e2bc
> [93450.345520] *pde = 
> [93450.348314] Oops:  [#1]

Nice.  What was the last file which it read before crashing?

Are you able to consistently crash it by reading just that file?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 11:39:22 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > I'm still not understanding.  The terms you're using are a bit ambiguous.
> > 
> > What does "find some dirty unallocated blocks" mean?  Find a page which is
> > dirty and which does not have a disk mapping?
> > 
> > Normally the above operation would be implemented via
> > ext4_writeback_writepage(), and it runs under lock_page().
> 
> I'm mostly worried about delayed allocation case. My impression was that
> holding number of pages locked isn't a good idea, even if they're locked
> in index order. so, I was going to turn number of pages writeback, then
> allocate blocks for all of them at once, then put proper blocknr's into
> bh's (or PG_mappedtodisk?).

ooh, that sounds hacky and quite worrisome.  If someone comes in and does
an fsync() we've lost our synchronisation point.  Yes, all callers happen
to do

lock_page();
wait_on_page_writeback();

(I think) but we've never considered a bare PageWriteback() as something
which protects page internals.  We're OK wrt page reclaim and we're OK wrt
truncate and invalidate.  As long as the page is uptodate we _should_ be OK
wrt readpage().  But still, it'd be better to use the standard locking
rather than inventing new rules, if poss.


I'd be 100% OK with locking multiple pages in ascending pgoff_t order. 
Locking the page is the standard way of doing this synchronisation and the
only problem I can think of is that having a tremendous number of pages
locked could cause the wake_up_page() waitqueue hashes to get overloaded
and go slow.  But it's also possible to lock many, many pages with
readahead and nobody has reported problems in there.


> > 
> > 
> >>going to commit
> >>find inode I dirty
> >>do NOT find these blocks because they're
> >>  allocated only, but pages/bhs aren't 
> >> mapped
> >>  to them
> >>start commit
> > 
> > I think you're assuming here that commit would be using ->t_sync_datalist
> > to locate dirty buffer_heads.
> 
> nope, I mean sb->inode->page walk.
> 
> > But under this proposal, t_sync_datalist just gets removed: the new
> > ordered-data mode _only_ need to do the sb->inode->page walk.  So if I'm
> > understanding you, the way in which we'd handle any such race is to make
> > kjournald's writeback of the dirty pages block in lock_page().  Once it
> > gets the page lock it can look to see if some other thread has mapped the
> > page to disk.
> 
> if I'm right holding number of pages locked, then they won't be locked, but
> writeback. of course kjournald can block on writeback as well, but how does
> it find pages with *newly allocated* blocks only?

I don't think we'd want kjournald to do that.  Even if a page was dirtied
by an overwrite, we'd want to write it back during commit, just from a
quality-of-implementation point of view.  If we were to leave these pages
unwritten during commit then a post-recovery file could have a mix of
up-to-five-second-old data and up-to-30-seconds-old data.

> > It may turn out that kjournald needs a private way of getting at the
> > I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
> > had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> > anyway, with a tagged search.  But I expect that a single pass through the
> > superblock's dirty inodes would suffice for ordered-data.  Files which
> > have chattr +j would screw things up, as usual.
> 
> not dirty inodes only, but rather some fast way to find pages with newly
> allocated pages.

Newly allocated blocks, you mean?

Just write out the overwritten blocks as well as the new ones, I reckon. 
It's what we do now.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Andrew Morton
On Fri, 04 May 2007 10:57:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:
> > 
> >> Andrew Morton wrote:
> >>> Yes, there can be issues with needing to allocate journal space within the
> >>> context of a commit.  But
> >> no-no, this isn't required. we only need to mark pages/blocks within
> >> transaction, otherwise race is possible when we allocate blocks in 
> >> transaction,
> >> then transacton starts to commit, then we mark pages/blocks to be flushed
> >> before commit.
> > 
> > I don't understand.  Can you please describe the race in more detail?
> 
> if I understood your idea right, then in data=ordered mode, commit thread 
> writes
> all dirty mapped blocks before real commit.
> 
> say, we have two threads: t1 is a thread doing flushing and t2 is a commit 
> thread
> 
> t1t2
> find dirty inode I
> find some dirty unallocated blocks
> journal_start()
> allocate blocks
> attach them to I
> journal_stop()

I'm still not understanding.  The terms you're using are a bit ambiguous.

What does "find some dirty unallocated blocks" mean?  Find a page which is
dirty and which does not have a disk mapping?

Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().


>   going to commit
>   find inode I dirty
>   do NOT find these blocks because they're
> allocated only, but pages/bhs aren't 
> mapped
> to them
>   start commit

I think you're assuming here that commit would be using ->t_sync_datalist
to locate dirty buffer_heads.

But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ need to do the sb->inode->page walk.  So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page().  Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.



It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search.  But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data.  Files which
have chattr +j would screw things up, as usual.

I assume (hope) that your delayed allocation code implements
->writepages()?  Doing the allocation one-page-at-a-time sounds painful...

> 
> map pages/bhs to just allocate blocks
> 
> 
> so, either we mark pages/bhs someway within journal_start()--journal_stop() or
> commit thread should do lookup for all dirty pages. the latter doesn't sound 
> nice, IMHO.
> 

I don't think I'm understanding you fully yet.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-03 Thread Andrew Morton
On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > Yes, there can be issues with needing to allocate journal space within the
> > context of a commit.  But
> 
> no-no, this isn't required. we only need to mark pages/blocks within
> transaction, otherwise race is possible when we allocate blocks in 
> transaction,
> then transacton starts to commit, then we mark pages/blocks to be flushed
> before commit.

I don't understand.  Can you please describe the race in more detail?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Andrew Morton
On Fri, 4 May 2007 16:07:31 +1000 David Chinner <[EMAIL PROTECTED]> wrote:

> On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> 
> > wrote:
> > 
> > > This patch implements the fallocate() system call and adds support for
> > > i386, x86_64 and powerpc.
> > > 
> > > ...
> > > +{
> > > + struct file *file;
> > > + struct inode *inode;
> > > + long ret = -EINVAL;
> > > +
> > > + if (len == 0 || offset < 0)
> > > + goto out;
> > 
> > The posix spec implies that negative `len' is permitted - presumably 
> > "allocate
> > ahead of `offset'".  How peculiar.
> 
> I just checked the man page for posix_fallocate() and it says:
> 
>   EINVAL  offset or len was less than zero.
> 
> We should probably follow this lead.

Yes, I think so.  I'm suspecting that
http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
is just buggy.  Or I can't read.

I mean, if we're going to support negative `len' then is the byte at
`offset' inside or outside the segment?  Head spins.

However it would be neat if someone could test $OTHER_OS and, perhaps more
importantly, the present glibc emulation (which I assume your manpage is
referring to, so this would be a manpage test ;)).

> > > +
> > > + ret = -ENODEV;
> > > + if (!S_ISREG(inode->i_mode))
> > > + goto out_fput;
> > 
> > So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> > seems a bit silly of them.
> 
> H - I thought that the intention of sys_fallocate() was to
> be generic enough to eventually allow preallocation on directories.
> If that is the case, then this check will prevent that

The above opengroup page only permits S_ISREG.  Preallocating directories
sounds quite useful to me, although it's something which would be pretty
hard to emulate if the FS doesn't support it.  And there's a decent case to
be made for emulating it - run-anywhere reasons.  Does glibc emulation support
directories?  Quite unlikely.

But yes, sounds like a desirable thing.  Would XFS support it easily if the 
above
check was relaxed?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Andrew Morton
On Thu, 3 May 2007 21:29:55 -0700 Andrew Morton <[EMAIL PROTECTED]> wrote:

> > +   ret = -EFBIG;
> > +   if (offset + len > inode->i_sb->s_maxbytes)
> > +   goto out_fput;
> 
> This code does handle offset+len going negative, but only by accident, I
> suspect.

But it doesn't handle offset+len wrapping through zero.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-03 Thread Andrew Morton
On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

> This patch has the ext4 implemtation of fallocate system call.
> 
> ...
>
> + /* ext4_can_extents_be_merged should have checked that either
> +  * both extents are uninitialized, or both aren't. Thus we
> +  * need to check only one of them here.
> +  */

Please always format multiline comments like this:

/*
 * ext4_can_extents_be_merged should have checked that either
 * both extents are uninitialized, or both aren't. Thus we
 * need to check only one of them here.
 */

> ...
>
> +/*
> + * ext4_fallocate:
> + * preallocate space for a file
> + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> + */

This description is rather thin.  What is the filesystem's actual behaviour
here?  If the file is using extents then the implementation will do
.  If the file is using bitmaps then we will do .

But what?   Here is where it should be described.

> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> +{
> + handle_t *handle;
> + ext4_fsblk_t block, max_blocks;
> + int ret, ret2, nblocks = 0, retries = 0;
> + struct buffer_head map_bh;
> + unsigned int credits, blkbits = inode->i_blkbits;
> +
> + /* Currently supporting (pre)allocate mode _only_ */
> + if (mode != FA_ALLOCATE)
> + return -EOPNOTSUPP;
> +
> + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> + return -ENOTTY;

So we don't implement fallocate on bitmap-based files!  Well that's huge
news.  The changelog would be an appropriate place to communicate this,
along with reasons why, or a description of the plan to fix it.

Also, posix says nothing about fallocate() returning ENOTTY.

> + block = offset >> blkbits;
> + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> +  - block;
> + mutex_lock(&EXT4_I(inode)->truncate_mutex);
> + credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> + mutex_unlock(&EXT4_I(inode)->truncate_mutex);

Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
space, and that this disk space will require an arbitrary amount of
metadata, how can we work out how much journal space we'll be needing
without at least looking at `len'?

> + handle=ext4_journal_start(inode, credits +

Please always put spaces around "="

> + EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);

And around "+"

> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +retry:
> + ret = 0;
> + while (ret >= 0 && ret < max_blocks) {
> + block = block + ret;
> + max_blocks = max_blocks - ret;
> + ret = ext4_ext_get_blocks(handle, inode, block,
> +   max_blocks, &map_bh,
> +   EXT4_CREATE_UNINITIALIZED_EXT, 0);
> + BUG_ON(!ret);

BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
ext4_error() would be safer and more useful here.

> + if (ret > 0 && test_bit(BH_New, &map_bh.b_state)

Use buffer_new() here.   A separate patch which fixes the three existing
instances of open-coded BH_foo usage would be appreciated.

> + && ((block + ret) > (i_size_read(inode) << blkbits)))

Check for wrap though the sign bit and through zero please.

> + nblocks = nblocks + ret;
> + }
> +
> + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> + goto retry;
> +
> + /* Time to update the file size.
> +  * Update only when preallocation was requested beyond the file size.
> +  */

Fix comment layout.

> + if ((offset + len) > i_size_read(inode)) {

Both the lhs and the rhs here are signed.  Please review for possible
overflows through the sign bit and through zero.  Perhaps a comment
explaining why it's correct would be appropriate.


> + if (ret > 0) {
> + /* if no error, we assume preallocation succeeded completely */
> + mutex_lock(&inode->i_mutex);
> + i_size_write(inode, offset + len);
> + EXT4_I(inode)->i_disksize = i_size_read(inode);
> + mutex_unlock(&inode->i_mutex);
> + } else if (ret < 0 && nblocks) {
> + /* Handle partial allocation scenario */

The above two comments should be indented one additional tabstop.

> + loff_t newsize;
> + mutex_lock(&inode->i_mutex);
> + newsize  = (nblocks << blkbits) + i_size_read(inode);
> + i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits));
> + EXT4_I(inode)->i_disksize = i_size_read(inode);
> + mut

Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-05-03 Thread Andrew Morton
On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

> This patch adds write support for preallocated (using fallocate system
> call) blocks/extents. The preallocated extents in ext4 are marked
> "uninitialized", hence they need special handling especially while
> writing to them. This patch takes care of that.
> 
> ...
>
>  /*
> + * ext4_ext_try_to_merge:
> + * tries to merge the "ex" extent to the next extent in the tree.
> + * It always tries to merge towards right. If you want to merge towards
> + * left, pass "ex - 1" as argument instead of "ex".
> + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
> + * 1 if they got merged.

OK.

> + */
> +int ext4_ext_try_to_merge(struct inode *inode,
> + struct ext4_ext_path *path,
> + struct ext4_extent *ex)
> +{
> + struct ext4_extent_header *eh;
> + unsigned int depth, len;
> + int merge_done=0, uninitialized = 0;

space around "=", please.

Many people prefer not to do the multiple-definitions-per-line, btw:

int merge_done = 0;
int uninitialized = 0;

reasons:

- If gives you some space for a nice comment

- It makes patches much more readable, and it makes rejects easier to fix

- standardisation.

> + depth = ext_depth(inode);
> + BUG_ON(path[depth].p_hdr == NULL);
> + eh = path[depth].p_hdr;
> +
> + while (ex < EXT_LAST_EXTENT(eh)) {
> + if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
> + break;
> + /* merge with next extent! */
> + if (ext4_ext_is_uninitialized(ex))
> + uninitialized = 1;
> + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
> + + ext4_ext_get_actual_len(ex + 1));
> + if (uninitialized)
> + ext4_ext_mark_uninitialized(ex);
> +
> + if (ex + 1 < EXT_LAST_EXTENT(eh)) {
> + len = (EXT_LAST_EXTENT(eh) - ex - 1)
> + * sizeof(struct ext4_extent);
> + memmove(ex + 1, ex + 2, len);
> + }
> + eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);

Kenrel convention is to put spaces around "-"

> + merge_done = 1;
> + BUG_ON(eh->eh_entries == 0);

eek, scary BUG_ON.  Do we really need to be that severe?  Would it be
better to warn and run ext4_error() here?

> + }
> +
> + return merge_done;
> +}
> +
> +
>
> ...
>
> +/*
> + * ext4_ext_convert_to_initialized:
> + * this function is called by ext4_ext_get_blocks() if someone tries to write
> + * to an uninitialized extent. It may result in splitting the uninitialized
> + * extent into multiple extents (upto three). Atleast one initialized extent
> + * and atmost two uninitialized extents can result.

There are some typos here

> + * There are three possibilities:
> + *   a> No split required: Entire extent should be initialized.
> + *   b> Split into two extents: Only one end of the extent is being written 
> to.
> + *   c> Split into three extents: Somone is writing in middle of the extent.

and here

> + */
> +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
> + struct ext4_ext_path *path,
> + ext4_fsblk_t iblock,
> + unsigned long max_blocks)
> +{
> + struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
> + struct ext4_extent_header *eh;
> + unsigned int allocated, ee_block, ee_len, depth;
> + ext4_fsblk_t newblock;
> + int err = 0, ret = 0;
> +
> + depth = ext_depth(inode);
> + eh = path[depth].p_hdr;
> + ex = path[depth].p_ext;
> + ee_block = le32_to_cpu(ex->ee_block);
> + ee_len = ext4_ext_get_actual_len(ex);
> + allocated = ee_len - (iblock - ee_block);
> + newblock = iblock - ee_block + ext_pblock(ex);
> + ex2 = ex;
> +
> + /* ex1: ee_block to iblock - 1 : uninitialized */
> + if (iblock > ee_block) {
> + ex1 = ex;
> + ex1->ee_len = cpu_to_le16(iblock - ee_block);
> + ext4_ext_mark_uninitialized(ex1);
> + ex2 = &newex;
> + }
> + /* for sanity, update the length of the ex2 extent before
> +  * we insert ex3, if ex1 is NULL. This is to avoid temporary
> +  * overlap of blocks.
> +  */
> + if (!ex1 && allocated > max_blocks)
> + ex2->ee_len = cpu_to_le16(max_blocks);
> + /* ex3: to ee_block + ee_len : uninitialised */
> + if (allocated > max_blocks) {
> + unsigned int newdepth;
> + ex3 = &newex;
> + ex3->ee_block = cpu_to_le32(iblock + max_blocks);
> + ext4_ext_store_pblock(ex3, newblock + max_blocks);
> + ex3->ee_len = cpu_to_le16(allocated - max_blocks);
> + ext4_ext_mark_uni

Re: [PATCH 3/5] ext4: Extent overlap bugfix

2007-05-03 Thread Andrew Morton
On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

> +unsigned int ext4_ext_check_overlap(struct inode *inode,
> + struct ext4_extent *newext,
> + struct ext4_ext_path *path)
> +{
> + unsigned long b1, b2;
> + unsigned int depth, len1;
> +
> + b1 = le32_to_cpu(newext->ee_block);
> + len1 = le16_to_cpu(newext->ee_len);
> + depth = ext_depth(inode);
> + if (!path[depth].p_ext)
> + goto out;
> + b2 = le32_to_cpu(path[depth].p_ext->ee_block);
> +
> + /* get the next allocated block if the extent in the path
> +  * is before the requested block(s) */
> + if (b2 < b1) {
> + b2 = ext4_ext_next_allocated_block(path);
> + if (b2 == EXT_MAX_BLOCK)
> + goto out;
> + }
> +
> + if (b1 + len1 > b2) {

Are we sure that b1+len cannot wrap through zero here?

> + newext->ee_len = cpu_to_le16(b2 - b1);
> + return 1;
> + }
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-03 Thread Andrew Morton
On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

> This patch implements the fallocate() system call and adds support for
> i386, x86_64 and powerpc.
> 
> ...
>
> +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

Please add a comment over this function which specifies its behaviour. 
Really it should be enough material from which a full manpage can be
written.

If that's all too much, this material should at least be spelled out in the
changelog.  Because there's no way in which this change can be fully
reviewed unless someone (ie: you) tells us what it is setting out to
achieve.

If we 100% implement some standard then a URL for what we claim to
implement would suffice.  Given that we're at least using different types from
posix I doubt if such a thing would be sufficient.

And given the complexity and potential variability within the filesystem
implementations of this, I'd expect that _something_ additional needs to be
said?

> +{
> + struct file *file;
> + struct inode *inode;
> + long ret = -EINVAL;
> +
> + if (len == 0 || offset < 0)
> + goto out;

The posix spec implies that negative `len' is permitted - presumably "allocate
ahead of `offset'".  How peculiar.

> + ret = -EBADF;
> + file = fget(fd);
> + if (!file)
> + goto out;
> + if (!(file->f_mode & FMODE_WRITE))
> + goto out_fput;
> +
> + inode = file->f_path.dentry->d_inode;
> +
> + ret = -ESPIPE;
> + if (S_ISFIFO(inode->i_mode))
> + goto out_fput;
> +
> + ret = -ENODEV;
> + if (!S_ISREG(inode->i_mode))
> + goto out_fput;

So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
seems a bit silly of them.

> + ret = -EFBIG;
> + if (offset + len > inode->i_sb->s_maxbytes)
> + goto out_fput;

This code does handle offset+len going negative, but only by accident, I
suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
here would settle the reader's mind.

> + if (inode->i_op && inode->i_op->fallocate)
> + ret = inode->i_op->fallocate(inode, mode, offset, len);
> + else
> + ret = -ENOSYS;

If we _are_ going to support negative `len', as posix suggests, I think we
should perform the appropriate sanity conversions to `offset' and `len'
right here, rather than expecting each filesystem to do it.

If we're not going to handle negative `len' then we should check for it.

> +out_fput:
> + fput(file);
> +out:
> + return ret;
> +}
> +EXPORT_SYMBOL(sys_fallocate);

I don't believe this needs to be exported to modules?

> +/*
> + * fallocate() modes
> + */
> +#define FA_ALLOCATE  0x1
> +#define FA_DEALLOCATE0x2

Now those aren't in posix.  They should be documented, along with their
expected semantics.

>  #ifdef __KERNEL__
>  
>  #include 
> @@ -1125,6 +1131,7 @@ struct inode_operations {
>   ssize_t (*listxattr) (struct dentry *, char *, size_t);
>   int (*removexattr) (struct dentry *, const char *);
>   void (*truncate_range)(struct inode *, loff_t, loff_t);
> + long (*fallocate)(struct inode *, int, loff_t, loff_t);

I really do think it's better to put the variable names in definitions such
as this.  Especially when we have two identically-typed variables next to
each other like that.  Quick: which one is the offset and which is the
length?


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-03 Thread Andrew Morton
On Thu, 03 May 2007 21:38:10 +0400
Alex Tomas <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > We can make great improvements here, and I've (twice) previously decribed
> > how: hoist the entire ordered-mode data handling out of ext3, and out of
> > the buffer_head layer and move it up into the VFS pagecache layer. 
> > Basically, do ordered-data with a commit-time inode walk, calling
> > do_sync_mapping_range().
> > 
> > Do it in the VFS.  Make reiserfs use it, remove reiserfs ordered-mode too. 
> > Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.
> 
> I'm not sure it's that easy.
> 
> if we move to pages, then we have to mark pages to be flushed holding
> transaction open. now take delayed allocation into account: we need
> to allocate number of blocks at once and then mark all pages mapped,
> again within context of the same transaction.

Yes, there can be issues with needing to allocate journal space within the
context of a commit.  But

a) If the page has newly allocated space on disk then the metadata which
   refers to that page is already in the journal: no new journal space
   needed.

b) If the page doesn't have space allocated on disk then we don't need
   to write it out at ordered-mode commit time, because the post-recovery
   filesystem will not have any references to that page.

c) If the page is dirty due to overwrite then no metadata update was required.

IOW, under what circumstances would an ordered-mode commit need to allocate
space for a delayed-allocate page?

However b) might lead to the hey-my-file-is-full-of-zeroes problem.

> so, an implementation
> would look like the following?
> 
> generic_writepages() {
>   /* collect set of contig. dirty pages */
>   foo_get_blocks() {
>   foo_journal_start();
>   foo_new_blocks();
>   foo_attach_blocks_to_inode();
>   generic_mark_pages_mapped();
>   foo_journal_stop();
>   }
> }
> 
> another question is will it scale well given number of dirty inodes
> can be much larger than number of inodes with dirty mapped blocks
> (in delayed allocation case, for example) ?

Possibly - zillions of dirty-for-atime inodes might get in the way.  A
short-term fix would be to create a separate dirty-inode list on the
superblock (ug).  A long-term fix is to rip all the per-superblock
dirty-inode lists and use a radix-tree.  Not for lookup purposes, but for
the tree's ability to do tagged and restartable searches.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Andrew Morton
On Fri, 27 Apr 2007 13:31:30 -0600
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> On Apr 27, 2007  08:30 -0700, Linus Torvalds wrote:
> > On a good filesystem, when you do "fsync()" on a file, nothing at all 
> > happens to any other files. On ext3, it seems to sync the global journal, 
> > which means that just about *everything* that writes even a single byte 
> > (well, at least anything journalled, which would be all the normal 
> > directory ops etc) to disk will just *stop* dead cold!
> > 
> > It's horrid. And it really is ext3, not "fsync()".
> > 
> > I used to run reiserfs, and it had its problems, but this was the 
> > "feature" of ext3 that I've disliked most. If you run a MUA with local 
> > mail, it will do fsync's for most things, and things really hickup if you 
> > are doing some other writes at the same time. In contrast, with reiser, if 
> > you did a big untar or some other big write, if somebody fsync'ed a small 
> > file, it wasn't even a blip on the radar - the fsync would sync just that 
> > small thing.
> 
> It's true that this is a "feature" of ext3 with data=ordered (the default),
> but I suspect the same thing is now true in reiserfs too.  The reason is
> that if a journal commit doesn't flush the data as well then a crash will
> result in garbage (from old deleted files) being visible in the newly
> allocated file.  People used to complain about this with reiserfs all the
> time having corrupt data in new files after a crash, which is why I believe
> it was fixed.

People still complain about hey-my-files-are-all-full-of-zeroes on XFS.

> There definitely are some problems with the ext3 journal commit though.
> If the journal is full it will cause the whole journal to checkpoint out
> to the filesystem synchronously even if just space for a small transaction
> is needed.  That is doubly bad if you have a very large journal.  I believe
> Alex has a patch to have it checkpoint much smaller chunks to the fs.
> 

We can make great improvements here, and I've (twice) previously decribed
how: hoist the entire ordered-mode data handling out of ext3, and out of
the buffer_head layer and move it up into the VFS pagecache layer. 
Basically, do ordered-data with a commit-time inode walk, calling
do_sync_mapping_range().

Do it in the VFS.  Make reiserfs use it, remove reiserfs ordered-mode too. 
Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.

And guess what?  We can then partly fix _this_ problem too.  If we're
running a commit on behalf of fsync(inode1) and we come across an inode2
which doesn't have any block allocation metadata in this commit, we don't
need to sync inode2's pages.

Weep.  It's times like this when I want to escape all this patch-wrangling
nonsense and go do some real stuff.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-04-27 Thread Andrew Morton
On Fri, 27 Apr 2007 09:59:27 +0200 Mike Galbraith <[EMAIL PROTECTED]> wrote:

> Greetings,
> 
> As subject states, my GUI is going away for extended periods of time
> when my very full and likely highly fragmented (how to find out)
> filesystem is under heavy write load.  While write is under way, if
> amarok (mp3 player) is running, no song change will occur until write is
> finished, and the GUI can go _entirely_ comatose for very long periods.
> Usually, it will come back to life after write is finished, but
> occasionally, a complete GUI restart is necessary.

I'd be suspecting a GUI bug if a restart is necessary.  Perhaps it went to
lunch for so long in the kernel that some time-based thing went bad.

> The longest comatose period to date was ~20 minutes with 2.6.20.7 a few
> days ago.  I was letting SuSE's software update programs update my SuSE
> 10.2 system, and started a bonnie while it was running (because I had
> been seeing this on recent kernels, and wanted to see if it was in
> stable as well), WHAM, instant dead GUI.  When this happens, kbd and
> mouse events work fine, so I hot-keyed to a VT via CTRL+ALT+F1, and
> killed the bonnie.  No joy, GUI stayed utterly comatose until the
> updater finished roughly 20 minutes later, at which time the shells I'd
> tried to start popped up, and all worked as if nothing bad had ever
> happened.  During the time in between, no window could be brought into
> focus, nada.
> 
> While a bonnie is writing, if I poke KDE's menu button, that will
> instantly trigger nastiness, and a trace (this one was with a cfs
> kernel, but I just did same with virgin 2.6.21) shows that "kicker",
> KDE's launcher proggy does an fdatasync for some reason, and that's the
> end of it's world for ages.  When clicking on amarok's icon, it does an
> fsync, and that's the last thing that will happen in it's world until
> write is done as well.  I've repeated this with CFQ and AS IO
> schedulers.

Well that all sucks.

> I have a couple of old kernels lying around that I can test with, but I
> think it's going to be the same.  Seems to be ext3's journal that is
> causing my woes.  Below this trace of kicker is one of amarok during
> it's dead to the world time.
> 
> Box is 3GHz P4 intel ICH5, SMP/UP doesn't matter.  .config is latest
> kernel tested attached.  Mount options are
> noatime,nodiratime,acl,user_xattr.
> 
> 
> [  308.046646] kickerD 0044 0  5897  1 (NOTLB)
> [  308.052611]f32abe4c 00200082 83398b5a 0044 c01c251e f32ab000 
> f32ab000 c01169b6 
> [  308.060926]f772fbcc cdc7e694 0039 8339857a 0044 83398b5a 
> 0044  
> [  308.069422]c1b5ab00 c1b5aac0 00353928 f32abe80 c01c7ab8  
> f32ab000 c1b5ab10 
> [  308.077927] Call Trace:
> [  308.080568]  [] log_wait_commit+0x9d/0x11f
> [  308.085549]  [] journal_stop+0x1a1/0x22a
> [  308.090364]  [] journal_force_commit+0x1d/0x20
> [  308.095699]  [] ext3_force_commit+0x24/0x26
> [  308.100774]  [] ext3_write_inode+0x2d/0x3b
> [  308.105771]  [] __writeback_single_inode+0x2df/0x3a9
> [  308.111633]  [] sync_inode+0x15/0x38
> [  308.116093]  [] ext3_sync_file+0xbd/0xc8
> [  308.120900]  [] do_fsync+0x58/0x8b
> [  308.125188]  [] __do_fsync+0x20/0x2f
> [  308.129656]  [] sys_fdatasync+0x10/0x12
> [  308.134384]  [] sysenter_past_esp+0x5d/0x81
> [  308.139441]  ===

Right.  One possibility here is that bonnie is stuffing new dirty blocks
onto the committing transaction's ordered-data list and JBD commit is
livelocking.  Only we're not supposed to be putting those blocks on that
list.

Another livelock possibility is that bonnie is redirtying pages faster than
commit can write them out, so commit got livelocked:

When I was doing the original port-from-2.2 I found that an application
which does

for ( ; ; )
pwrite(fd, "", 1, 0);

would permanently livelock the fs.  I fixed that, but it was six years ago,
and perhaps we later unfixed it.

It would be most interesting to try data=writeback.

> [  311.755953] bonnieD 0046 0  6146   5929 (NOTLB)
> [  311.761929]e7622a60 00200082 04d7e5fe 0046 03332bd5  
> e7622000 c02c0c54 
> [  311.770244]d8eaabcc e7622a64 f7d0c3ec 04d7e521 0046 04d7e5fe 
> 0046  
> [  311.778758]e7622aa4 f4f94400 e7622aac e7622a68 c04a2f06 e7622a70 
> c018b105 e7622a8c 
> [  311.787261] Call Trace:
> [  311.789904]  [] io_schedule+0xe/0x16
> [  311.794373]  [] sync_buffer+0x2e/0x32
> [  311.798927]  [] __wait_on_bit_lock+0x3f/0x62
> [  311.804089]  [] out_of_line_wait_on_bit_lock+0x5f/0x67
> [  311.810115]  [] __lock_buffer+0x2b/0x31
> [  311.814846]  [] sync_dirty_buffer+0x88/0xc3
> [  311.819921]  [] journal_dirty_data+0x1dd/0x205
> [  311.825256]  [] ext3_journal_dirty_data+0x12/0x37
> [  311.830858]  [] journal_dirty_data_fn+0x15/0x1c
> [  311.836280]  [] walk_page_buffers+0x36/0x68
> [  311.841347]  [] ext3_ordered_writepage+0x11a/0x

Re: [PATCH] Check for error returned by kthread_create on creating journal thread

2007-04-20 Thread Andrew Morton
On Mon, 16 Apr 2007 11:41:14 +0400
Pavel Emelianov <[EMAIL PROTECTED]> wrote:

> If the thread failed to create the subsequent wait_event
> will hang forever.
> 
> This is likely to happen if kernel hits max_threads limit.
> 
> Will be critical for virtualization systems that limit the
> number of tasks and kernel memory usage within the container.
> 
> 
> [diff-jbd-check-start-journal-thread-return-value  text/plain (1.7KB)]
> --- ./fs/jbd/journal.c.jbdthreads 2007-04-16 11:17:36.0 +0400
> +++ ./fs/jbd/journal.c2007-04-16 11:30:09.0 +0400
> @@ -211,10 +211,16 @@ end_loop:
>   return 0;
>  }
>  
> -static void journal_start_thread(journal_t *journal)
> +static int journal_start_thread(journal_t *journal)
>  {
> - kthread_run(kjournald, journal, "kjournald");
> + struct task_struct *t;
> +
> + t = kthread_run(kjournald, journal, "kjournald");
> + if (IS_ERR(t))
> + return PTR_ERR(t);
> +
>   wait_event(journal->j_wait_done_commit, journal->j_task != 0);
> + return 0;
>  }

Thanks.   Please don't forget those Signed-off-by:s

I assume that you runtime tested this and that the mount failed in
an appropriate fashion?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Copy i_flags to ext3 inode flags on write (version 2)

2007-04-20 Thread Andrew Morton
On Tue, 17 Apr 2007 12:38:55 +0200 Jan Kara <[EMAIL PROTECTED]> wrote:

>   attached is a second version of a patch that stores inode flags such as
> S_IMMUTABLE, S_APPEND, etc. from i_flags to EXT3_I(inode)->i_flags when
> inode is written to disk. The same thing is done on GETFLAGS ioctl.
>   Quota code changes these flags on quota files (to make it harder for
> sysadmin to screw himself) and these changes were not correctly
> propagated into the filesystem (especially, lsattr did not show them and
> users were wondering...). Andrew, could you please put the patch into your
> queue? Thanks.

umm, OK.

What about ext2 and all the zillions of others?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7

2007-04-18 Thread Andrew Morton
> On Wed, 18 Apr 2007 15:54:00 +0200 Valerie Clement <[EMAIL PROTECTED]> wrote:
> 
> Running benchmark tests (FFSB) on an ext4 filesystem, I noticed a 
> performance degradation (about 15-20 percent) in sequential write tests 
> between 2.6.19-rc6 and 2.6.21-rc4 kernels.
> 
> I ran the same tests on ext3 and XFS filesystems and I saw the same 
> performance difference between the two kernel versions for these two 
> filesystems.
> 
> I have also reproduced it between 2.6.20.7 and 2.6.21-rc7.
> The FFSB tests run 16 threads, each creating 1GB files. The tests were 
> done on the same x86_64 system, with the same kernel configuration and 
> on the same scsi device. Below are the throughput values given by FFSB.
> 
>kernel   XFSext3
> --
>   2.6.20.748 MB/sec 44 MB/sec
> 
>   2.6.21-rc7  38 MB/sec 37 MB/sec
> 
> Did anyone else run across the problem?
> Is there a known issue?
> 

That's a new discovery, thanks.

It could be due to I/O scheduler changes.  Which one are you using?  CFQ?

Or it could be that there has been some changed behaviour at the VFS/pagecache
layer: the VFS might be submitting little hunks of lots of files, rather than
large hunks of few files.

Or it could be a block-layer thing: perhaps some driver change has caused
us to be placing less data into the queue.  Which device driver is that machine
using?

Being a simple soul, the first thing I'll try when I get near a test box
will be

for i in $(seq 1 16)
do
time dd if=/dev/zero of=$i bs=1M count=1024 &
done
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Interface for the new fallocate() system call

2007-03-29 Thread Andrew Morton
On Thu, 29 Mar 2007 17:21:26 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> We need to come up with the best possible layout of arguments for the
> fallocate() system call. Various architectures have different
> requirements for how the arguments should look like. Since the mail
> chain has become huge, here is the summary of various inputs received
> so far.
> 
> Platform: s390
> --
> s390 prefers following layout:
> 
>int fallocate(int fd, loff_t offset, loff_t len, int mode)
> 
> For details on why and how "int, int, loff_t, loff_t" is a problem on
> s390, please see Heiko's mail on 16th March. Here is the link:
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg133595.html
> 
> Platform: ppc, arm
> --
> ppc (32 bit) has a problem with "int, loff_t, loff_t, int" layout,
> since this will result in a pad between fd and offset, making seven
> arguments total - which is not supported by ppc32. It supports only
> 6 arguments. Thus the desired layout by ppc32 is:
> 
>int fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> Even ARM prefers above kind of layout. For details please see the
> definition of sys_arm_sync_file_range().

This is a clean-looking option.  Can s390 be changed to support seven-arg
syscalls?

> Option of loff_t => high u32 + low u32
> --
> Matthew and Russell have suggested another option of breaking each
> "loff_t" into two "u32"s. This will result in 6 arguments in total.
> 
> Following think that this is a good alternative:
> Matthew Wilcox, Russell King, Heiko Carstens
> 
> Following do not like this idea:
> Chris Wedgwood

It's a bit weird-looking, but the six-32-bit-args approach is simple
enought to understand and implement.  Presumably the glibc wrapper
would hide that detail from everyone.
 
> 
> What are your thoughts on this ? What layout should we finalize on ?
> Perhaps, since sync_file_range() system call has similar arguments, we
> can take hint from the challenges faced on implementing it on various
> architectures, and decide.
> 


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext[34] EA block reference count racing fix

2007-03-22 Thread Andrew Morton
On Wed, 21 Feb 2007 15:54:10 -0800 Mingming Cao <[EMAIL PROTECTED]> wrote:

> There are race issues around ext[34] xattr block release code.
> 
> ext[34]_xattr_release_block() checks the reference count of xattr block
> (h_refcount) and frees that xattr block if it is the last one reference
> it. Unlike ext2, the check of this counter is unprotected by any lock.
> ext[34]_xattr_release_block() will free the mb_cache entry before
> freeing that xattr block. There is a small window between the check for
> the re h_refcount ==1 and the call to mb_cache_entry_free(). During this
> small window another inode might find this xattr block from the mbcache
> and reuse it, racing a refcount updates.  The xattr block will later be
> freed by the first inode without notice other inode is still use it.
> Later if that block is reallocated as a datablock for other file, then
> more serious problem might happen.
> 
> We need put a lock around places checking the refount as well to avoid
> racing issue. Another place need this kind of protection is in
> ext3_xattr_block_set(), where it will modify the xattr block content in-
> the-fly if the refcount is 1 (means it's the only inode reference it).
> 
> This will also fix another issue: the xattr block may not get freed at
> all if no lock is to protect the refcount check at the release time. It
> is possible that the last two inodes could release the shared xattr
> block at the same time. But both of them think they are not the last one
> so only decreased the h_refcount without freeing xattr block at all.
> 
> We need to call lock_buffer() after ext3_journal_get_write_access() to
> avoid deadlock (because the later will call lock_buffer()/unlock_buffer
> () as well).
> 
> 
> Signed-Off-By: Mingming Cao <[EMAIL PROTECTED]>
> ---
> 
>  linux-2.6.19-ming/fs/ext3/xattr.c |   42 
> +++---
>  linux-2.6.19-ming/fs/ext4/xattr.c |   41 
> ++---
>  2 files changed, 51 insertions(+), 32 deletions(-)
> 
> diff -puN fs/ext3/xattr.c~ext34_xattr_release_block_race_fix fs/ext3/xattr.c
> --- linux-2.6.19/fs/ext3/xattr.c~ext34_xattr_release_block_race_fix   
> 2007-02-14 16:40:48.0 -0800
> +++ linux-2.6.19-ming/fs/ext3/xattr.c 2007-02-21 11:45:10.0 -0800
> @@ -478,8 +478,15 @@ ext3_xattr_release_block(handle_t *handl
>struct buffer_head *bh)
>  {
>   struct mb_cache_entry *ce = NULL;
> + int error = 0;
>  
>   ce = mb_cache_entry_get(ext3_xattr_cache, bh->b_bdev, bh->b_blocknr);
> + error = ext3_journal_get_write_access(handle, bh);
> + if (error)
> +  goto out;
> +
> + lock_buffer(bh);
> +
>   if (BHDR(bh)->h_refcount == cpu_to_le32(1)) {
>   ea_bdebug(bh, "refcount now=0; freeing");
>   if (ce)
> @@ -488,21 +495,20 @@ ext3_xattr_release_block(handle_t *handl
>   get_bh(bh);
>   ext3_forget(handle, 1, inode, bh, bh->b_blocknr);
>   } else {
> - if (ext3_journal_get_write_access(handle, bh) == 0) {
> - lock_buffer(bh);
> - BHDR(bh)->h_refcount = cpu_to_le32(
> + BHDR(bh)->h_refcount = cpu_to_le32(
>   le32_to_cpu(BHDR(bh)->h_refcount) - 1);
> - ext3_journal_dirty_metadata(handle, bh);
> - if (IS_SYNC(inode))
> - handle->h_sync = 1;
> - DQUOT_FREE_BLOCK(inode, 1);
> - unlock_buffer(bh);
> - ea_bdebug(bh, "refcount now=%d; releasing",
> -   le32_to_cpu(BHDR(bh)->h_refcount));
> - }
> + error = ext3_journal_dirty_metadata(handle, bh);
> + handle->h_sync = 1;
> + DQUOT_FREE_BLOCK(inode, 1);
> + ea_bdebug(bh, "refcount now=%d; releasing",
> +   le32_to_cpu(BHDR(bh)->h_refcount));
>   if (ce)
>   mb_cache_entry_release(ce);
>   }
> + unlock_buffer(bh);
> +out:
> + ext3_std_error(inode->i_sb, error);
> + return;
>  }
>  
>  struct ext3_xattr_info {
> @@ -678,7 +684,7 @@ ext3_xattr_block_set(handle_t *handle, s
>   struct buffer_head *new_bh = NULL;
>   struct ext3_xattr_search *s = &bs->s;
>   struct mb_cache_entry *ce = NULL;
> - int error;
> + int error = 0;
>  
>  #define header(x) ((struct ext3_xattr_header *)(x))
>  
> @@ -687,16 +693,17 @@ ext3_xattr_block_set(handle_t *handle, s
>   if (s->base) {
>   ce = mb_cache_entry_get(ext3_xattr_cache, bs->bh->b_bdev,
>   bs->bh->b_blocknr);
> + error = ext3_journal_get_write_access(handle, bs->bh);
> + if (error)
> + goto cleanup;
> + lock_buffer(bs->bh);
> +
>   if (header(s->base)->h_refcount == cpu_to_le32(1)) {
>   if (ce) {
>

Re: dead code: EXT3_IOC_WAIT_FOR_READONLY?

2007-03-16 Thread Andrew Morton
On Fri, 16 Mar 2007 14:16:28 +0300 Vasily Averin <[EMAIL PROTECTED]> wrote:

> Hi all,
> 
> I've found some strange ext3 ioctl under CONFIG_JBD_DEBUG:
> EXT3_IOC_WAIT_FOR_READONLY.
> 
> I had DEBUG_SPINLOCK enabled and my kernel was crashed on wait-for-readonly.c
> test from ext3-tools because ioctl handler tries to use uninitialized
> EXT3_SB(sb)->ro_wait_queue and waits EXT3_SB(sb)->turn_ro_timer that nobody 
> uses
> in kernel.
> 
> I do not understand the function of this ioctl and believe this code is dead
> long time ago. From my POV it makes sense to remove it from kernel instead of
> fixing it. Has somebody any objections?

It exists to support my original ext3 recovery-testing code.  The idea is
that you mount the filesytem with ro_after=1000, then after 1000 jiffies a
timer fires and the underlying device driver starts ignoring writes.  You
then run the EXT3_IOC_WAIT_FOR_READONLY ioctl to wait for the timer to have
fired, then kill off the the stresstest programs, then unmount the fs, then
fsck it.


Yes, it's kind of dead code now - I don't think anyone has used it since
2001 or so.

We do need to resurrect the functionality: it was very powerful and found
several problems.

People recently reported serious-looking ext3 recovery bugs but nobody
seems interested in investigating them.

A reimplementation probaby wouldn't use any of the old code so sure, feel
free to send in a patch which removes it.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: Re: 2.6.21-rc3-mm1

2007-03-13 Thread Andrew Morton
> On Tue, 13 Mar 2007 10:29:05 -0500 Dave Kleikamp <[EMAIL PROTECTED]> wrote:
> On Tue, 2007-03-13 at 10:05 -0500, Eric Sandeen wrote:
> > Andrew Morton wrote:
> > 
> > > And broken stuff too :-)
> > > The nanoseconds patch is broken on x86_64 - makes mtimes from the future:
> > > e.g. year 2431. I suspect an endianness issue.
> > > x86 works fine according to my sources.
> > >
> > > The files themselves have correct mtimes, as booting previous kernel
> > > or one w/o the nanoseconds patch works fine.
> > There were later patches posted to the list after this version, I think, 
> > which
> > are not yet in Ted's tree... I'll find some time today to test the "take3" 
> > version
> > on x86_64, unless someone beats me to it.
> 
> I didn't quite beat you to it, but I did make a diff to bring
> 2.6.21-rc3-mm1 in tune with the "take3" patch.  It makes the code easier
> to understand, but I'm not sure if it contains anything to fix the bug.
> Code inspection hasn't gotten me any closer to figuring out what's
> wrong.
> 
> Shaggy

It's possible that I screwed up merging Ted's tree into mainline - it spat
some rejects, and needs updating to 2.6.21-rc3.


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fw: Re: 2.6.21-rc3-mm1

2007-03-12 Thread Andrew Morton


Begin forwarded message:

Date: Mon, 12 Mar 2007 19:14:17 +0100
From: "Radoslaw Szkodzinski" <[EMAIL PROTECTED]>
To: "Andrew Morton" <[EMAIL PROTECTED]>, "Theodore T'so" <[EMAIL PROTECTED]>
Cc: linux-kernel@vger.kernel.org
Subject: Re: 2.6.21-rc3-mm1


On 3/8/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> - Re-added the ext4 development tree to the -mm lineup.  It has stuff in
>   it.
>

And broken stuff too :-)
The nanoseconds patch is broken on x86_64 - makes mtimes from the future:
e.g. year 2431. I suspect an endianness issue.
x86 works fine according to my sources.

The files themselves have correct mtimes, as booting previous kernel
or one w/o the nanoseconds patch works fine.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fw: EXT3-fs warning (device sdd2): dx_probe: Unrecognised inode hash code 232

2007-03-09 Thread Andrew Morton


Begin forwarded message:

Date: Thu, 8 Mar 2007 18:35:54 +0100
From: Jan De Luyck <[EMAIL PROTECTED]>
To: linux-kernel@vger.kernel.org
Subject: EXT3-fs warning (device sdd2): dx_probe: Unrecognised inode hash code 
232


Hello,

Running 2.6.19.1 on AMD64.

While copying some files on an ext3 partition, I got this in the logs:

EXT3-fs warning (device sdd2): dx_probe: Unrecognised inode hash code 232
Assertion failure in dx_probe() at fs/ext3/namei.c:384: "dx_get_limit(entries) 
== dx_root_limit(dir, root->info.info_length)"
--- [cut here ] - [please bite here ] -
Kernel BUG at fs/ext3/namei.c:384
invalid opcode:  [1] PREEMPT SMP
CPU 0
Modules linked in: xt_multiport iptable_filter ip_tables x_tables binfmt_misc 
fuse parport_pc lp parport thermal fan button nls_iso8859_15 ntfs kqemu(P) 
tcp_diag inet_diag w83627ehf i2c_isa cpufreq_ondemand powernow_k8 freq_table 
processor snd_hda_intel snd_hda_codec snd_pcm evdev snd_timer psmouse snd 
pcspkr floppy ehci_hcd soundcore snd_page_alloc usbcore k8temp hwmon ohci1394 
ieee1394 forcedeth i2c_nforce2 i2c_core sg sr_mod cdrom
Pid: 4018, comm: find Tainted: P  2.6.19.5 #1
RIP: 0010:[]  [] dx_probe+0x18b/0x320
RSP: 0018:81006b789d28  EFLAGS: 00010292
RAX: 0081 RBX: 810066f07018 RCX: c100
RDX: 81006b788000 RSI: 81006b5c62c0 RDI: 
RBP:  R08: 81006b788000 R09: 0001
R10:  R11: 80212a20 R12: 810066e6f8e8
R13:  R14:  R15: 810066f06730
FS:  2ae8c9f5c6d0() GS:80586000() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 2af535db8086 CR3: 6b7b2000 CR4: 06e0
Process find (pid: 4018, threadinfo 81006b788000, task 81006b5c62c0)
Stack:  81006b789dd4 81006b789d88 81007f79d750 810066e6acc0
  810066f06730 810066f06730 
 81007efca580 802f13ab 810066e6ace8 7ff98780
Call Trace:
 [] ext3_htree_fill_tree+0xab/0x1e0
 [] ext3_readdir+0x1a9/0x530
 [] vfs_readdir+0x73/0xc0
 [] sys_getdents64+0x7f/0xd0
 [] system_call+0x7e/0x83
 [<2ae8c9dad41a>]


Code: 0f 0b 68 f0 0c 48 80 c2 80 01 48 8b 54 24 08 48 89 54 24 10
RIP  [] dx_probe+0x18b/0x320
 RSP 

Yeah, I know it's tainted - just wanted to hear if anyone saw this one before. 
System seems to be fine otherwise, I'll run a fsck on the filesystem lateron.

Jan

-- 
Common sense is instinct, and enough of it is genius.
-- Josh Billings
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qla2xxx BUG: workqueue leaked lock or atomic

2007-03-07 Thread Andrew Morton
On Wed, 7 Mar 2007 18:09:55 +0100 Andre Noll <[EMAIL PROTECTED]> wrote:

> On 20:39, Andrew Morton wrote:
> > On Wed, 28 Feb 2007 16:37:22 +0100 Andre Noll <[EMAIL PROTECTED]> wrote:
> > 
> > > On 16:18, Andre Noll wrote:
> > > 
> > > > With 2.6.21-rc2 I am unable to reproduce this BUG message. However,
> > > > writing to both raid systems at the same time via lvm still locks up
> > > > the system within minutes.
> > > 
> > > Screenshot of the resulting kernel panic:
> > > 
> > >   http://systemlinux.org/~maan/shots/kernel-panic-21-rc2-huangho2.png
> > > 
> > 
> > It died in CFQ.  Please try a different IO scheduler.  Use something
> > like
> > 
> > echo deadline > /sys/block/sda/queue/scheduler
> > 
> > This could still be the old qla2xxx bug, or it could be a new qla2xxx bug,
> > or it could be a block bug, or it could be an LVM bug.
> 
> OK. I'm running with deadline right now. But I guess this kernel
> panic was caused by an LVM bug because lockdep reported problems with
> LVM. Nobody responded to my bug report on the LVM mailing list (see
> http://www.redhat.com/archives/linux-lvm/2007-February/msg00102.html).
> 
> Non-working snapshots and no help from the mailing list convinced me
> to ditch the lvm setup [1] in favour of linear software raid. This
> means I can't do lvm-related tests any more.

Sigh.

> BTW: Are ext3 filesystem sizes greater than 8T now officially
> supported?

I think so, but I don't know how much 16TB testing developers and
distros are doing - perhaps the linux-ext4 denizens can tell us?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Andrew Morton
On Fri, 02 Mar 2007 08:13:00 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> > 
> > > What about 
> > > if the
> > > blocks already exists ? What would be return values in those cases ?
> > 
> > 0 on success, other normal errors oetherwise..
> > 
> > If asked for a range that includes already-allocated blocks, you just 
> > allocate any non-allocated blocks in the range, I think.
> 
> Yes. What I was trying to figure out is, if there is a requirement that
> interface need to return exact number of bytes it *really* allocated
> (like write() or read()). I can't think of any, but just wanted to
> through it out..

Hopefully not, because posix didn't anticipate that.

We could of course return a positive number on success, but it'd get
tricky on 32-bit machines.

> BTW, what is the interface for finding out what is the size of the
> pre-allocated file ? 

stat.st_blocks?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Andrew Morton
On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> Just curious .. What does posix_fallocate() return ?

bookmark this:

http://www.opengroup.org/onlinepubs/009695399/nfindex.html

Upon successful completion, posix_fallocate() shall return zero;
otherwise, an error number shall be returned to indicate the error.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Andrew Morton
On Thu, 01 Mar 2007 22:44:16 +
Dave Kleikamp <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> > On Fri, 2 Mar 2007 00:04:45 +0530
> > "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> 
> > > +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> > > +{
> > > + struct file *file;
> > > + struct inode *inode;
> > > + long ret = -EINVAL;
> > > + file = fget(fd);
> > > + if (!file)
> > > + goto out;
> > > + inode = file->f_path.dentry->d_inode;
> > > + if (inode->i_op && inode->i_op->fallocate)
> > > + ret = inode->i_op->fallocate(inode, offset, len);
> > > + else
> > > + ret = -ENOTTY;
> > > + fput(file);
> > > +out:
> > > +return ret;
> > > +}
> > 
> 
> > ENOTTY is a bit unconventional - we often use EINVAL for this sort of
> > thing.  But EINVAL has other meanings for posix_fallocate() and isn't
> > really appropriate here anyway.  So I'm not sure what would be better...
> 
> Would EINVAL (or whatever) make it back to the caller of
> posix_fallocate(), or would glibc fall back to its current
> implementation?
> 
> Forgive me if I haven't put enough thought into it, but would it be
> useful to create a generic_fallocate() that writes zeroed pages for any
> non-existent pages in the range?  I don't know how glibc currently
> implements posix_fallocate(), but maybe the kernel could do it more
> efficiently, even in generic code.  Maybe we don't care, since the major
> file systems can probably do something better in their own code.

Given that glibc already implements fallocate for all filesystems, it will
need to continue to do so for filesystems which don't implement this
syscall - otherwise applications would start breaking.

However with this kernel change, glibc will need to look at the errno,
so that it can correctly propagate EIO, ENOSPC and whatever.  So we will
need to return a reliable and stable and sensible value so that glibc knows
when it should emulate and when it should propagate.

Perhaps Ulrich can comment.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Andrew Morton
On Fri, 02 Mar 2007 09:40:54 +1100
Nathan Scott <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> > On Fri, 2 Mar 2007 00:04:45 +0530
> > "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> > 
> > > This is to give a heads up on few patches that we will be soon coming up
> > > with. These patches implement a new system call sys_fallocate() and a
> > > new inode operation "fallocate", for persistent preallocation. The new
> > > system call, as Andrew suggested, will look like:
> > > 
> > >   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> > ...
> > 
> > I'd agree with Eric on the "command" flag extension.
> 
> Seems like a separate syscall would be better, "command" sounds
> a bit ioctl like, especially if that command is passed into the
> filesystems..
> 

madvise, fadvise, lseek, etc seem to work OK.

I get repeatedly traumatised by patch rejects whenever a new syscall gets
added, so I'm biased.

The advantage of a command flag is that we can add new modes in the future
without causing lots of churn, waiting for arch maintainers to catch up,
potentially adding new compat code, etc.

Rename it to "mode"? ;)

I'm inclined to merge this patch nice and early, so the syscall number is
stabilised.  Otherwise the people who are working on out-of-tree code (ie:
ext4) will have to keep playing catchup.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Andrew Morton
On Fri, 2 Mar 2007 00:04:45 +0530
"Amit K. Arora" <[EMAIL PROTECTED]> wrote:

> This is to give a heads up on few patches that we will be soon coming up
> with. These patches implement a new system call sys_fallocate() and a
> new inode operation "fallocate", for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
> 
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

It is intended that glibc use this same syscall for both posix_fallocate()
and posix_fallocate64().

I'd agree with Eric on the "command" flag extension.

That new argument might need to come after "fd" - ARM has funny requirements on
syscall arg padding and layout.

> +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> +{
> + struct file *file;
> + struct inode *inode;
> + long ret = -EINVAL;
> + file = fget(fd);
> + if (!file)
> + goto out;
> + inode = file->f_path.dentry->d_inode;
> + if (inode->i_op && inode->i_op->fallocate)
> + ret = inode->i_op->fallocate(inode, offset, len);
> + else
> + ret = -ENOTTY;
> + fput(file);
> +out:
> +return ret;
> +}

Please always put a blank line between the variable definitions and the
first statement.

Please always use hard tabs, not bunch-of-spaces.  This seems to happening
rather a lot in the ext4 patches.  It's a trivial thing, but also trivial
to fix.  A grep across the diffs is needed.

ENOTTY is a bit unconventional - we often use EINVAL for this sort of
thing.  But EINVAL has other meanings for posix_fallocate() and isn't
really appropriate here anyway.  So I'm not sure what would be better...

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1][RFC] prepare_write positive return value V(2)

2007-02-25 Thread Andrew Morton
Could someone please review/test/comment upon this work?

I don't think ext[34] is an area where I need to do everything myself ;)

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH 1/1] nanosecond timestamps

2007-02-25 Thread Andrew Morton
> On Fri, 02 Feb 2007 20:09:40 +0530 Kalpak Shah <[EMAIL PROTECTED]> wrote:
> +#define EXT3_INODE_GET_XTIME(xtime, extra_xtime, inode, raw_inode)
> \
> +do {
> \
> + (inode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);
> \
> +   \
> + if (offsetof(typeof(*raw_inode), extra_xtime) -
> \
> + offsetof(typeof(*raw_inode), i_extra_isize) +
> \
> + sizeof((raw_inode)->extra_xtime) <=
> \
> + le16_to_cpu((raw_inode)->i_extra_isize)) {
> \
> + if (sizeof((inode)->xtime.tv_sec) > 4)\
> + (inode)->xtime.tv_sec |=  \
> + (__u64)(le32_to_cpu((raw_inode)->extra_xtime) &   \
> + EXT3_EPOCH_MASK) << 32;   \
> + (inode)->xtime.tv_nsec =  \
> + (le32_to_cpu((raw_inode)->extra_xtime) &  \
> + EXT3_NSEC_MASK) >> 2; \
> + }
> \
> +} while (0)

ow, my eyes.  Can we find a way to do this in C rather than in cpp?

> +static inline struct timespec ext3_current_time(struct inode *inode)
> +{
> + return (inode->i_sb->s_time_gran < 10) ?

NSEC_PER_SEC

> + current_fs_time(inode->i_sb) : CURRENT_TIME_SEC;
> +}
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Resubmit][PATCH 5/5] Secure Deletion and Trash-Bin Support for Ext4

2007-02-25 Thread Andrew Morton
> On Wed, 31 Jan 2007 11:11:10 -0600 Dave Kleikamp <[EMAIL PROTECTED]> wrote:
> Have you considered putting ALL of the function in the vfs layer?

Yes, I think that's where it should be implemented.

I suspect we'll run into problems with the .trash/ thing.  What directory
do we put .trash in?  The top-level of the mountpoint?   The user might not
be able to access that if he's in a chroot.  And UIDs are now per-nsproxy
things - UIDs are already per-uid-namespace things. So we can have two or more
different users using UID 42.

It all need to be thrashed out with the containerisation and VFS guys.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Resubmit][Patch 0/2] Persistent preallocation in ext4

2007-02-25 Thread Andrew Morton
> On Wed, 17 Jan 2007 15:16:58 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> (1) The final interface is yet to be decided. We have the option of
> chosing from one of these:
>   a> modifying posix_fallocate() in glibc
>   b> using fcntl
>   c> using ftruncate, or
>   d> using the ioctl interface.

I'd suggest we add a new syscall, which implements the posix_fallocate()
API as per the relevant specs.  We can work the final details out with Ulrich
and I'm sure that glibc's posix_fallocate() will start using the syscall if it
is available.

Actually, we should implement

 asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

note: loff_t, not off_t.

This probably means that we'll need to implement file_operations.fallocate().

It wouldn't surprise me if XFS was able to implement fallocate() too.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: loopback mount EXT3 oops

2007-02-18 Thread Andrew Morton
On Mon, 19 Feb 2007 10:52:42 +1100 Rusty Russell <[EMAIL PROTECTED]> wrote:

> Hi all,
> 
>   This happened on 2.6.20-rc6-mm3, so I upgraded to 2.6.20-git14, and
> same thing happened.  Doing a forced fsck on the (corrupted by
> double-mounting I think) filesystem fixed it, but I took a copy before
> doing that.
> 
> Do we care about ext3 barfing on corrupted filesystems?

yup.

> [ 1329.782923] EXT3 FS on loop0, internal journal
> [ 1329.782926] EXT3-fs: recovery complete.
> [ 1329.783186] EXT3-fs: mounted filesystem with ordered data mode.
> [ 1329.921992] EXT3-fs warning (device loop0): ext3_unlink: Deleting 
> nonexistent file (842294), 0
> [ 1329.922516] EXT3-fs warning (device loop0): ext3_unlink: Deleting 
> nonexistent file (842295), 0
> [ 1329.922524] list_add corruption. next->prev should be prev (f6705c9c), but 
> was f21a886c. (next=f21a886c).
> [ 1329.922546] [ cut here ]
> [ 1329.922548] kernel BUG at lib/list_debug.c:27!
> [ 1329.922550] invalid opcode:  [#1]
> [ 1329.922551] Modules linked in: loop binfmt_misc i915 ipv6 drm 
> cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table 
> cpufreq_conservative video thermal processor fan button battery asus_acpi 
> backlight ac tsdev keyspan usbserial usbhid af_packet evdev psmouse 
> snd_hda_intel snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer 
> ehci_hcd rtc uhci_hcd iTCO_wdt iTCO_vendor_support snd soundcore 
> snd_page_alloc e1000 intel_agp agpgart usbcore
> [ 1329.922574] CPU:0
> [ 1329.922575] EIP:0060:[]Not tainted VLI
> [ 1329.922576] EFLAGS: 00010286   (2.6.20-git14 #25)
> [ 1329.922581] EIP is at __list_add+0x40/0x60
> [ 1329.922583] eax: 0070   ebx: f6705c9c   ecx: c011d791   edx: f6352ac0
> [ 1329.922585] esi: f21a88cc   edi: f2056a54   ebp: f2169eac   esp: f2169e94
> [ 1329.922587] ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
> [ 1329.922589] Process rm (pid: 3494, ti=f2168000 task=f6352ac0 
> task.ti=f2168000)
> [ 1329.922591] Stack: c03a437c f6705c9c f21a886c f21a886c  f21a88cc 
> f2169eb4 c01f0bca 
> [ 1329.922597]f2169ef4 c01ba4fc f2169ed4 f219a5c8 f21a8b54 f21a8c58 
> f69683c4 f69683c4 
> [ 1329.922602]f21a87c8 f21a886c f219aba0 0300 00ce  
> f21a88cc f21a8c58 
> [ 1329.922607] Call Trace:
> [ 1329.922609]  [] show_trace_log_lvl+0x1a/0x30
> [ 1329.922612]  [] show_stack_log_lvl+0xb1/0xe0
> [ 1329.922615]  [] show_registers+0x1d0/0x2d0
> [ 1329.922618]  [] die+0xf9/0x220
> [ 1329.922621]  [] do_trap+0x82/0xb0
> [ 1329.922623]  [] do_invalid_op+0x97/0xb0
> [ 1329.922626]  [] error_code+0x74/0x7c
> [ 1329.922630]  [] list_add+0xa/0x10
> [ 1329.922632]  [] ext3_orphan_add+0x17c/0x210
> [ 1329.922636]  [] ext3_unlink+0x127/0x1b0
> [ 1329.922638]  [] vfs_unlink+0x88/0xe0
> [ 1329.922642]  [] do_unlinkat+0xb8/0x140
> [ 1329.922645]  [] sys_unlink+0x10/0x20
> [ 1329.922647]  [] sysenter_past_esp+0x69/0xb1
> [ 1329.922650]  ===

I thought Eric fixed that.  Maybe he broke it instead ;)
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: data=journal busted

2007-02-16 Thread Andrew Morton
On Fri, 16 Feb 2007 16:31:09 -0700
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> > I suspect we should resurrect and formalise my old
> > make-the-disk-stop-accepting-writes-when-a-timer-goes-off thing.  It was
> > very useful for stress-testing recovery.
> 
> We have a patch that we use for Lustre testing which allows you to set a
> block device readonly (silently discarding all writes), without the
> filesystem immediately keeling over dead like set_disk_ro.  The readonly
> state persists until the the last reference on the block device is dropped,
> so there are no races w.r.t. VFS cleanup of inodes and flushing buffers
> after the filesystem is unmounted.

Not sure I understand all that.

For this application, we *want* to expose VFS races, errors in handling
EIO, errors in handling lost writes, etc.  It's another form of for-developers
fault injection, not a thing-for-production.

The reason I prefer doing it from the timer interrupt is to toss more
randomness in there, avoid the possibility of getting synchronised
with application or kernel activity in some fashion.

I don't know if there's much value in that, but it provides peace-of-mind.


I'm now seeing reports that ordered-data is corrupting data and metadata,
as is data=writeback.  I'm hoping that the blame lies with the
allededly-battery-backed RAID controller, but it could be ext3.  Has anyone
actually done any decent recovery testing in the past half decade?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: data=journal busted

2007-02-16 Thread Andrew Morton
On Fri, 16 Feb 2007 14:50:22 -0800
Randy Dunlap <[EMAIL PROTECTED]> wrote:

> On Thu, 15 Feb 2007 20:44:45 -0800 Andrew Morton wrote:
> 
> > 
> > I have a report from a google person who just did some basic
> > power-it-off-during-a-write testing on 2.6.20's ext3.  ordered-data is OK,
> > but data=journal came back with crap in the file data.
> > 
> > Is anyone doing any formal recovery stress-testing?
> > 
> > I suspect we should resurrect and formalise my old
> > make-the-disk-stop-accepting-writes-when-a-timer-goes-off thing.  It was
> > very useful for stress-testing recovery.
> 
> Where is your make-troubles patch, please sir?

gee, it's all buried in the ext3 debug patch




But that's just the concept.  I think the core should be formalisation of the
concept, so that other journalling filesystems can use it.

And that concept is: make a device stop accepting writes when an interrupt
goes off.  I think using a timer is appropriate, and I think setting that
timer at [re]mount-time is suitable.  So it'd be a matter of hooking that
into the block layer (or the drivers) at the appropriate point, exposing
the APIs to the filesystems and then wiring up ext4.

Perhaps it has some synergy with the new fault-injection code, dunno.



otoh, perhaps filesystems don't need to be involved at all.  Just do

echo 42 > /sys/block/sda3/reject-writes-after-this-many-milliseconds

while [ $(cat /sys/block/sda3/reject...) -gt 0 ]
do
sleep 1
done













 drivers/ide/ide-io.c |   23 ++
 fs/Kconfig   |4 
 fs/Makefile  |2 
 fs/buffer.c  |  103 -
 fs/direct-io.c   |1 
 fs/ext3/inode.c  |7 
 fs/ext3/namei.c  |   40 -
 fs/ext3/super.c  |   82 ++
 fs/ext3/xattr.c  |4 
 fs/jbd-kernel.c  |  255 +
 fs/jbd/commit.c  |   17 +-
 fs/jbd/transaction.c |   45 +++--
 include/linux/buffer-trace.h |   83 ++
 include/linux/buffer_head.h  |4 
 include/linux/jbd.h  |   24 +--
 15 files changed, 650 insertions(+), 44 deletions(-)

diff -puN drivers/ide/ide-io.c~ext3-debug drivers/ide/ide-io.c
--- a/drivers/ide/ide-io.c~ext3-debug
+++ a/drivers/ide/ide-io.c
@@ -999,6 +999,29 @@ static ide_startstop_t start_request (id
HWIF(drive)->name, (unsigned long) rq);
 #endif
 
+#ifdef CONFIG_JBD_DEBUG
+   {
+   /*
+* Silently stop writing to this disk to simulate a crash.
+*/
+   extern struct block_device *journal_no_write[2];
+   int i;
+
+   if (!(rq->cmd_flags & REQ_RW))
+   goto its_a_read;
+
+   for (i = 0; i < 2; i++) {
+   struct block_device *bdev = journal_no_write[i];
+   if (bdev && bdev_get_queue(bdev) == rq->q) {
+   printk("%s: write suppressed\n", __FUNCTION__);
+   ide_end_request(drive, 1, rq->nr_sectors);
+   return ide_stopped;
+   }
+   }
+   }
+its_a_read:
+   ;
+#endif
/* bail early if we've exceeded max_failures */
if (drive->max_failures && (drive->failures > drive->max_failures)) {
goto kill_rq;
diff -puN fs/Kconfig~ext3-debug fs/Kconfig
--- a/fs/Kconfig~ext3-debug
+++ a/fs/Kconfig
@@ -265,6 +265,10 @@ config JBD2_DEBUG
  generated.  To turn debugging off again, do
  "echo 0 > /proc/sys/fs/jbd2-debug".
 
+config BUFFER_DEBUG
+   bool "buffer-layer tracing"
+   depends on JBD
+
 config FS_MBCACHE
 # Meta block cache for Extended Attributes (ext2/ext3/ext4)
tristate
diff -puN fs/Makefile~ext3-debug fs/Makefile
--- a/fs/Makefile~ext3-debug
+++ a/fs/Makefile
@@ -31,6 +31,8 @@ obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout
 obj-$(CONFIG_BINFMT_EM86)  += binfmt_em86.o
 obj-$(CONFIG_BINFMT_MISC)  += binfmt_misc.o
 
+obj-$(CONFIG_BUFFER_DEBUG) += jbd-kernel.o
+
 # binfmt_script is always there
 obj-y  += binfmt_script.o
 
diff -puN fs/buffer.c~ext3-debug fs/buffer.c
--- a/fs/buffer.c~ext3-debug
+++ a/fs/buffer.c
@@ -35,6 +35,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -53,6 +55,7 @@ init_buffer(struct buffer_head *bh, bh_e
 {
bh->b_end_io = handler;
bh->b_private = private;
+   buffer_trace_init(&bh->b_history);
 }
 
 static int sync_buffer(void *word)
@@ -61,6 +64,7 @@ static int sync_buffer(void *word)
struct b

Re: booked-page-flag.patch

2007-02-16 Thread Andrew Morton
On Fri, 16 Feb 2007 10:56:55 +0300 Alex Tomas <[EMAIL PROTECTED]> wrote:

> >>>>> Andrew Morton (AM) writes:
> 
>  AM> Well, one could just assume that the page has no disk mapping and go and
>  AM> make the space reservation.  Things will work out OK when we come to do
>  AM> writepage().
> 
>  AM> Or one could do both: call get_block() only if the page was inside 
> i_size.
> 
> well, even so we need to reserve not that block ony, but
> also needed metadata (for the worst case).

Right.  Reserve all the blocks for the page and all the metadata for a page
at that file offset.  Worst case.

> probably this
> is work for get_block or some different method?

umm, when I did this I think I added a new ->reservepage address_space op
and called that from within the VFS's delalloc_prepare_write().

> anyway,
> we have to call it if the page is being written partial. 

Not necessarily.  If we're operating in nobh mode that page is brought
uptodate in prepare_write().

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: booked-page-flag.patch

2007-02-15 Thread Andrew Morton
On Fri, 16 Feb 2007 10:30:39 +0300 Alex Tomas <[EMAIL PROTECTED]> wrote:

> >>>>> Andrew Morton (AM) writes:
> 
>  -> get_block(with BH_Delay) can be used to signal
>  >> filesystem that no actual allocation is required.
>  >> so, aware filesystem can just reserve space. then
>  -> writepages() should walk through the pages like
>  >> it does currently, collect contiugous sequences
>  >> and again call ->get_block(w/o BH_Delay) with b_size
>  >> covering all contiguous pages ...
>  >> 
> 
>  AM> That sounds like it'd work, yes.
> 
>  AM> Except for an address_space which is using delayed allocation, its
>  -> prepare_write wouldn't call get_block at all, so perhaps that isn't
>  AM> needed.
> 
> hmm. I thought it has to call get_block() at least to know whether
> the block is already allocated. and I was going to reserve space
> in prepare_write for which some fs-specific method is needed becase
> only fs knows how much metadata it'd need.

Well, one could just assume that the page has no disk mapping and go and
make the space reservation.  Things will work out OK when we come to do
writepage().

Or one could do both: call get_block() only if the page was inside i_size.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


data=journal busted

2007-02-15 Thread Andrew Morton

I have a report from a google person who just did some basic
power-it-off-during-a-write testing on 2.6.20's ext3.  ordered-data is OK,
but data=journal came back with crap in the file data.

Is anyone doing any formal recovery stress-testing?

I suspect we should resurrect and formalise my old
make-the-disk-stop-accepting-writes-when-a-timer-goes-off thing.  It was
very useful for stress-testing recovery.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: booked-page-flag.patch

2007-02-15 Thread Andrew Morton
On Fri, 16 Feb 2007 00:07:55 +0300
Alex Tomas <[EMAIL PROTECTED]> wrote:

> >>>>> Andrew Morton (AM) writes:
> 
>  AM> If the page doesn't have buffer-heads, set PG_private and clear 
> page->private
> 
>  AM> If the page does have buffer_heads, use BH_Delay.
> 
> I did exactly this way in the first version, but later
> got feeling that the community'd discard "ugly hack".

These things can be wrapped in nice helper functions in header files.  that
allows us to easily reimplement if a better idea comes along.

And we already do all sorts of horrid things to save space in the page
struct.

> one more question: how much of it you want in VFS?

As much as possible, really.  It's probably too late for XFS to reuse any
new infrastructure, but if the infrastructure is well-designed it _should_
be possible to us it in new filesytems, and to convert ext2.

> ->get_block(with BH_Delay) can be used to signal
> filesystem that no actual allocation is required.
> so, aware filesystem can just reserve space. then
> ->writepages() should walk through the pages like
> it does currently, collect contiugous sequences
> and again call ->get_block(w/o BH_Delay) with b_size
> covering all contiguous pages ...
> 

That sounds like it'd work, yes.

Except for an address_space which is using delayed allocation, its
->prepare_write wouldn't call get_block at all, so perhaps that isn't
needed.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: booked-page-flag.patch

2007-02-15 Thread Andrew Morton
On Thu, 15 Feb 2007 20:30:43 +0300 Alex Tomas <[EMAIL PROTECTED]> wrote:

> >>>>> Eric Sandeen (ES) writes:
> 
>  ES> Andrew Morton wrote:
>  >> Sorry, we're seriously, seriously, seriously short on flags in the page
>  >> struct and this patch is going to need one heck of a good case for it to 
> be
>  >> acceptable.
> 
>  ES> This was for the delayed allocation patchset, right; and by managing
>  ES> this at the page level that means we can't do this for block size <
>  ES> page size, I think...
> 
> I still think that having PG_booked and special code to handle
> case when blocksize==pagesize is worthwhile -- we save 56 bytes
> on 32bit and 104 bytes on 64bit for every written page.
> 

If the page doesn't have buffer-heads, set PG_private and clear page->private

If the page does have buffer_heads, use BH_Delay.


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: booked-page-flag.patch

2007-02-15 Thread Andrew Morton
On Thu, 15 Feb 2007 09:18:34 -0800 Eric Sandeen <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > Sorry, we're seriously, seriously, seriously short on flags in the page
> > struct and this patch is going to need one heck of a good case for it to be
> > acceptable.
> 
> This was for the delayed allocation patchset, right; and by managing 
> this at the page level that means we can't do this for block size < page 
> size, I think...

If the page is partially mapped to disk then we can still reserve an entire
page's worth of blocks, as long as we unreserve an entire page's worth
once the page gets fully mapped to disk.

> There are already buffer head flags for delalloc (block to be allocated 
>   on flush) and unwritten (actual block allocated to a file but not yet 
> written) in the vfs - shouldn't we be looking at using those?

We could, but we have this no-buffer-head mode which really should be the
preferred method, for 4k pagesize systems, at least.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


i_version_3_ext4_user_interface.patch

2007-02-15 Thread Andrew Morton

ia64 allmodconfig gives:

fs/stat.c: In function `cp_new_stat':
fs/stat.c:234: error: structure has no member named `st_i_version'

I'll drop the ext4 tree.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


booked-page-flag.patch

2007-02-15 Thread Andrew Morton

Sorry, we're seriously, seriously, seriously short on flags in the page
struct and this patch is going to need one heck of a good case for it to be
acceptable.

Even then, we should put a lot of effort into finding some way of avoiding
adding that page flag.  One option might be to add a new radix-tree tag,
and defining it as "for filesytem usage".  Or use PG_checked (which should
be renamed to to PG_fs_misc) (if that doesn't conflict with ext4's existing
use of PG_checked).  Or use !PageMappedToDisk()?

These patches seem to have a number of issues - we should get them properly
commented and properly changelogged then get them on the wire for decent
review before investing too much in them, please.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


i_version_1_vfs_layer

2007-02-15 Thread Andrew Morton
This:

ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/2.6.20-ext4-1/broken-out/i_version_1_vfs_layer

significantly deoptimises file_update_time() for major filesystems (eg, ext3).

The changelog offers no reason for this alteration.  In fact the changelog
basically says nothing at all and needs a lot of work.

What is this patch doing and why does it need to make that change?

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


i_version_2_ext4_specific_code

2007-02-15 Thread Andrew Morton

This:

ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/2.6.20-ext4-1/broken-out/i_version_2_ext4_specific_code

has 100% broken whitespace: it uses spaces where there should be hard tabs.

Please let's not merge anything which does that.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-08 Thread Andrew Morton
On Thu, 8 Feb 2007 11:21:02 +0100 Jan Kara <[EMAIL PROTECTED]> wrote:

> On Thu 08-02-07 01:45:29, Andrew Morton wrote:
>  
> > >   I though Andreas meant "any write changes" - i.e. you check that noone
> > > has open file descriptor for writing and block any new open for writing.
> > > That can be done quite easily.
> > >   Anyway, I agree with you that userspace solution to a possible page
> > > cache pollution is preferable after thinking about it for a while.
> > > As I've been thinking about it, we could actually do the copying
> > > from user space. We could do something like:
> > >   block any writes to file (as I described above)
> > >   craft new inode with blocks allocated as we want (using preallocation,
> > > we should mostly have the kernel infrastructure we need)
> > >   copy data using splice syscall
> > >   call the kernel to switch data
> > > 
> > 
> > I don't think we need to block any writes to any file or anything.
> > 
> > To move a page within a file:
> > 
> > fd = open(file);
> > p = mmap(fd);
> > the_page_was_in_core = mincore(p, offset);
> > munmap(p);
> > ioctl(fd, ..., new_block);
> > 
> > 
> > read_cache_page(inode, offset);
> > lock_page(page);
> > if (try_to_free_buffers(page)) {
> > 
> > set_page_dirty(page);
> > }
> > unlock_page(page);
> > 
> > if (the_page_was_in_core) {
> > sync_file_range(fd, offset SYNC_FILE_RANGE_WAIT_BEFORE|
> > SYNC_FILE_RANGE_WRITE|
> > SYNC_FILE_RANGE_WAIT_AFTER);
> > fadvise(fd, offset, FADV_DONTNEED);
> > }
> > 
> > completely coherent with pagecache, quite safe in the presence of mmap,
> > mlock, O_DIRECT, everything else.  Also fully journallable in-kernel.
>   Yes, this is the simple way. But I see two disadvantages:
> 1) You'd like to relocate metadata (indirect blocks) too.

Well.  Do we really?  Are we looking for a 100% solution here, or a 90% one?

Relocating data is the main thing.  After that, yeah, relocating metadata,
inodes and directories is probably a second-order thing.

> For that you need
>a different mechanism.

I suspect a similar approach will work there: load and lock the
buffer_heads (or maybe just the top-level buffer_head) and then alter their
contents.  It could be that verify_chain() will just magically do the right
thing there, but some changes might be needed.

> In my approach, you can mostly assume you've got
>sanely laid out metadata and so the existence of such mechanism is not
>so important.
> 2) You'd like to allocate new blocks in big chunks. So your kernel function
>should rather take a range. Also when you fail in the middle of
>relocating a file (for example the block you'd like to use is already
>taken by someone else), I find it nice if you can return at least to the
>original state. But that's probably not important.

Well yes, that was a minimal sketch.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-08 Thread Andrew Morton
On Thu, 8 Feb 2007 10:29:45 +0100 Jan Kara <[EMAIL PROTECTED]> wrote:

> On Wed 07-02-07 12:56:59, Andrew Morton wrote:
> > On Wed, 7 Feb 2007 13:46:57 -0700
> > Andreas Dilger <[EMAIL PROTECTED]> wrote:
> > 
> > > On Feb 06, 2007  17:35 -0800, Andrew Morton wrote:
> > > > On Mon, 5 Feb 2007 14:12:04 +0100
> > > > Jan Kara <[EMAIL PROTECTED]> wrote:
> > > > > > Move the blocks on the temporary inode to the original inode
> > > > > > by a page.
> > > > > > 1. Read the file data from the old blocks to the page
> > > > > > 2. Move the block on the temporary inode to the original inode
> > > > > > 3. Write the file data on the page into the new blocks
> > > > >   I have one thing - it's probably not good to use page cache for
> > > > > defragmentation.
> > > > 
> > > > Then it is no longer online defragmentation.  The issues with 
> > > > maintaining
> > > > correctness and coherency with ongoing VFS activity would be truly 
> > > > ghastly.
> > > > 
> > > > If we're worried about pagecache pollution then it would be better to 
> > > > control
> > > > that from userspace via fadvise().
> > > 
> > > It should be possible to have the online defrag tool lock the inode 
> > > against
> > > any changes,
> > 
> > Sounds easy when you say it fast.  But how do we "lock" against, say, a
> > read pagefault?  Only by writing back then removing the pagecache page then
> > reinstantiating it as a locked, not-uptodate page and then removing it from
> > pagecache afterwards prior to unlocking it.  Or something.
> > 
> > I don't think we want to go there.
>   I though Andreas meant "any write changes" - i.e. you check that noone
> has open file descriptor for writing and block any new open for writing.
> That can be done quite easily.
>   Anyway, I agree with you that userspace solution to a possible page
> cache pollution is preferable after thinking about it for a while.
> As I've been thinking about it, we could actually do the copying
> from user space. We could do something like:
>   block any writes to file (as I described above)
>   craft new inode with blocks allocated as we want (using preallocation,
> we should mostly have the kernel infrastructure we need)
>   copy data using splice syscall
>   call the kernel to switch data
> 

I don't think we need to block any writes to any file or anything.

To move a page within a file:

fd = open(file);
p = mmap(fd);
the_page_was_in_core = mincore(p, offset);
munmap(p);
ioctl(fd, ..., new_block);


read_cache_page(inode, offset);
lock_page(page);
if (try_to_free_buffers(page)) {

set_page_dirty(page);
}
unlock_page(page);

if (the_page_was_in_core) {
sync_file_range(fd, offset SYNC_FILE_RANGE_WAIT_BEFORE|
SYNC_FILE_RANGE_WRITE|
SYNC_FILE_RANGE_WAIT_AFTER);
fadvise(fd, offset, FADV_DONTNEED);
}

completely coherent with pagecache, quite safe in the presence of mmap,
mlock, O_DIRECT, everything else.  Also fully journallable in-kernel.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-07 Thread Andrew Morton
On Wed, 7 Feb 2007 13:46:57 -0700
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> On Feb 06, 2007  17:35 -0800, Andrew Morton wrote:
> > On Mon, 5 Feb 2007 14:12:04 +0100
> > Jan Kara <[EMAIL PROTECTED]> wrote:
> > > > Move the blocks on the temporary inode to the original inode
> > > > by a page.
> > > > 1. Read the file data from the old blocks to the page
> > > > 2. Move the block on the temporary inode to the original inode
> > > > 3. Write the file data on the page into the new blocks
> > >   I have one thing - it's probably not good to use page cache for
> > > defragmentation.
> > 
> > Then it is no longer online defragmentation.  The issues with maintaining
> > correctness and coherency with ongoing VFS activity would be truly ghastly.
> > 
> > If we're worried about pagecache pollution then it would be better to 
> > control
> > that from userspace via fadvise().
> 
> It should be possible to have the online defrag tool lock the inode against
> any changes,

Sounds easy when you say it fast.  But how do we "lock" against, say, a
read pagefault?  Only by writing back then removing the pagecache page then
reinstantiating it as a locked, not-uptodate page and then removing it from
pagecache afterwards prior to unlocking it.  Or something.

I don't think we want to go there.

> flush all pages out of the cache for that inode, and then do
> the reallocated outside of the page cache.  For inodes not already in cache
> this is a no-op.  For the (hopefully rare) case were the inode already has
> cached pages and also needs to be reallocated it would be a performance hit.
> 
> Alternately, we could skip files currently being modified (or mmaped), or
> even recently modified (e.g. within the last 30 minutes) in the default case,
> on the assumption that they might be deleted soon anyways.

argh.

It's simple to just use pagecache.  The "we don't want to swamp the machine
with pagecache" argument is bogus.  If it becomes a problem (and it might
not) then it is very simple to control the pagecache from userspace.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-06 Thread Andrew Morton
On Mon, 5 Feb 2007 14:12:04 +0100
Jan Kara <[EMAIL PROTECTED]> wrote:

> > Move the blocks on the temporary inode to the original inode
> > by a page.
> > 1. Read the file data from the old blocks to the page
> > 2. Move the block on the temporary inode to the original inode
> > 3. Write the file data on the page into the new blocks
>   I have one thing - it's probably not good to use page cache for
> defragmentation.

Then it is no longer online defragmentation.  The issues with maintaining
correctness and coherency with ongoing VFS activity would be truly ghastly.

If we're worried about pagecache pollution then it would be better to control
that from userspace via fadvise().
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-06 Thread Andrew Morton
On Tue, 16 Jan 2007 21:05:20 +0900
[EMAIL PROTECTED] wrote:

> Move the blocks on the temporary inode to the original inode
> by a page.
> 1. Read the file data from the old blocks to the page
> 2. Move the block on the temporary inode to the original inode
> 3. Write the file data on the page into the new blocks
> 
> ...
>
> +
> +/**
> + * ext4_ext_replace_branches - replace original extents with new extents.
> + * @org_inodeOriginal inode
> + * @dest_inode   temporary inode
> + * @from_pagePage offset
> + * @count_page   Page count to be replaced
> + * @delete_start block offset for deletion
> + *
> + * This function returns 0 if succeed, otherwise returns error value.
> + * Replace extents for blocks from "from" to "from+count-1".
> + */
> +static int
> +ext4_ext_replace_branches(struct inode *org_inode, struct inode *dest_inode,
> + pgoff_t from_page,  pgoff_t dest_from_page,
> + pgoff_t count_page, unsigned long *delete_start) 
> +{
> + struct ext4_ext_path *org_path = NULL;
> + struct ext4_ext_path *dest_path = NULL;
> + struct ext4_extent   *oext, *dext;
> + struct ext4_extent   tmp_ext;
> + int err = 0;
> + int depth;
> + unsigned long from, count, dest_off, diff, replaced_count = 0;

These should be sector_t, shouldn't they?

> + handle_t *handle = NULL;
> + unsigned jnum;
> +
> + from = from_page << (PAGE_CACHE_SHIFT - dest_inode->i_blkbits);

In which case one needs to be very careful to avoid overflows in
expressions such as this one.

> + wait_on_page_locked(page);
> + lock_page(page);

The wait_on_page_locked() is unneeded here.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1][RFC] mm: prepare_write positive return value

2007-02-06 Thread Andrew Morton
On Tue, 06 Feb 2007 11:33:46 +0300
Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:

> Almost all read/write operation handles data with chunks(segments or pages)
> and result has integral behaviour for folowing scenario: 
> for_each_chunk() {
>  res = op();
>  if(IS_ERROR(res))
>return progress ? progress : res;
>  progress += res;
> }
> prepare_write may has integral behaviour in case of blksize < pgsize,
> for example we successfully allocated/read some blocks, but not all of them,
> and than some error happend. Currently we eliminate this progress by doing
> vmtrunate() after prepare_has failed.
> It is good to have ability to signal about this progress. Interprete positive
> prepare_write() ret code as bytes num that fs ready to handle at this moment.
> I've ask SAW, he think it is sane. This size always less than PAGE_CACHE_SIZE
> so it less than AOP_TRUNCATED_PAGE too.
>  
> BTH: This approach was used in OpenVZ 2.6.9 kernel in order to make FS with 
> delayed allocation more correct, and its works well.
> I think not everybody will happy about this,  but let's discuss all advantages
> and disadvantages of this change.

That seems to be a logical change.  We'd need to review all the callers and
callees to make sure that they handle this change correctly.

Your changes deviate quite a lot from standard kernel coding style.  Please fix
that.

Please cc linux-ext4@vger.kernel.org on the next version of these patches.  I'm
seriously running out of bandwidth over here and ext4 has other maintainers.

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fw: [BUG -mm] ext3_orphan_add() accessing corrupted list on a corrupted ext3fs

2007-02-01 Thread Andrew Morton


Begin forwarded message:

Date: Thu, 1 Feb 2007 16:44:39 +0800
From: Fengguang Wu <[EMAIL PROTECTED]>
To: LKML 
Subject: [BUG -mm] ext3_orphan_add() accessing corrupted list on a corrupted 
ext3fs


I accidentally ran two qemu instances on the same ext3 fs, after that bad
things happened. After exiting the two qemus and running a new one, I got the
following oops:

root ~# ll /etc/mtab
/bin/ls: /etc/mtab: Input/output error
root ~# rm /etc/mtab
[  147.213090] EXT3-fs warning (device hda): ext3_unlink: Deleting nonexistent 
file (1775838), 0
root ~# halt
[  152.651209] list_add corruption. next->prev should be prev 
(810007be1a38), but was 81000717e3d8. (next=81000717e3d8).
[  152.652507] [ cut here ]
[  152.652900] kernel BUG at lib/list_debug.c:27!
[  152.653283] invalid opcode:  [1] SMP
[  152.653649] last sysfs file: /block/md2/uevent
[  152.654020] CPU 0
[  152.654228] Modules linked in:
[  152.654549] Pid: 1107, comm: zsh Not tainted 2.6.20-rc6-mm3 #1
[  152.655397] RIP: 0010:[]  [] 
__list_add+0x48/0xb0
[  152.656139] RSP: 0018:8100062bdd78  EFLAGS: 0296
[  152.656572] RAX: 0088 RBX: 81000717e3d8 RCX: 
[  152.657140] RDX: 8101a433 RSI: 0001 RDI: 8141fb40
[  152.657708] RBP: 8100062bdd98 R08: 0002 R09: 8101a270
[  152.658275] R10: 8100062bdb58 R11: 0006 R12: 810007be1a38
[  152.658842] R13: 81000717e3d8 R14: 810005a52170 R15: 81000717e3d8
[  152.659415] FS:  2ba30c98ae90() GS:81488000() 
knlGS:
[  152.660068] CS:  0010 DS:  ES:  CR0: 8005003b
[  152.660531] CR2: 2ba30d67 CR3: 062ee000 CR4: 06e0
[  152.661103] Process zsh (pid: 1107, threadinfo 8100062bc000, task 
810007895080)
[  152.661731] Stack:  8100061245b0  810007bcac20 
81000717e470
[  152.662483]  8100062bdda8 8116f5cc 8100062bde18 
81129463
[  152.663147]  81000717e470 81000717e300 8100061245b0 
0f80
[  152.663779] Call Trace:
[  152.664035]  [] list_add+0xc/0x10
[  152.664439]  [] ext3_orphan_add+0x163/0x1a0
[  152.664943]  [] ext3_unlink+0x150/0x1c0
[  152.665385]  [] vfs_unlink+0xb2/0x110
[  152.665813]  [] do_unlinkat+0x108/0x1f0
[  152.666255]  [] trace_hardirqs_on_thunk+0x35/0x37
[  152.666761]  [] trace_hardirqs_on+0x1a9/0x1d0
[  152.667239]  [] trace_hardirqs_on_thunk+0x35/0x37
[  152.667758]  [] sys_unlink+0x11/0x20
[  152.668180]  [] system_call+0x7e/0x83
[  152.668602]
[  152.668749]
[  152.668754] Code: 0f 0b 66 66 90 66 66 90 eb fe 31 f6 49 3b 1c 24 48 c7 c7 60
[  152.669850] RIP  [] __list_add+0x48/0xb0
[  152.670322]  RSP 
[  152.670842] BUG: at kernel/exit.c:860 do_exit()
[  152.671209]
[  152.671214] Call Trace:
[  152.671543]  [] profile_task_exit+0x15/0x20
[  152.671992]  [] do_exit+0x6b/0xac0
[  152.672384]  [] _spin_unlock_irqrestore+0x4c/0x60
[  152.672871]  [] die+0x61/0x70
[  152.673230]  [] do_trap+0xf0/0x110
[  152.673624]  [] do_invalid_op+0xb3/0xc0
[  152.674048]  [] __list_add+0x48/0xb0
[  152.675955]  [] error_exit+0x0/0x96
[  152.676385]  [] release_console_sem+0x50/0x230
[  152.676876]  [] release_console_sem+0x213/0x230
[  152.677370]  [] __list_add+0x48/0xb0
[  152.67]  [] __list_add+0x48/0xb0
[  152.678200]  [] list_add+0xc/0x10
[  152.678598]  [] ext3_orphan_add+0x163/0x1a0
[  152.679105]  [] ext3_unlink+0x150/0x1c0
[  152.679570]  [] vfs_unlink+0xb2/0x110
[  152.679991]  [] do_unlinkat+0x108/0x1f0
[  152.680436]  [] trace_hardirqs_on_thunk+0x35/0x37
[  152.680945]  [] trace_hardirqs_on+0x1a9/0x1d0
[  152.681411]  [] trace_hardirqs_on_thunk+0x35/0x37
[  152.681909]  [] sys_unlink+0x11/0x20
[  152.682341]  [] system_call+0x7e/0x83
[  152.682761]

Regards,
Wu
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How git affects kernel.org performance

2007-01-07 Thread Andrew Morton
On Sun, 7 Jan 2007 09:55:26 +0100
Willy Tarreau <[EMAIL PROTECTED]> wrote:

> On Sat, Jan 06, 2007 at 09:39:42PM -0800, Linus Torvalds wrote:
> > 
> > 
> > On Sat, 6 Jan 2007, H. Peter Anvin wrote:
> > > 
> > > During extremely high load, it appears that what slows kernel.org down 
> > > more
> > > than anything else is the time that each individual getdents() call takes.
> > > When I've looked this I've observed times from 200 ms to almost 2 seconds!
> > > Since an unpacked *OR* unpruned git tree adds 256 directories to a cleanly
> > > packed tree, you can do the math yourself.
> > 
> > "getdents()" is totally serialized by the inode semaphore. It's one of the 
> > most expensive system calls in Linux, partly because of that, and partly 
> > because it has to call all the way down into the filesystem in a way that 
> > almost no other common system call has to (99% of all filesystem calls can 
> > be handled basically at the VFS layer with generic caches - but not 
> > getdents()).
> > 
> > So if there are concurrent readdirs on the same directory, they get 
> > serialized. If there is any file creation/deletion activity in the 
> > directory, it serializes getdents(). 
> > 
> > To make matters worse, I don't think it has any read-ahead at all when you 
> > use hashed directory entries. So if you have cold-cache case, you'll read 
> > every single block totally individually, and serialized. One block at a 
> > time (I think the non-hashed case is likely also suspect, but that's a 
> > separate issue)
> > 
> > In other words, I'm not at all surprised it hits on filldir time. 
> > Especially on ext3.
> 
> At work, we had the same problem on a file server with ext3. We use rsync
> to make backups to a local IDE disk, and we noticed that getdents() took
> about the same time as Peter reports (0.2 to 2 seconds), especially in
> maildir directories. We tried many things to fix it with no result,
> including enabling dirindexes. Finally, we made a full backup, and switched
> over to XFS and the problem totally disappeared. So it seems that the
> filesystem matters a lot here when there are lots of entries in a
> directory, and that ext3 is not suitable for usages with thousands
> of entries in directories with millions of files on disk. I'm not
> certain it would be that easy to try other filesystems on kernel.org
> though :-/
> 

Yeah, slowly-growing directories will get splattered all over the disk.

Possible short-term fixes would be to just allocate up to (say) eight
blocks when we grow a directory by one block.  Or teach the
directory-growth code to use ext3 reservations.

Longer-term people are talking about things like on-disk rerservations. 
But I expect directories are being forgotten about in all of that.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext2: conditional removal of NFSD code

2007-01-06 Thread Andrew Morton
On Sat, 6 Jan 2007 22:58:31 +0300
Alexey Dobriyan <[EMAIL PROTECTED]> wrote:

> Nor me nor my box is going to act as NFS server, so ifdef all
> exporting code.
> 
> Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]>
> ---
> 
>  fs/ext2/namei.c |2 ++
>  fs/ext2/super.c |4 
>  2 files changed, 6 insertions(+)
> 
> --- a/fs/ext2/namei.c
> +++ b/fs/ext2/namei.c
> @@ -70,6 +70,7 @@ static struct dentry *ext2_lookup(struct
>   return d_splice_alias(inode, dentry);
>  }
>  
> +#if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE)
>  struct dentry *ext2_get_parent(struct dentry *child)
>  {
>   unsigned long ino;
> @@ -94,6 +95,7 @@ struct dentry *ext2_get_parent(struct de
>   }
>   return parent;
>  } 
> +#endif
>  
>  /*
>   * By the time this is called, we already have created
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -250,6 +250,7 @@ #ifdef CONFIG_QUOTA
>  #endif
>  };
>  
> +#if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE)
>  static struct dentry *ext2_get_dentry(struct super_block *sb, void *vobjp)
>  {
>   __u32 *objp = vobjp;
> @@ -297,6 +298,7 @@ static struct export_operations ext2_exp
>   .get_parent = ext2_get_parent,
>   .get_dentry = ext2_get_dentry,
>  };
> +#endif
>  
>  static unsigned long get_sb_block(void **data)
>  {
> @@ -916,7 +918,9 @@ static int ext2_fill_super(struct super_
>* set up enough so that it can read an inode
>*/
>   sb->s_op = &ext2_sops;
> +#if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE)
>   sb->s_export_op = &ext2_export_ops;
> +#endif
>   sb->s_xattr = ext2_xattr_handlers;
>   root = iget(sb, EXT2_ROOT_INO);
>   sb->s_root = d_alloc_root(root);
> 

The linker will remove unreferenced static functions from vmlinux.  So a
better approach would be to move ext2_get_parent into super.c, make it
static and do Matthew's trick, yielding zero ifdefs.

That has the additional advantage that the nfsd-supporting code will always
be compiled.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext2: conditional removal of NFSD code

2007-01-06 Thread Andrew Morton
On Sat, 6 Jan 2007 22:58:31 +0300
Alexey Dobriyan <[EMAIL PROTECTED]> wrote:

> --- a/fs/ext2/namei.c
> +++ b/fs/ext2/namei.c
> @@ -70,6 +70,7 @@ static struct dentry *ext2_lookup(struct
>   return d_splice_alias(inode, dentry);
>  }
>  
> +#if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE)
>  struct dentry *ext2_get_parent(struct dentry *child)

hm.  Officially, one module isn't supposed to know about the presence of
another one at compile-time.  Someone might want to come along and later
configure and compile the nfsd module and then try to load it into a kernel
which wasn't compiled with nfsd enabled.

But given that both modules are in mainline I suspect that nobody would
really be doing that in practice, and in the case of nfsd it might not even
work.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 18:32:07 -0500
Theodore Tso <[EMAIL PROTECTED]> wrote:

> On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> > I think ext3 is terminally crap by now. It still uses buffer heads in 
> > places where it really really shouldn't, and as a result, things like 
> > directory accesses are simply slower than they should be. Sadly, I don't 
> > think ext4 is going to fix any of this, either.
> 
> Not just ext3; ocfs2 is using the jbd layer as well.  I think we're
> going to have to put this (a rework of jbd2 to use the page cache) on
> the ext4 todo list, and work with the ocfs2 folks to try to come up
> with something that suits their needs as well.  Fortunately we have
> this filesystem/storage summit thing coming up in the next few months,
> and we can try to get some discussion going on the linux-ext4 mailing
> list in the meantime.  Unfortunately, I don't think this is going to
> be trivial.

I suspect it would be insane to move any part of JBD (apart from the
ordered-data flush) to use pagecache.  The whole thing is fundamentally
block-based.  But only for metadata - there's no strong reason why ext3/4
needs to manipulate file data via buffer_heads if data=journal and chattr
+j aren't in use.

We could possibly move ext3/4 directories out of the blockdev pagecache and
into per-directory pagecache, but that wouldn't change anything - the
journalling would still be block-based.

Adam Richter spent considerable time a few years ago trying to make the
mpage code go direct-to-BIO in all cases and we eventually gave up.  The
conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
and ugly to fully optimise away the "block" bit in the middle.

buffer_heads become more important with large PAGE_CACHE_SIZE.  I'd expect
nobh mode to be quite inefficient with some workloads on 64k pages.  We
need that representation of the state (and location) of the block-sized
hunks which make up the page.

> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> the grief this is causing the page I/O and VM routines.  OTOH, reiser3
> probably has the same problems, and I suspect the changes to ext3 to
> cause it to avoid buffer heads, especially in order to support for
> filesystem blocksizes < pagesize, are going to be sufficiently risky
> in terms of introducing regressions to ext3 that they would probably
> be rejected on those grounds.  So unfortunately, we probably are going
> to have to support flushes via buffer heads for the foreseeable
> future.

We'll see.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ext4-delayed-allocation.patch

2006-12-23 Thread Andrew Morton
On Fri, 22 Dec 2006 23:28:32 +0300
Alex Tomas <[EMAIL PROTECTED]> wrote:

> +/*
> + * With EXT4_WB_SKIP_SMALL defined the patch will try to avoid
> + * small I/Os ignoring ->writepages() if mapping hasn't enough
> + * contig. dirty pages
> + */
> +#define EXT4_WB_SKIP_SMALL__
> +
> +#define WB_ASSERT(__x__) if (!(__x__)) BUG();

This is unused.  Please kill.

> +#define WB_DEBUG__
> +#ifdef WB_DEBUG
> +#define wb_debug(fmt,a...)   printk(fmt, ##a);
> +#else
> +#define wb_debug(fmt,a...)
> +#endif

It's tiresome for each piece of kernel code to implement private debug
macros.  Why not use pr_debug()?

In general, this patch adds a mountain of code which I suspect just
shouldn't be in a filesystem.  It should be library code in mm/ or fs/. 
It'll take some thought and refactoring and definition of new
address_space_operations entries, etc.  But burying all this inside a
single filesystem is just The Wrong Thing To Do.

I am not inclined to review the code in detail because the lack of suitable
code comments would make that task much larger and significantly less
useful than it should be.  Please spend a day or so commenting the code in
a similar manner to other parts of ext4 and jbd2.  When doing so, put
yourself in the position of an experienced kernel developer who seeks to
understand what the code is doing and why it is doing it.  Skilful
commenting is essential if the code is to be maintainable and it has an
immediate impact upon the code's quality.  Every non-trivial function
should have an introductory comment which tells the reader why this
function exists, what it does and, if not obvious, how it does it.  Don't
bother with kernel-doc comments - for some reason they tend to be useless
for code conprehension.

I shouldn't need to sit here scratching my head and wondering why the heck
a writepages function is running __set_page_dirty_nobuffers().

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ext4-block-reservation.patch

2006-12-23 Thread Andrew Morton
On Fri, 22 Dec 2006 23:25:16 +0300
Alex Tomas <[EMAIL PROTECTED]> wrote:

Once this code is settled in we should consider removal of the existing
reservations code from ext4.

> +
> +struct ext4_reservation_slot {
> + __u64   rs_reserved;
> + spinlock_t  rs_lock;
> +} cacheline_aligned;

Should be cacheline_aligned_in_smp.

That's assuming it needs to be cacheline aligned at all.  It can consume a
lot of space.



oh, this should be allocated with alloc_percpu(), in which case the
open-coded alignment can perhaps go away.

> +
> +int ext4_reserve_local(struct super_block *sb, int blocks)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + struct ext4_reservation_slot *rs;
> + int rc = -ENOSPC;
> +
> + preempt_disable();
> + rs = sbi->s_reservation_slots + smp_processor_id();

use get_cpu() here.

> +void ext4_rebalance_reservation(struct ext4_reservation_slot *rs, __u64 free)
> +{
> + int i, used_slots = 0;
> + __u64 chunk;
> +
> + /* let's know what slots have been used */
> + for (i = 0; i < NR_CPUS; i++)
> + if (rs[i].rs_reserved || i == smp_processor_id())
> + used_slots++;
> +
> + /* chunk is a number of block every used
> +  * slot will get. make sure it isn't 0 */
> + chunk = free + used_slots - 1;
> + do_div(chunk, used_slots);
> +
> + for (i = 0; i < NR_CPUS; i++) {

all these NR_CPUS loops need to go away.  Use either
for_each_possible_cpu() or, preferably, for_each_online_cpu() and a hotplug
notifier.

Why is this code using per-cpu data at all, btw?  These optimisations tend
to be marginal in filesystems.  What is the perfomance impact of making
this data be single-superblock-wide-instance?

> +int ext4_reserve_init(struct super_block *sb)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + struct ext4_reservation_slot *rs;
> + int i;
> +
> + rs = kmalloc(sizeof(struct ext4_reservation_slot) * NR_CPUS, 
> GFP_KERNEL);

alloc_percpu()

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH] Fix kmalloc flags used in ext3 with an active journal handle

2006-12-20 Thread Andrew Morton
On Tue, 19 Dec 2006 18:22:03 -0800
Suzuki <[EMAIL PROTECTED]> wrote:

> 
> Andrew Morton wrote:
> > On Tue, 19 Dec 2006 17:58:12 -0800
> > Suzuki <[EMAIL PROTECTED]> wrote:
> > 
> > 
> >>* Fix the kmalloc flags used from within ext3, when we have an active 
> >>journal handle
> >>
> >>If we do a kmalloc with GFP_KERNEL on system running low on memory, 
> >> with an active journal handle, we might end up in cleaning up the fs cache 
> >> flushing dirty inodes for some other filesystem. This would cause hitting 
> >> a J_ASSERT() in :
> > 
> > 
> > The change might be needed (haven't looked at it yet).  But I'd like to see
> > the full BUG trace, please.  To see the callchain.
> 
> Here is the call trace which was hit by one of our test teams. This was 
> from fs/ext3/xattr.c. While looking for similar calls I found the others 
> described in the patch.
> 
> Assertion failure in journal_start() at fs/jbd/transaction.c:274: "handle-
>  >h_transaction->t_journal == journal"
> kernel BUG at fs/jbd/transaction.c:274!
> illegal operation: 0001 [#1]
> CPU:0Not tainted (2.6.5-7.282-s390x SLES9_SP3_BRANCH-20061031152356)
> Process dbench (pid: 14070, task: 025617f0, ksp: 01057630)
> Krnl PSW : 07018000 08837b38 (journal_start+0x90/0x15c 
> [jbd])
> Krnl GPRS:  00507fc0 002b 
> 01056d80
> 08837b36 2885 08841da6 
> 
> 001bfaa0 03483d08 0002 
> 07a8bda0
> 08833000 088a7d08 08837b36 
> 01056e80
> Krnl Code: 00 00 58 10 b0 0c a7 1a 00 01 b9 04 00 2b 50 10 b0 0c e3 40
> Call Trace:
>   [<088a30fc>] ext3_journal_start+0x8c/0xa4 [ext3]
>   [<08896822>] ext3_dirty_inode+0x3a/0xe0 [ext3]
>   [<001ca362>] __mark_inode_dirty+0x1ae/0x1c8
>   [<001bfaa0>] iput+0xbc/0xf0
>   [<001bdcca>] prune_dcache+0x29e/0x584
>   [<001bdfe4>] shrink_dcache_memory+0x34/0x54
>   [<0017b100>] shrink_slab+0x15c/0x250
>   [<0017b6e4>] try_to_free_pages+0x1c0/0x2a4
>   [<00170276>] __alloc_pages+0x2ba/0x4e0
>   [<0017059a>] __get_free_pages+0x4e/0x8c
>   [<00174ea2>] cache_alloc_refill+0x2a6/0x868
>   [<00175540>] __kmalloc+0xdc/0xe0
>   [<088a4e62>] ext3_xattr_set_handle+0x114a/0x174c [ext3]
>   [<088a54e4>] ext3_xattr_set+0x80/0xd0 [ext3]
>   [<088a6312>] ext3_xattr_user_set+0xce/0xe4 [ext3]
>   [<088a5f1e>] ext3_setxattr+0x17e/0x18c [ext3]
>   [<001c88e6>] setxattr+0x14a/0x234
>   [<001c8a80>] sys_fsetxattr+0xb0/0x110
>   [<0011fc10>] sysc_noemu+0x10/0x16

How did we get from iput() into __mark_inode_dirty()?  I can't see it in
mainline, nor in 2.6.5 which you appear to be using...

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH] Fix kmalloc flags used in ext3 with an active journal handle

2006-12-19 Thread Andrew Morton
On Tue, 19 Dec 2006 17:58:12 -0800
Suzuki <[EMAIL PROTECTED]> wrote:

> * Fix the kmalloc flags used from within ext3, when we have an active journal 
> handle
> 
>   If we do a kmalloc with GFP_KERNEL on system running low on memory, 
> with an active journal handle, we might end up in cleaning up the fs cache 
> flushing dirty inodes for some other filesystem. This would cause hitting a 
> J_ASSERT() in :

The change might be needed (haven't looked at it yet).  But I'd like to see
the full BUG trace, please.  To see the callchain.

Always include the trace...

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4 calls journal_stop

2006-12-08 Thread Andrew Morton
On Thu, 7 Dec 2006 22:17:41 -0800
Randy Dunlap <[EMAIL PROTECTED]> wrote:

> From: Randy Dunlap <[EMAIL PROTECTED]>
> 
> journal_stop() is not defined for ext4; change to ext4_journal_stop().
> 
> Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
> ---
>  fs/ext4/inode.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-2.6.19-git11.orig/fs/ext4/inode.c
> +++ linux-2.6.19-git11/fs/ext4/inode.c
> @@ -1232,7 +1232,7 @@ retry:
>   from, to, NULL, do_journal_get_write_access);
>   if (ret)
>   /* fatal error, just put the handle and return */
> - journal_stop(handle);
> + ext4_journal_stop(handle);

rofl.  That'll link and run if both filesystems are built.  I bet it
doesn't run very well though
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot failure with ext2 and initrds

2006-11-29 Thread Andrew Morton
On Wed, 29 Nov 2006 09:20:24 +
Russell King <[EMAIL PROTECTED]> wrote:

> What I'm looking for is confirmation of the semantics of
> find_next_zero_bit()

What are the existing semantics?  I see no documentation in any of the
architectures I've looked at.  That's my point.

>From a quick read of fs/ext2/balloc.c

ext2_find_next_zero_bit(base, size, offset)

appears to expect that base is the start of the memory buffer, size is the
number of bits at *base and offset is the bit at which to start the search,
relative to base.  If a zero bit is found it will return the offset of that
bit relative to base.  It will return some number greater than `size' if no
zero-bit was found.  

Whether that's how all the implementors interpreted it is anyone's guess. 
Presumably the architectures all do roughly the same thing.

> 

Well likewise.  It appears that nobody (and about 20 people have
implemented these things) could be bothered getting off ass and documenting
the pathetic thing.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot failure with ext2 and initrds

2006-11-29 Thread Andrew Morton
On Wed, 29 Nov 2006 07:40:00 +
Russell King <[EMAIL PROTECTED]> wrote:

> Yet another attempt to get a response from Andrew.  It is rather
> important that you DO respond to this.

You can read the code as easily as I can?  I'm not really sure what
you're asking - I thought Mingming cleared things up.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot failure with ext2 and initrds

2006-11-28 Thread Andrew Morton
On Tue, 28 Nov 2006 13:04:53 -0800
Mingming Cao <[EMAIL PROTECTED]> wrote:

> Thanks, I have acked most of them, and will port them to ext3/4 soon.

You've acked #2 and #3.  #4, #5 and #6 remain un-commented-upon and #1 is
unclear?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot failure with ext2 and initrds

2006-11-16 Thread Andrew Morton
On Thu, 16 Nov 2006 12:15:16 -0800
Mingming Cao <[EMAIL PROTECTED]> wrote:

> On Thu, 2006-11-16 at 01:13 -0800, Andrew Morton wrote:
> > On Thu, 16 Nov 2006 00:49:20 -0800
> > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > 
> > > On Wed, 2006-11-15 at 23:22 -0800, Andrew Morton wrote:
> > > > On Wed, 15 Nov 2006 22:55:43 -0800
> > > > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > > > 
> > > > > Hmm, maxblocks, in bitmap_search_next_usable_block(),  is the end 
> > > > > block 
> > > > > number of the range  to search, not the lengh of the range. maxblocks 
> > > > > get passed to ext2_find_next_zero_bit(), where it expecting to take 
> > > > > the 
> > > > > _size_ of the range to search instead...
> > > > > 
> > > > > Something like this: (this is not a patch)
> > > > >   @@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
> > > > >   ext2_grpblk_t next;
> > > > > 
> > > > >-  next = ext2_find_next_zero_bit(bh->b_data, maxblocks, 
> > > > > start);
> > > > >+  next = ext2_find_next_zero_bit(bh->b_data, 
> > > > > maxblocks-start + 1, start);
> > > > >   if (next >= maxblocks)
> > > > >   return -1;
> > > > >   return next;
> > > > >}
> > > > 
> > > > yes, the `size' arg to find_next_zero_bit() represents the number of 
> > > > bits
> > > > to scan at `offset'.
> > > > 
> > > > So I think your change is correctish.  But we don't want the "+ 1", do 
> > > > we?
> > > > 
> > > I think we still need the "+1", maxblocks here is the ending block of
> > > the reservation window, so the number of bits to scan =end-start+1.
> > > 
> > > > If we're right then this bug could cause the code to scan off the end 
> > > > of the
> > > > bitmap.  But it won't explain Hugh's bug, because of the if (next >= 
> > > > maxblocks).
> > > > 
> > > 
> > > Yeah.. at first I thought it might be related, then, thinked it over,
> > > the bug only makes the bits to scan larger, so if find_next_zero_bit()
> > > returns something off the end of bitmap, that is fine, it just
> > > indicating that there is no free bit left in the rest of bitmap, which
> > > is expected behavior. So bitmap_search_next_usable_block() fail is the
> > > expected. It will move on to next block group and try to create a new
> > > reservation window there.
> > 
> > I wonder why it's never oopsed.  Perhaps there's always a zero in there for
> > some reason.
> > 
> 
> Why you think it should oopsed?  Even if find_next_zero_bit() finds a
> zero bit beyond of the end of bitmap, the check next > maxblocks will
> catch this and make sure we are not taking a zero bit out of the bitmap
> range, so it fails as expected.

If it can read off the end of the buffer, it can oops.  With
CONFIG_DEBUG_PAGEALLOC, especially.


> > > That does not explain the repeated reservation window add and remove
> > > behavior Huge has reported. 
> > 
> > I spent quite some time comparing with ext3.  I'm a bit stumped and I'm
> > suspecting that the simplistic porting the code is now OK, but something's
> > just wrong.
> > 
> > I assume that the while (1) loop in ext3_try_to_allocate_with_rsv() has
> > gone infinite.  I don't see why, but more staring is needed.
> > 
> 
> The loop should not go forever, it will stops when there is no window
> with free bit to reserve in the given block group.

It seems to have done so in Hugh's testing, but there's some question there
now.  Although I didn't check to see if there's a significant difference
between Hugh's patch and mine.


> > What lock protects the fields in struct ext[234]_reserve_window from being
> > concurrently modified by two CPUs?  None, it seems.  Ditto
> > ext[234]_reserve_window_node.  i_mutex will cover it for write(), but not
> > for pageout over a file hole.  If we end up with a zero- or negative-sized
> > window then odd things might happen.
> > 
> 
> Yes, trucate_mutex protect both struct ext[234]_reserve_window and ext
> [234]_reserve_window_node, and struct ext[234]_block_alloc_info.
> Actually I think truncate_mutex protects all data structures related to
> block allocation/mapping structures.

Yes.  I guess ext2 needs a new mutex for this.  Sad.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot failure with ext2 and initrds

2006-11-16 Thread Andrew Morton
On Thu, 16 Nov 2006 01:48:09 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Thu, 16 Nov 2006 12:37:17 +0300
> Alex Tomas <[EMAIL PROTECTED]> wrote:
> 
> > >>>>> Andrew Morton (AM) writes:
> > 
> >  AM> What lock protects the fields in struct ext[234]_reserve_window from 
> > being
> >  AM> concurrently modified by two CPUs?  None, it seems.  Ditto
> >  AM> ext[234]_reserve_window_node.  i_mutex will cover it for write(), but 
> > not
> >  AM> for pageout over a file hole.  If we end up with a zero- or 
> > negative-sized
> >  AM> window then odd things might happen.
> > 
> > truncate_mutex?
> > 
> 
> yes.  hmm.

by which I mean "ext2 doesn't have a truncate_mutex".
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot failure with ext2 and initrds

2006-11-16 Thread Andrew Morton
On Thu, 16 Nov 2006 12:37:17 +0300
Alex Tomas <[EMAIL PROTECTED]> wrote:

> >>>>> Andrew Morton (AM) writes:
> 
>  AM> What lock protects the fields in struct ext[234]_reserve_window from 
> being
>  AM> concurrently modified by two CPUs?  None, it seems.  Ditto
>  AM> ext[234]_reserve_window_node.  i_mutex will cover it for write(), but not
>  AM> for pageout over a file hole.  If we end up with a zero- or 
> negative-sized
>  AM> window then odd things might happen.
> 
> truncate_mutex?
> 

yes.  hmm.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


<    1   2   3   >