Re: [RFC] ext3 freeze feature

2008-02-16 Thread Christoph Hellwig
On Fri, Feb 15, 2008 at 08:51:15PM +0900, Takashi Sato wrote:
> So XFS_IOC_FREEZE and XFS_IOC_THAW cannot be lifted to generic code simply.
> I think we should create new generic numbers for freeze and thaw

Actually, we have lifted driver-specific ioctls to the generic layer many
times before.  That's the only way to make functionality that was
specific to a single driver (or in this case filesystem) generic.  If
the numbering issue confuses you, make sure to add a big comment
describing it.
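
To make the numbering point concrete, here is a minimal userspace sketch.
The SK_* names and the sk_freeze()/sk_thaw() wrappers are made up for
illustration; the values are assumed to match the XFS definitions
(_IOWR('X', 119, int) and _IOWR('X', 120, int)), and the block comment is
the kind of note meant above.

```c
#include <sys/ioctl.h>

/*
 * NUMBERING NOTE (illustrative): the generic freeze/thaw commands keep
 * the 'X' magic and the numbers 119/120 that XFS has always used, so
 * existing binaries keep working.  The 'X' does not mean the commands
 * are XFS-only.
 */
#define SK_IOC_FREEZE	_IOWR('X', 119, int)	/* assumed == XFS_IOC_FREEZE */
#define SK_IOC_THAW	_IOWR('X', 120, int)	/* assumed == XFS_IOC_THAW */

/* Freeze the filesystem that fd lives on; returns 0, or -1 with errno set. */
int sk_freeze(int fd)
{
	int level = 1;	/* xfs_freeze passes 1; the value is ignored */

	return ioctl(fd, SK_IOC_FREEZE, &level);
}

/* Thaw it again; same return convention. */
int sk_thaw(int fd)
{
	int level = 1;

	return ioctl(fd, SK_IOC_THAW, &level);
}
```

Both wrappers simply fail with -1 when fd is not a usable descriptor; an
actual freeze would additionally need the right privileges and a
filesystem that implements the command.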

> And xfs_freeze calls XFS_IOC_FREEZE with a magic number 1, but what is 1?

As Eric said, it's ignored.

> Instead, I'd like the freeze API to take a timeout in seconds so that
> the filesystem is thawed automatically.  That prevents a filesystem from
> staying frozen forever.
> (Because a freezer may cause a deadlock by accessing the frozen filesystem.)

Timeout-based locking is generally a horrible idea; there's a reason
we don't have any primitives for that in the kernel :)
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: move headers out of include/linux

2008-02-10 Thread Christoph Hellwig
On Sun, Feb 10, 2008 at 07:54:32AM -0500, Theodore Tso wrote:
> No, none of this is shared with e2fsprogs; e2fsprogs stopped using the
> kernel header files about seven years ago. (May 2001, e2fsprogs 1.20).

Yeah, I know userspace stopped using the direct copy.  But for example
XFS has exact copies of some headers under fs/xfs also in the userspace
package.  But I assume your answer means you have a completely separate
set of headers for e2fsprogs, which makes sense given it supports ext2,
ext3 and ext4 all with one codebase.

> > > Note that I plan to submit similar patches for ext2 and ext3 as well,
> > > so the diverging from them argument doesn't count.
> 
> There might be other programs like grub that may depend upon ext2_fs.h
> or ext3_fs.h Nope, not grub.  So a few
> things might break, but they are all programs that should have been
> using the libraries shipped with e2fsprogs, and they wouldn't be 
> critical programs.  So no problems that I know of.

We might have to leave the user-space visible parts of ext2_fs.h
in place due to historical reasons, so I will leave that part out of
the first patch.  I don't think that argument is valid for ext3_fs.h,
as ext3 only went into mainline around the time /usr/include/ext2fs/
started appearing, even if ext3_fs.h is currently exported.

> Note Linus just accepted a pull from me (although it just missed the
> -git21 snapshot window ---  it would be nice if that happened at
> 3am Pacific instead of midnight Pacific since very occasionally it's a
> little after midnight before Linus pushes his last set of changes to
> master.kernel.org) so this patch won't apply cleanly to the
> Linus's tip.  I'll take the patch DTRT so it can be placed in the ext4
> tree.

Updated patch is in the same location:

http://verein.lst.de/~hch/ext4-move-headers

> Also, on the git list, Linus mentioned the -rc1 merge window would be
> closing soon, though, so I don't know if this will make 2.6.25.  If it
> doesn't, would you mind terribly if we put this on hold and *not* have
> this in the -mm tree until right before the next merge window opens.
> It's mostly a mechanical change, doesn't need much testing --- and it
> complicates patch management which I know has been making Andrew a bit
> grumpy as of late.

Yeah, if it's too late already we can defer it.



Re: [PATCH] ext4: move headers out of include/linux

2008-02-09 Thread Christoph Hellwig
On Sat, Feb 09, 2008 at 10:39:33AM +0100, Christoph Hellwig wrote:
> Move ext4 headers out of include/linux.  This is just the trivial move;
> there are some more things that could be done later.
> 
> Ted, is any of this shared with e2fsprogs or can we rip out all
> that #ifdef __KERNEL__ junk?
> 
> Note that I plan to submit similar patches for ext2 and ext3 as well,
> so the diverging from them argument doesn't count.
> 
> Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Looks like the patch is too big for vger.  Here's a link instead:

http://verein.lst.de/~hch/ext4-move-headers



Re: [RFC] ext3 freeze feature

2008-02-08 Thread Christoph Hellwig
On Fri, Feb 08, 2008 at 08:26:57AM -0500, Andreas Dilger wrote:
> You may as well make the common ioctl the same as the XFS version,
> both by number and parameters, so that applications which already
> understand the XFS ioctl will work on other filesystems.

Yes.  In fact, you should be able to lift the implementations of
XFS_IOC_FREEZE and XFS_IOC_THAW to generic code, there's nothing
XFS-specific in there.



Re: merge plans, was Re: - disable-ext4.patch removed from -mm tree

2008-02-05 Thread Christoph Hellwig
On Mon, Feb 04, 2008 at 02:35:29PM -0800, Andrew Morton wrote:
> On Mon, 4 Feb 2008 15:24:18 -0500
> Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> 
> > On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote:
> > > On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso <[EMAIL PROTECTED]> wrote:
> > > 
> > > > On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote:
> > > > > When I merge David's iget conversion patches this will instead
> > > > > wreck the ext4 patchset.
> > > > 
> > > > That's ok, it shouldn't be hard for me to fix this up.  How quickly
> > > > will you be able to merge David's iget conversion patches?
> > > 
> > > They're about 1,000 patches back
> > 
> > Care to post a merge plan so we have a slight chance to make sure not
> > too much crap is hiding in these 1000 patches?
> 
> Pretty much everything up to
> 
> #
> # end
> #
> reiser4-sb_sync_inodes.patch

That includes the git trees?  Definitive NACK to unionfs.


merge plans, was Re: - disable-ext4.patch removed from -mm tree

2008-02-04 Thread Christoph Hellwig
On Sun, Feb 03, 2008 at 07:15:40PM -0800, Andrew Morton wrote:
> On Sun, 3 Feb 2008 20:36:26 -0500 Theodore Tso <[EMAIL PROTECTED]> wrote:
> 
> > On Sun, Feb 03, 2008 at 12:25:51PM -0800, Andrew Morton wrote:
> > > When I merge David's iget conversion patches this will instead wreck the
> > > ext4 patchset.
> > 
> > That's ok, it shouldn't be hard for me to fix this up.  How quickly
> > will you be able to merge David's iget conversion patches?
> 
> They're about 1,000 patches back

Care to post a merge plan so we have a slight chance to make sure not
too much crap is hiding in these 1000 patches?


Re: [RFC] ext3 freeze feature

2008-01-26 Thread Christoph Hellwig
On Fri, Jan 25, 2008 at 09:42:30PM +0900, Takashi Sato wrote:
> Hi,
> 
> >I am also wondering whether we should have system call(s) for these:
> >
> >On Jan 25, 2008 12:59 PM, Takashi Sato <[EMAIL PROTECTED]> wrote:
> >>+   case EXT3_IOC_FREEZE: {
> >
> >>+   case EXT3_IOC_THAW: {
> >
> >And just convert XFS to use them too?
> 
> I think it is reasonable to implement it as a generic system call, as you
> said.
> Do the XFS folks think so?

Given that XFS has implemented the ioctls for such a long time, it might
make more sense to simply move the ioctl implementation to fs/ioctl.c
so it applies to all filesystems.  No need to add a new syscall when the
equivalent-functionality ioctls have to be supported forever anyway.
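
As a rough picture of what "move it to fs/ioctl.c" means, here is a
userspace mock of a generic dispatcher.  Everything in it is an
assumption made for illustration: struct sk_super, sk_vfs_ioctl() and
the SK_* names are invented, the command values are assumed to equal the
XFS ones, and the -EBUSY/-EINVAL choices for double freeze and double
thaw are arbitrary.

```c
#include <errno.h>
#include <sys/ioctl.h>

#define SK_IOC_FREEZE	_IOWR('X', 119, int)	/* assumed == XFS_IOC_FREEZE */
#define SK_IOC_THAW	_IOWR('X', 120, int)	/* assumed == XFS_IOC_THAW */

/* Stand-in for the per-filesystem superblock state. */
struct sk_super {
	int frozen;
};

/*
 * Mock of a generic handler: freeze and thaw are handled once, for
 * every filesystem, and anything unknown gets -ENOTTY (where a real
 * kernel would go on to try the filesystem's own ioctl method).
 */
int sk_vfs_ioctl(struct sk_super *sb, unsigned long cmd)
{
	switch (cmd) {
	case SK_IOC_FREEZE:
		if (sb->frozen)
			return -EBUSY;
		sb->frozen = 1;
		return 0;
	case SK_IOC_THAW:
		if (!sb->frozen)
			return -EINVAL;
		sb->frozen = 0;
		return 0;
	default:
		return -ENOTTY;
	}
}
```

The point of the sketch is only the shape: one switch in common code, no
new syscall, and the old command numbers keep working.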


Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]

2008-01-15 Thread Christoph Hellwig
On Tue, Jan 15, 2008 at 01:15:33PM +, Christoph Hellwig wrote:
> They won't fsck in planned downtimes.  They will have to use fsck when
> the shit hits the fan and they need to.   Not sure about ext3, but big
> XFS users with close ties to the US government were concerned about this
> case for really big filesystems and have sponsored speedups, including
> multithreading xfs_repair.  I'm pretty sure the same arguments apply
> to ext3, even if the filesystems are a few orders of magnitude smaller.

And to add to that: thanks to the not quite optimal default of periodic
checking, which I always forget to turn off on test machines, an ext3
fsck speedup would be in my personal interest, and probably in that of
tons of developers :)


Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]

2008-01-15 Thread Christoph Hellwig
On Tue, Jan 15, 2008 at 03:04:41AM -0800, Andrew Morton wrote:
> I'm wondering about the real value of this change, really.
> 
> In any decent environment, people will fsck their ext3 filesystems during
> planned downtime, and the benefit of reducing that downtime from 6
> hours/machine to 2 hours/machine is probably fairly small, given that there
> is no service interruption.  (The same applies to desktops and laptops).
> 
> Sure, the benefit is not *zero*, but it's small.  Much less than it would
> be with ext2.  I mean, the "avoid unplanned fscks" feature is the whole
> reason why ext3 has journalling (and boy is that feature expensive during
> normal operation).
> 
> So...  it's unobvious that the benefit of this feature is worth its risks
> and costs?

They won't fsck in planned downtimes.  They will have to use fsck when
the shit hits the fan and they need to.   Not sure about ext3, but big
XFS users with close ties to the US government were concerned about this
case for really big filesystems and have sponsored speedups, including
multithreading xfs_repair.  I'm pretty sure the same arguments apply
to ext3, even if the filesystems are a few orders of magnitude smaller.



Re: [PATCH] jbd/jbd2: JBD memory allocation cleanups

2007-10-03 Thread Christoph Hellwig
On Thu, Oct 04, 2007 at 01:50:36AM -0400, Theodore Ts'o wrote:
> From: Mingming Cao <[EMAIL PROTECTED]>
> 
> JBD: Replace slab allocations with page cache allocations

It's page allocations, not page cache allocations.

> Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly

That sounds like it should be a different patch.



Re: [PATCH, RFC] add fsck to util-linux

2007-09-26 Thread Christoph Hellwig
On Wed, Sep 26, 2007 at 06:59:46AM -0400, Theodore Tso wrote:
> It looks like you pulled fsck from the master branch of e2fsprogs git;
> there is one slight bug fix in the maint branch that hasn't been
> merged into master yet, commit ed773a263829493e4e4bf612dbec2380cf09349f:

I'll pick that up.

> BTW, I don't like this syntax in the fstab file AT ALL, but it is in
> use in the wild by at least some Fedora users, and it's not documented
> in the fstab man page.  I'd suggest using a filesystem type of bind,
> rather than ext3, as the officially "blessed" way of specifying it in
> fstab, but it badly needs to be documented in the fstab and/or mount
> man pages.  The above patch should probably get included, though, and
> backwards compatibility for allowing "bind" to be specified in the
> mount options, and with a warning message that the specifying "bind"
> in the options field has been deprecated.

The syntax is indeed horrible.  Is it supported by upstream util-linux?

> For future code movement, I don't mind fsck moving over, but I would
> like to manage moving over blkid to util-linux-ng myself, as I have
> some pretty strong feelings about the right way to do things.  I am
> quite willing to add some low-level interfaces so that fsid can use
> the same fs probing logic, and I'm willing to add some code so that
> the high-level interfaces of libblkid, if the /dev/disk/by-* links are
> present and the user isn't asking for information which isn't in the
> blkid cache, will use the symlinks instead.  However, I really don't
> want to encode a dependency on udev being there, and I think it should
> be possible to make the fallback be transparent instead of being a
> compile-time option.

I've started looking into this, and I think that, at least for the
"detect which filesystem we have" part, libblkid is complete overkill.
libvolume_id has a really nice low-level API for that, which is much more
suitable.

So if it was up to me I'd do the following:

 - move libvolume_id out of udev
 - make mount/fsck use libvolume_id unconditionally for detecting the
   filesystem type.  There's absolutely no reason to use anything in
   libblkid for this, and caching the result doesn't help us at all
   as we're going to touch the disk anyway as part of the mount/fsck.
 - make libblkid use libvolume_id internally for filesystem detection

Note that the latter might as well be a static inclusion of the code;
I haven't looked at the details yet.

Another note on moving the libraries into util-linux vs a standalone package:
at least in XFS land people do upgrade xfsprogs frequently, and sometimes
independently of the distro, because new features get added quite a bit,
including new filesystem features that require support.  Having to upgrade
util-linux for that is not very helpful.  So I'm not so sure about moving
this to util-linux.


Re: [PATCH] obsolete libcom-err for SuSE e2fsprogs

2007-09-25 Thread Christoph Hellwig
On Tue, Sep 25, 2007 at 10:25:50PM +0200, Kay Sievers wrote:
> >  Technical details :-)
> 
> What do you miss, these are all technical details. :) In simple words,
> we need a completely policy-free, not try-to-be-smart in any sense set
> of functions to identify a bytestream by magic bytes.

Which is exactly what mount and fsck should be doing as well for a given
device.  In addition they also need to find a device if the fstab line
identifies it by LABEL or UUID.  But these are rather separate issues.
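
For a feel of how small such a policy-free probe can be, here is a
sketch.  It is not libvolume_id's API: sketch_probe_fs() is a made-up
name, it looks only at a caller-supplied buffer, and it knows just two
magics (the ASCII "XFSB" at byte 0 for XFS, and the little-endian
0xEF53 at offset 56 of the ext2/3/4 superblock, which starts at byte
1024).

```c
#include <stddef.h>
#include <string.h>

#define SK_EXT_SB_OFFSET	1024	/* ext2/3/4 superblock start */
#define SK_EXT_MAGIC_OFFSET	56	/* offset of the magic inside it */

/*
 * Identify a byte stream by magic bytes only: no device scanning, no
 * caching, no policy.  Returns a static name, or NULL if nothing
 * matches or the buffer is too short to tell.
 */
const char *sketch_probe_fs(const unsigned char *buf, size_t len)
{
	size_t ext_magic = SK_EXT_SB_OFFSET + SK_EXT_MAGIC_OFFSET;

	if (len >= 4 && memcmp(buf, "XFSB", 4) == 0)
		return "xfs";

	/* 0xEF53 stored little-endian: 0x53 first, then 0xEF */
	if (len >= ext_magic + 2 &&
	    buf[ext_magic] == 0x53 && buf[ext_magic + 1] == 0xEF)
		return "ext2/3/4";

	return NULL;
}
```

Finding a device by LABEL or UUID would sit on top of something like
this as a separate layer, which is the split argued for above.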

> Hmm, only if you really don't want to pull it in util-linux, we could
> have it as a separate tree. I still think util-linux is the best place,
> because the most important user of it is mount/fsck. It's your call, I
> would have no problem sending patches against util-linux. :)

Shipping this with util-linux would make some sense.  Then again I'm
a big fan of not mixing up shared libraries and binaries in the same
package.  This just means the distros have to split them into separate
packages again.



[PATCH, RFC] add fsck to util-linux

2007-09-25 Thread Christoph Hellwig
This adds fsck from latest e2fsprogs git to util-linux.  There are only
tiny changes to integrate it into the build system and nls setup of
util-linux and fixing up the trailing whitespaces quilt is complaining
about.  I've not yet converted it to the fsprobe helpers as the discussion
on those libs is still ongoing and I haven't read up on the fsprobe library
either.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: util-linux-ng/Makefile.am
===
--- util-linux-ng.orig/Makefile.am  2007-09-25 17:27:27.0 +0200
+++ util-linux-ng/Makefile.am   2007-09-25 17:27:30.0 +0200
@@ -4,6 +4,7 @@ SUBDIRS = \
include \
disk-utils \
fdisk \
+   fsck \
getopt \
hwclock \
login-utils \
Index: util-linux-ng/configure.ac
===
--- util-linux-ng.orig/configure.ac 2007-09-25 17:27:27.0 +0200
+++ util-linux-ng/configure.ac  2007-09-25 17:27:30.0 +0200
@@ -564,6 +564,7 @@ AC_CONFIG_FILES([
 Makefile
 disk-utils/Makefile
 fdisk/Makefile
+fsck/Makefile
 getopt/Makefile
 hwclock/Makefile
 include/Makefile
Index: util-linux-ng/fsck/Makefile.am
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ util-linux-ng/fsck/Makefile.am  2007-09-25 17:34:59.0 +0200
@@ -0,0 +1,6 @@
+include $(top_srcdir)/config/include-Makefile.am
+
+usrsbinexec_PROGRAMS = fsck
+fsck_SOURCES = fsck.c fsck.h base_device.c
+fsck_LDADD = -lblkid
+man_MANS = fsck.8
Index: util-linux-ng/fsck/base_device.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ util-linux-ng/fsck/base_device.c 2007-09-25 17:29:27.0 +0200
@@ -0,0 +1,169 @@
+/*
+ * base_device.c
+ *
+ * Return the "base device" given a particular device; this is used to
+ * assure that we only fsck one partition on a particular drive at any
+ * one time.  Otherwise, the disk heads will be seeking all over the
+ * place.  If the base device can not be determined, return NULL.
+ *
+ * The base_device() function returns an allocated string which must
+ * be freed.
+ *
+ * Written by Theodore Ts'o, <[EMAIL PROTECTED]>
+ *
+ * Copyright (C) 2000 Theodore Ts'o.
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Public
+ * License.
+ * %End-Header%
+ */
+#include 
+#if HAVE_UNISTD_H
+#include <unistd.h>
+#endif
+#if HAVE_STDLIB_H
+#include <stdlib.h>
+#endif
+#include 
+#include 
+
+#include "fsck.h"
+
+/*
+ * Required for the uber-silly devfs /dev/ide/host1/bus2/target3/lun3
+ * pathames.
+ */
+static const char *devfs_hier[] = {
+   "host", "bus", "target", "lun", 0
+};
+
+char *base_device(const char *device)
+{
+   char *str, *cp;
+   const char **hier, *disk;
+   int len;
+
+   str = malloc(strlen(device)+1);
+   if (!str)
+   return NULL;
+   strcpy(str, device);
+   cp = str;
+
+   /* Skip over /dev/; if it's not present, give up. */
+   if (strncmp(cp, "/dev/", 5) != 0)
+   goto errout;
+   cp += 5;
+
+   /* Skip over /dev/dsk/... */
+   if (strncmp(cp, "dsk/", 4) == 0)
+   cp += 4;
+
+   /*
+* For md devices, we treat them all as if they were all
+* on one disk, since we don't know how to parallelize them.
+*/
+   if (cp[0] == 'm' && cp[1] == 'd') {
+   *(cp+2) = 0;
+   return str;
+   }
+
+   /* Handle DAC 960 devices */
+   if (strncmp(cp, "rd/", 3) == 0) {
+   cp += 3;
+   if (cp[0] != 'c' || cp[2] != 'd' ||
+   !isdigit(cp[1]) || !isdigit(cp[3]))
+   goto errout;
+   *(cp+4) = 0;
+   return str;
+   }
+
+   /* Now let's handle /dev/hd* and /dev/sd* devices */
+   if ((cp[0] == 'h' || cp[0] == 's') && (cp[1] == 'd')) {
+   cp += 2;
+   /* If there's a single number after /dev/hd, skip it */
+   if (isdigit(*cp))
+   cp++;
+   /* What follows must be an alpha char, or give up */
+   if (!isalpha(*cp))
+   goto errout;
+   *(cp + 1) = 0;
+   return str;
+   }
+
+   /* Now let's handle devfs (ugh) names */
+   len = 0;
+   if (strncmp(cp, "ide/", 4) == 0)
+   len = 4;
+   if (strncmp(cp, "scsi/", 5) == 0)
+   len = 5;
+   if (len) {
+   cp += len;
+   /*
+* Now we proceed down the expect
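
For illustration, the /dev/hd* and /dev/sd* case of base_device() can be
tried in isolation.  This is a cut-down sketch, not the code from the
patch: sketch_base_device() is a made-up name, the md, DAC 960 and devfs
branches are left out, and the error path is assumed to free the copy
and return NULL.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return an allocated "/dev/sdX" or "/dev/hdX" string for a partition
 * name such as "/dev/sda3", or NULL if the name is not understood.
 * The caller must free the result.
 */
char *sketch_base_device(const char *device)
{
	char *str, *cp;

	str = malloc(strlen(device) + 1);
	if (!str)
		return NULL;
	strcpy(str, device);
	cp = str;

	/* Skip over /dev/; if it's not present, give up. */
	if (strncmp(cp, "/dev/", 5) != 0)
		goto errout;
	cp += 5;

	if ((cp[0] == 'h' || cp[0] == 's') && cp[1] == 'd') {
		cp += 2;
		/* If there's a single number after the prefix, skip it */
		if (isdigit((unsigned char)*cp))
			cp++;
		/* What follows must be an alpha char, or give up */
		if (!isalpha((unsigned char)*cp))
			goto errout;
		cp[1] = '\0';	/* cut off the partition number */
		return str;
	}

errout:
	free(str);
	return NULL;
}
```

Two partitions that map to the same returned string sit on the same
disk, so fsck would check them one after the other rather than in
parallel.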

Re: [PATCH] JBD slab cleanups

2007-09-18 Thread Christoph Hellwig
On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> Here is the incremental small cleanup patch. 
> 
> Remove kmalloc usages in jbd/jbd2 and consistently use 
> jbd_kmalloc/jbd2_malloc.

Shouldn't we kill jbd_kmalloc instead?



Re: [PATCH] JBD slab cleanups

2007-09-17 Thread Christoph Hellwig
On Mon, Sep 17, 2007 at 12:29:51PM -0700, Mingming Cao wrote:
> The problem with this patch, as Andreas Dilger pointed today in ext4
> interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
> 1/3-1/2 page space. 
> 
> What was the originally intention to set up slabs for committed_data(and
> frozen_buffer) in JBD? Why not using kmalloc?

kmalloc is using slabs :)

The intent was to avoid the wasted memory, but as we've repeated a gazillion
times, wasted memory on a rather rare codepath doesn't really matter when
the alternative is crashing random storage drivers.


Re: [RFC 1/2] JBD: slab management support for large block(>8k)

2007-09-03 Thread Christoph Hellwig
On Mon, Sep 03, 2007 at 12:31:49PM -0700, Christoph Lameter wrote:
> So you'd be fine with replacing the allocs with
> 
> get_free_pages(GFP_xxx, get_order(size)) ?

Yes.  And rip out all that code related to setting up the slabs.  I plan
to add WARN_ONs to bio_add_page and friends to detect further usage of
slab pages if there is any.
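
To see what get_order(size) buys and costs, here is a userspace stand-in.
It is only a sketch: sketch_get_order() and sketch_wasted_bytes() are
made-up names, not the kernel functions, and a 4 KiB page size is
assumed.

```c
#include <stddef.h>

#define SK_PAGE_SHIFT	12	/* assumption: 4 KiB pages */

/* Smallest order such that (1 << order) pages hold size bytes (size > 0). */
int sketch_get_order(size_t size)
{
	size_t pages = (size + ((size_t)1 << SK_PAGE_SHIFT) - 1) >> SK_PAGE_SHIFT;
	int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}

/* Bytes left unused when size bytes are served from whole pages. */
size_t sketch_wasted_bytes(size_t size)
{
	size_t alloc = ((size_t)1 << sketch_get_order(size)) << SK_PAGE_SHIFT;

	return alloc - size;
}
```

Under that assumption a 1k or 2k block still costs a whole page, while
block sizes that are a power-of-two multiple of the page size waste
nothing.  That is the memory trade-off being accepted here in exchange
for never handing slab memory to the block layer.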



Re: [RFC 1/2] JBD: slab management support for large block(>8k)

2007-09-03 Thread Christoph Hellwig
On Mon, Sep 03, 2007 at 12:55:04AM -0700, Christoph Lameter wrote:
> On Sun, 2 Sep 2007, Christoph Hellwig wrote:
> 
> > > We are doing what you describe right now. So the current code is broken?
> > Yes.
> 
> How about getting rid of the slabs there and use kmalloc? Kmalloc in mm 
> (and therefore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page 
> allocator calls. Not sure what to do about the 1k and 2k requests though.

The problem is that we must never hand kmalloc (slab) pages to the block
layer, so we always need to request a page or more for these.  Better to
use get_free_page directly; that's how I fixed it in XFS a while ago.


Re: [RFC 1/2] JBD: slab management support for large block(>8k)

2007-09-02 Thread Christoph Hellwig
On Sun, Sep 02, 2007 at 04:40:21AM -0700, Christoph Lameter wrote:
> On Sat, 1 Sep 2007, Christoph Hellwig wrote:
> 
> > On Fri, Aug 31, 2007 at 05:12:18PM -0700, Mingming Cao wrote:
> > > >From clameter:
> > > Teach jbd/jbd2 slab management to support >8k block size. Without this, 
> > > it refused to mount on >8k ext3.
> > 
> > 
> > > But the real fix is to kill this code.  We can't send slab pages
> > > down the block layer without breaking iscsi or aoe.  And this code is
> > > used in such rare cases that all the normal testing won't hit it.
> > Very bad combination.
> 
> We are doing what you describe right now. So the current code is broken?

Yes.


Re: [RFC 1/2] JBD: slab management support for large block(>8k)

2007-09-01 Thread Christoph Hellwig
On Fri, Aug 31, 2007 at 05:12:18PM -0700, Mingming Cao wrote:
> >From clameter:
> Teach jbd/jbd2 slab management to support >8k block size. Without this, it 
> refused to mount on >8k ext3.


But the real fix is to kill this code.  We can't send slab pages
down the block layer without breaking iscsi or aoe.  And this code is
used in such rare cases that all the normal testing won't hit it.
Very bad combination.


Re: ZFS, XFS, and EXT4 compared

2007-08-30 Thread Christoph Hellwig
On Thu, Aug 30, 2007 at 05:07:46PM +1000, Nathan Scott wrote:
> To improve metadata performance, you have many options with XFS (which
> ones are useful depends on the type of metadata workload) - you can try
> a v2 format log, and mount with "-o logbsize=256k", try increasing the
> directory block size (e.g. mkfs.xfs -nsize=16k, etc), and also the log
> size (mkfs.xfs -lsize=XXb).

Okay, these suggestions have come up once too often now.  v2 logs and large
logs/log buffers are the almost universal suggestions, and we really need to
make these the defaults.  XFS is already the laughing stock of the Linux
community due to its absurdly bad default settings.


Re: [PATCH 00/25] move handling of setuid/gid bits from VFS into individual setattr functions (RESEND)

2007-08-10 Thread Christoph Hellwig
On Fri, Aug 10, 2007 at 04:47:52PM -0400, Jeff Layton wrote:
> attr->ia_valid after the setattr operation returns. If either ATTR_KILL_*
> bit is set then BUG(). The helper function already clears those bits
> so anything using it should automatically be ok. We'd have to fix
> up NFS and a few others that don't implement suid/sgid.
> 
> This is not as certain as changing the name of the inode operation. It
> would only pop when someone is attempting to change a setuid/setgid
> file on these filesystems. Still, it should conceivably catch most if
> not all offenders. Would that be sufficient to take care of everyone's
> concerns?

I like the idea of checking ia_valid after return a lot.  But instead of
going BUG() it should just do the default action, so that we can avoid
touching all the filesystems and only need to change those that need
special care.  I also have plans to add some new AT_ flags for implementing
some filesystem ioctls in generic code that would benefit greatly from
the ia_valid check after return, to return ENOTTY for filesystems not
implementing those ioctls.


Re: [PATCH 01/25] VFS: move attr_kill logic from notify_change into helper function

2007-08-07 Thread Christoph Hellwig
> +void attr_kill_to_mode(struct inode *inode, struct iattr *attr)

This function badly needs a kerneldoc description.  Also I can't say
I like the name a lot, but without a clearly better idea I should
probably not complain :)

We should at least add a generic_ prefix to indicate it's a generic
helper valid for most filesystems (and the kerneldoc comment can explain
the details).



Re: [PATCH 00/25] move handling of setuid/gid bits from VFS into individual setattr functions (RESEND)

2007-08-07 Thread Christoph Hellwig
First, thanks a lot for doing this work; it's been long needed.

Second, please don't send out that many patches.  We encourage people
to split things into small patches when the changes are logically
separate, which these are not: it's a flag day change (which btw
is fine despite the rants some people spewed in reply to this), so it
should be one single patch.  (Or one for all mainline filesystems +
one per fs only in -mm to make Andrew's life a little easier if you
really care.)


Re: [RFC] basic delayed allocation in VFS

2007-07-29 Thread Christoph Hellwig
On Sun, Jul 29, 2007 at 11:30:36AM -0600, Andreas Dilger wrote:
> Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
> was rejected because "that functionality should go into the VFS".
> Since the performance improvement of delalloc is quite large, we'd
> like to get this into the kernel one way or another.  Can we make a
> decision if the ext4-specific delalloc is acceptable?

I'm a big proponent of having proper common delalloc code, but the
one proposed here is not generic enough for the existing filesystem
using delalloc.  It's still on my todo list to revamp the xfs code to get
rid of some of the existing mess and make it usable generically.  If
the ext4 users are fine with the end result we could move to generic
code.

Note that moving to VFS is bullshit either way; writeback code is
nowhere near the VFS, nor should it be.


Re: [RFC] basic delayed allocation in VFS

2007-07-29 Thread Christoph Hellwig
On Sun, Jul 29, 2007 at 09:48:10PM +0400, Alex Tomas wrote:
> I think the latter one is better because it supports bs < pagesize
> (though I'm not sure about data=ordered yet). I'm not against putting
> most of the patch into fs/ext4/, but at least few bits to be changed
> in fs/ - exports in  fs/mpage.c and one "if" in __block_write_full_page().

The change to __block_write_full_page is obviously fine, and exporting
the mpage.c bits sounds fine to me as well, although I'd like to take a
look at the final patch.


Re: [RFC] basic delayed allocation in VFS

2007-07-28 Thread Christoph Hellwig
On Fri, Jul 27, 2007 at 04:38:44PM +0400, Alex Tomas wrote:
> I just realized that you're talking about data=ordered mode in ext4,
> where care is taken to prevent on-disk references to not-yet-written
> blocks.

Any reference to non-written blocks is a bug.



Re: [RFC] basic delayed allocation in VFS

2007-07-28 Thread Christoph Hellwig
On Fri, Jul 27, 2007 at 11:51:56AM +0400, Alex Tomas wrote:
> >Secondly, apart from delalloc, XFS cannot use the generic code paths
> >for writeback because unwritten extent conversion also requires
> >custom I/O completion handlers. Given that __mpage_writepage() only
> >calls ->writepage when it is confused, XFS simply cannot use this
> >API.
> 
> this doesn't mean fs/mpage.c should go, right?

mpage.c read side is fine for every block based filesystem I know.
mpage.c write side is fine for every simple (non-delalloc, non-unwritten
extent, etc) filesystem.  So it surely shouldn't go.

> I didn't say "generic", see Subject: :)

then it shouldn't be in generic code.



Re: [RFC] basic delayed allocation in VFS

2007-07-28 Thread Christoph Hellwig
On Fri, Jul 27, 2007 at 03:07:14PM +1000, David Chinner wrote:
> > It duplicates fs/mpage.c in bio building and introduces new generic API
> > (iomap, map_blocks_t, etc).
> 
> Using a new API for new functionality is a bad thing?

Depends on what you do.  This patch is just a quick hack to shoehorn
delalloc support into ext4.  Introducing a new abstraction is overkill.
If we really want an overhaul of the writeback path that's extent-aware
and efficient for delalloc and unwritten extents, introducing a proper
iomap-like data structure would make sense.  That being said, I personally
hate the buffer_head abuse for bmap data that we have in various places,
as it's utterly confusing and wastes stack space, but that's a different
discussion.



Re: [RFC] basic delayed allocation in VFS

2007-07-28 Thread Christoph Hellwig
On Thu, Jul 26, 2007 at 06:32:56AM -0400, Jeff Garzik wrote:
> Is this based on Christoph's work?
> 
> Christoph, or some other XFS hacker, already did generic delalloc, 
> modeled on the XFS delalloc code.

This is not based on my attempt to make the xfs writeout path generic.
Alex's variant is a lot simpler and thus misses various bits required
for high sustained writeout performance or xfs functionality.

That doesn't mean I want to argue against Alex's code, although I'd of
course be happier if we could actually share code between multiple
filesystems.

Of course the code in its current form should not go into mpage.c but
rather into ext4 so that it doesn't bloat the kernel for everyone.


Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-13 Thread Christoph Hellwig
On Fri, Jul 13, 2007 at 07:48:58PM +0530, Amit K. Arora wrote:
> Ok. Since we have only one flag (FALLOC_FL_KEEP_SIZE) and we do not want
> to declare the default mode (FALLOC_ALLOCATE), we can _just_ have this
> flag and remove the other mode too (FALLOC_RESV_SPACE).
> Is this what you are suggesting ?

Yes.

> Should we need a header file just to declare one flag - i.e.
> FALLOC_FL_KEEP_SIZE (since now there is no point of declaring the two
> modes) ? If "linux/fs.h" is not a good place, will "asm-generic/fcntl.h"
> be a sane place for this flag ?

It might sound a little silly, but it is the cleanest thing we could do by
far.  And I suspect there will be more flags soon.



Re: [PATCH 3/6][TAKE7] revalidate write permissions for fallocate

2007-07-13 Thread Christoph Hellwig
On Fri, Jul 13, 2007 at 06:18:47PM +0530, Amit K. Arora wrote:
> From: David P. Quigley <[EMAIL PROTECTED]>
> 
> Revalidate the write permissions for fallocate(2), in case security policy has
> changed since the files were opened.
> 
> Acked-by: James Morris <[EMAIL PROTECTED]>
> Signed-off-by: David P. Quigley <[EMAIL PROTECTED]>

This should be merged into the main falloc patch.



Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-13 Thread Christoph Hellwig
On Fri, Jul 13, 2007 at 06:17:55PM +0530, Amit K. Arora wrote:
>  /*
> + * sys_fallocate - preallocate blocks or free preallocated blocks
> + * @fd: the file descriptor
> + * @mode: mode specifies the behavior of allocation.
> + * @offset: The offset within file, from where allocation is being
> + *   requested. It should not have a negative value.
> + * @len: The amount of space in bytes to be allocated, from the offset.
> + *This can not be zero or a negative value.

kerneldoc comments are for in-kernel APIs, which syscalls aren't.  I'd say
just remove this comment; the manpage is much better documentation anyway.

> + *  Generic fallocate to be added for file systems that do not
> + *support fallocate.

Please remove the comment; adding a generic fallback in kernelspace is a
very dumb idea, as we already discussed a long time ago.

> --- linux-2.6.22.orig/include/linux/fs.h
> +++ linux-2.6.22/include/linux/fs.h
> @@ -266,6 +266,21 @@ extern int dir_notify_enable;
>  #define SYNC_FILE_RANGE_WRITE        2
>  #define SYNC_FILE_RANGE_WAIT_AFTER   4
>  
> +/*
> + * sys_fallocate modes
> + * Currently sys_fallocate supports two modes:
> + * FALLOC_ALLOCATE : This is the preallocate mode, using which an application
> + *   may request reservation of space for a particular file.
> + *   The file size will be changed if the allocation is
> + *   beyond EOF.
> + * FALLOC_RESV_SPACE :   This is same as the above mode, with only one difference
> + *   that the file size will not be modified.
> + */
> +#define FALLOC_FL_KEEP_SIZE    0x01 /* default is extend/shrink size */
> +
> +#define FALLOC_ALLOCATE        0
> +#define FALLOC_RESV_SPACE  FALLOC_FL_KEEP_SIZE

Just remove FALLOC_ALLOCATE, 0 flags should be the default.  I'm also
not sure there is any point in having two namespaces now that we have a
flags-based ABI.

Also please don't add this to fs.h.  fs.h is a complete mess and the
falloc flags are a new user ABI.  Add a linux/falloc.h instead which can
be added to headers-y so the ABI constant can be exported to userspace.
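
To illustrate the flags-based ABI suggested above, here is a minimal
userspace sketch in which a mode of 0 is the default and a single flag bit
asks to keep the file size.  The EX_ constant and ex_falloc_new_size() are
made up for this illustration and are not the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for the one flag under discussion. */
#define EX_FALLOC_FL_KEEP_SIZE 0x01

/*
 * Return the file size after preallocating [offset, offset + len).
 * mode == 0 is the default: the size grows if the range ends past EOF.
 * With EX_FALLOC_FL_KEEP_SIZE the size is left alone.
 * Returns -1 for unknown flag bits or an invalid range.
 */
static int64_t ex_falloc_new_size(int64_t old_size, int mode,
				  int64_t offset, int64_t len)
{
	int64_t end;

	if (mode & ~EX_FALLOC_FL_KEEP_SIZE)
		return -1;
	if (offset < 0 || len <= 0 || offset > INT64_MAX - len)
		return -1;

	end = offset + len;
	if ((mode & EX_FALLOC_FL_KEEP_SIZE) || end <= old_size)
		return old_size;
	return end;
}
```

With only flag bits in the mode word, a second namespace of named modes
adds nothing: callers pass 0 or OR together the bits they want.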



Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates

2007-07-11 Thread Christoph Hellwig
On Wed, Jul 11, 2007 at 05:57:17AM -0600, Andreas Dilger wrote:
> Ah, this is the patch to disable i_version updates for Lustre.  I don't
> think any normal user would use this mount option, so I don't know if
> there is a need to document it.

This is a reason to not merge it at all.  If the only user of this is
the out-of-tree lustre code there is no need to put this in.  It should
rather stay in clusterfs' patchkit.



Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Christoph Hellwig
On Wed, Jul 11, 2007 at 05:52:24AM -0600, Andreas Dilger wrote:
> On Jul 11, 2007  09:47 +0100, Christoph Hellwig wrote:
> > On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
> > > This patch is on top of i_version_update_vfs.
> > > The i_version field of the inode is set on inode creation and incremented
> > > when the inode is being modified.
> > 
> > Which is not what i_version is supposed to do.  It'll get you tons of misses
> > for NFSv3 filehandles that rely on the generation staying the same for the
> > same file.  Please add a new field for the NFSv4 sequence counter instead
> > > of making i_version unusable.
> 
> You are confusing i_generation (the instance of this inode number) with
> i_version (whether this file has been modified)?

Yes, sorry.  Objection dropped.


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig
On Wed, Jul 04, 2007 at 03:37:01PM +1000, Timothy Shimmin wrote:
> We use this capability in XFS at the moment.
> I think this is mainly for DMF (HSM) but is done via the xfs handle 
> interface
> (xfs_open_by_handle) AFAICT.
> 

You're not :)  You're using an O_INVISIBLE equivalent (as described below),
which would be a useful thing to have at the VFS level, but adding hacks
to only some system calls wouldn't help any HSM system.  It's just useless
API clutter.



Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig
On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote:
> Given the current behaviour for posix_fallocate() in glibc, I think
> that retaining the same error semantic and punting the cleanup to
> userspace (where the app will fail with ENOSPC anyway) is the only
> sane thing we can do here. Trying to undo this in the kernel leads
> to lots of extra rarely used code in error handling paths...

Agreed, looks like we should stay with the user-has-to-clean-up behaviour.


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-11 Thread Christoph Hellwig
On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
> Well, if you see the modes proposed using above flags :
> 
> #define FA_ALLOCATE   0
> #define FA_DEALLOCATE FA_FL_DEALLOC
> #define FA_RESV_SPACE FA_FL_KEEP_SIZE
> #define FA_UNRESV_SPACE   (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
> 
> FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
> for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
> flag. Hence prealloction will never delete data.
> This mode is required only for FA_UNRESV_SPACE, which is a deallocation
> mode, to support any existing XFS aware applications/usage-scenarios.

Sorry, but this doesn't make any sense.  There is no need to put every
feature in the XFS ioctls in the syscalls.  The XFS ioctls will need to
be supported forever anyway - as I suggested before they really should
be moved to generic code.

What needs to be supported is what makes sense as an interface.
A hole-punching interface does make sense, but trying to hack this into
a preallocation system call is just madness.  We're not IRIX or Windows
that fit things into random subcalls just because there was some space
left to squeeze them in.

> > > > FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
> > > > FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */
> > 
> > NACK to these as well.  If i_size changes c/mtime need updates, if the size
> > doesn't change they don't.  No need to add more flags for this.
> 
> This requirement was from the point of view of HSM applications. Hope
> you saw Andreas previous post and are keeping that in mind.

HSMs need this basically for every system call, which screams for an
open flag like O_INVISIBLE anyway.  Adding this in a generic way is
a good idea, but hacking in bits and pieces that won't fit into the global
design is completely wrong.


Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Christoph Hellwig
On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
> This patch is on top of i_version_update_vfs.
> The i_version field of the inode is set on inode creation and incremented when
> the inode is being modified.

Which is not what i_version is supposed to do.  It'll get you tons of misses
for NFSv3 filehandles that rely on the generation staying the same for the
same file.  Please add a new field for the NFSv4 sequence counter instead
of making i_version unusable.



Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-03 Thread Christoph Hellwig
On Tue, Jul 03, 2007 at 03:38:48PM +0530, Amit K. Arora wrote:
> > FA_FL_DEALLOC   0x01 /* deallocate unwritten extent (default allocate) */
> > FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */
> > FA_FL_DEL_DATA  0x04 /* delete existing data in alloc range (default keep) */
> 
> We now have two sets of flags - 
> 1) the above three with which I think no one has any issues with, and

Yes, I do.  FA_FL_DEL_DATA is plain stupid, a preallocation call should
never delete data.  FA_FL_DEALLOC should probably be a separate syscall
because it's very different functionality.

While we're at it I also dislike the FA_ prefix because it doesn't say
anything and is far too generic.  FALLOC_ is much better.

> > FA_FL_ERR_FREE  0x08 /* free preallocation on error (default keep prealloc) */

NACK on this one.  We should have just one behaviour, and judging from the
thread that is not freeing the allocation on error.

> > FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
> > FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */

NACK to these as well.  If i_size changes c/mtime need updates, if the size
doesn't change they don't.  No need to add more flags for this.


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-30 Thread Christoph Hellwig
On Wed, Jun 27, 2007 at 11:36:57PM +1000, David Chinner wrote:
> > This
> > would seem to be the only impediment from using fallocated files
> > for swap files.  Maybe if FIEMAP was used by mkswap to get an
> > "UNWRITTEN" flag back instead of "HOLE" it wouldn't be a problem.
> 
> Probably. If we taught do_mpage_readpage() about unwritten mappings,
> then would could map them on read if and then sys_swapon can remain
> blissfully unaware of unwritten extents.

Except for reading the swap header in the first page sys_swapon will
never end up in do_mpage_readpage.  It rather uses ->bmap to build
its own extent list and issues bios directly.

Now this is everything but nice and we should rather refactor the direct
I/O code to work on kernel pages without looking at their fields so this
can be done properly.  Alternatively ->bmap would grow a BMAP_SWAP flag
so the filesystem could do the right thing.

But despite not being useful for swap the patch below looks very nice
to me.  Doing things correctly in core code is always better than hacking
around it in the filesystem, especially as XFS won't stay the only filesystem
using unwritten extents.
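
As a rough illustration of what building an extent list through ->bmap
amounts to, the sketch below walks a file block by block through a mapping
callback and merges physically contiguous blocks into extents.  It is plain
userspace C; ex_bmap_fn, struct ex_extent, ex_table_bmap() and
ex_build_extents() are invented names and this is not the swapon code.

```c
#include <stddef.h>
#include <stdint.h>

/* One run of physically contiguous blocks (hypothetical layout). */
struct ex_extent {
	uint64_t logical;	/* first logical block of the run */
	uint64_t physical;	/* first physical block of the run */
	uint64_t count;		/* number of blocks in the run */
};

/* Stand-in for ->bmap: logical block -> physical block, 0 means a hole. */
typedef uint64_t (*ex_bmap_fn)(void *ctx, uint64_t logical);

/* Example mapping backed by a plain array of physical block numbers. */
struct ex_table {
	const uint64_t *blocks;
	uint64_t n;
};

static uint64_t ex_table_bmap(void *ctx, uint64_t logical)
{
	const struct ex_table *t = ctx;

	return logical < t->n ? t->blocks[logical] : 0;
}

/*
 * Map blocks [0, nblocks) and coalesce them into at most max extents.
 * Returns the number of extents written, or -1 if a hole is found or
 * the output array is too small.
 */
static int ex_build_extents(ex_bmap_fn bmap, void *ctx, uint64_t nblocks,
			    struct ex_extent *out, size_t max)
{
	size_t n = 0;
	uint64_t lblk;

	for (lblk = 0; lblk < nblocks; lblk++) {
		uint64_t pblk = bmap(ctx, lblk);

		if (pblk == 0)
			return -1;
		if (n > 0 && out[n - 1].physical + out[n - 1].count == pblk) {
			out[n - 1].count++;	/* extends the current run */
			continue;
		}
		if (n == max)
			return -1;
		out[n].logical = lblk;
		out[n].physical = pblk;
		out[n].count = 1;
		n++;
	}
	return (int)n;
}
```

A per-block callback like this only reports where a block lives, so it has
no way to say "allocated but unwritten"; that is the gap a flag such as the
proposed BMAP_SWAP would have to cover.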



Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-30 Thread Christoph Hellwig
On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
> > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > error) is hit?  Does it keep the current fallocate() or does it free it?
> 
> Currently it is left on the file system implementation. In ext4, we do
> not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> end up with partial (pre)allocation. This is inline with dd and
> posix_fallocate, which also do not free the partially allocated space.

I can't find anything in the specification of posix_fallocate
(http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
that tells what should happen to allocated blocks on error.

But common sense would be to not leak disk space on failure of this
syscall, and this definitely should not be left up to the filesystem.
Either we always leak it or always free it, and I'd strongly favour
the latter variant.

> > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > don't want to expose uninitialized disk blocks to userspace.  I'm not
> > sure if this makes sense at all.
> 
> I don't think we need to make it default - atleast for filesystems which
> have a mechanism to distinguish preallocated blocks from "regular" ones.
> In ext4, for example, we will have a way to mark uninitialized extents.
> All the preallocated blocks will be part of these uninitialized extents.
> And any read on these extents will treat them as a hole, returning
> zeroes to user land. Thus any existing data on uninitialized blocks will
> not be exposed to the userspace.

This is the xfs unwritten extent behaviour.  But anyway, the important bit
is that uninitialized blocks should never ever leak to userspace, so there is
no need for the flag.


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-30 Thread Christoph Hellwig
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> I suppose it might be a bit late in the game to add a "goal"
> parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
> the API more suitable for XFS?  The goal could be a single __u64, or
> a struct with e.g. __u64 byte offset (possibly also __u32 lun like
> in FIEMAP).  I guess the one potential limitation here is the
> number of function parameters on some architectures.

This isn't really about "more suitable for XFS" but more about more
suitable for sophisticated layout decisions.

But I'm still not confident this should be shoehorned into this
syscall.  In fact I'm already rather unhappy about the feature churn in
the current patch series.

The more I think about it the more I'd prefer we would just put a simple
syscall in that implements nothing but the posix_fallocate(3) semantics
as defined in SuS, and then go on to brainstorm about advanced
preallocation / layout hint semantics.



Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-29 Thread Christoph Hellwig
On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote:
> I think Mingming was asking that Ted move the current quilt tree into git,
> presumably because she's working off git.
> 
> I'm not sure what to do, really.  The core kernel patches need to be in
> Ted's tree for testing but that'll create a mess for me.

Could we please stop this stupid ext4-centrism?  XFS is ready so we can
put in the syscalls backed by XFS.  We have already done this with the xattr
syscalls in 2.4, btw.

Then again I don't think we should put it in quite yet, because this thread
has degraded into creeping featurism.  Please give me some more time to
prepare a semi-coherent rant about this.



Re: iov_iter_fault_in_readable fix

2007-06-14 Thread Christoph Hellwig
On Fri, Jun 15, 2007 at 08:21:09AM +1000, David Chinner wrote:
> Yeah, it can run a subset of the tests on NFS and UDF filesystems as well and
> there are some specific UDF-only tests in it too.  I think the NFS test group
> is mostly generic tests that don't use or test specific XFS features.

Actually most testcases can run on any reasonable posixish filesystem, we
just need some glue to tell the testsuite it's actually okay.


Re: iov_iter_fault_in_readable fix

2007-06-14 Thread Christoph Hellwig
On Wed, Jun 13, 2007 at 05:57:59PM +0400, Dmitriy Monakhov wrote:
>   Function prerform check for signgle region, with out respect to
>   segment nature of iovec, For example writev no longer works :)

Btw, could someone please start to collect all snippets like this in
a nice simple regression test suite?  If no one wants to start a new
one we should probably just put it into xfsqa (which should be usable
for other filesystems as well despite the name).

> 
>   /* TESTCASE BEGIN */
>   #include <sys/types.h>
>   #include <sys/stat.h>
>   #include <sys/mman.h>
>   #include <sys/uio.h>
>   #include <fcntl.h>
>   #include <string.h>
>   #define SIZE  (4096 * 2)
>   int main(int argc, char* argv[])
>   {   
>   char* ptr[4];
>   struct iovec iov[2];
>   int fd, ret;
>   ptr[0] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
>   MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
>   ptr[1] = mmap(NULL, SIZE, PROT_NONE,
>   MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
>   ptr[2] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
>   MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
>   ptr[3] = mmap(NULL, SIZE, PROT_NONE, 
>   MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 
>   iov[0].iov_base = ptr[0] + (SIZE -1);
>   iov[0].iov_len = 1;
>   memset(ptr[0], 1, SIZE);
> 
>   iov[1].iov_base = ptr[2];
>   iov[1].iov_len = SIZE;
>   memset(ptr[2], 2, SIZE);
> 
>   fd = open(argv[1], O_CREAT|O_RDWR|O_TRUNC, 0666);
>   ret = writev(fd, iov, sizeof(iov) / sizeof(struct iovec));
>   return 0;
>   }   
>   /* TESTCASE END*/
>   We will get folowing result:
>   writev(3, [{"\1", 1}, {"\2"..., 8192}], 2) = -1 EFAULT (Bad 
> address)
>   
>   this is hidden bug, and it was invisiable because _fault_in_readable
>   return value was ignored before. Lets iov_iter_fault_in_readable
>   perform checks for all segments.
> 
> Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fef19fc..7e025ea 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -433,7 +433,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
>  size_t iov_iter_copy_from_user(struct page *page,
>   struct iov_iter *i, unsigned long offset, size_t bytes);
>  void iov_iter_advance(struct iov_iter *i, size_t bytes);
> -int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
> +int iov_iter_fault_in_readable(struct iov_iter *i, size_t *bytes);
>  size_t iov_iter_single_seg_count(struct iov_iter *i);
>  
>  static inline void iov_iter_init(struct iov_iter *i,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8d59ed9..8600c3e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1817,10 +1817,32 @@ void iov_iter_advance(struct iov_iter *i, size_t 
> bytes)
>  }
>  EXPORT_SYMBOL(iov_iter_advance);
>  
> -int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
> +int iov_iter_fault_in_readable(struct iov_iter *i, size_t* bytes)
>  {
> - char __user *buf = i->iov->iov_base + i->iov_offset;
> - return fault_in_pages_readable(buf, bytes);
> + size_t len = *bytes;
> + int ret;
> + if (likely(i->nr_segs == 1)) {
> + ret = fault_in_pages_readable(i->iov->iov_base, len);
> + if (ret)
> + *bytes = 0;
> + } else {
> + const struct iovec *iov = i->iov;
> + size_t base = i->iov_offset;
> + *bytes = 0;
> + while (len) {
> + int copy = min(len, iov->iov_len - base);
> + if ((ret = fault_in_pages_readable(iov->iov_base + 
> base, copy)))
> + break;
> + *bytes += copy;
> + len -= copy;
> + base += copy;
> + if (iov->iov_len == base) {
> + iov++;
> + base = 0;
> + }
> + }
> + }
> + return ret; 
>  }
>  EXPORT_SYMBOL(iov_iter_fault_in_readable);
>  
> @@ -2110,7 +2132,7 @@ static ssize_t generic_perform_write_2copy(struct file 
> *file,
>* to check that the address is actually valid, when atomic
>* usercopies are used, below.
>*/
> - if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
> + if (unlikely(iov_iter_fault_in_readable(i, &bytes) && !bytes)) {
>   status = -EFAULT;
>   break;
>   }
> @@ -2284,7 +2306,7 @@ again:
>* to check that the address is actually valid, when atomic
>* usercopies are used, below.
>*/
> - if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
> + if (unlikely(i
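
The per-segment walk in the quoted patch can be tried out in userspace.
The sketch below is only an approximation under stated assumptions:
ex_check_segments() and ex_starts_nonzero() are invented names, and the
callback is a toy stand-in for fault_in_pages_readable().  The caller is
assumed to pass an offset that lies within the first segment.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical per-range check; returns nonzero if the range is usable. */
typedef int (*ex_range_ok_fn)(const void *base, size_t len);

/* Toy check for illustration: a range is "bad" if its first byte is 0. */
static int ex_starts_nonzero(const void *base, size_t len)
{
	(void)len;
	return ((const unsigned char *)base)[0] != 0;
}

/*
 * Walk up to bytes bytes of an iovec array, starting offset bytes into
 * the first segment, and check every piece separately instead of
 * treating the whole request as one region.  Returns how many leading
 * bytes passed the check.
 */
static size_t ex_check_segments(const struct iovec *iov, size_t nr_segs,
				size_t offset, size_t bytes,
				ex_range_ok_fn ok)
{
	size_t done = 0;
	size_t seg;

	for (seg = 0; seg < nr_segs && done < bytes; seg++) {
		size_t avail = iov[seg].iov_len - offset;
		size_t chunk = bytes - done < avail ? bytes - done : avail;

		if (chunk && !ok((const char *)iov[seg].iov_base + offset, chunk))
			break;
		done += chunk;
		offset = 0;	/* only the first segment starts mid-way */
	}
	return done;
}
```

Returning the number of bytes that passed, rather than a bare error, lets a
caller proceed with a short write when only a later segment is bad.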

Re: [RFC PATCH ext3/ext4] orphan list corruption due bad inode

2007-06-04 Thread Christoph Hellwig
On Tue, Jun 05, 2007 at 10:11:12AM +0400, Vasily Averin wrote:
> >>return d_splice_alias(inode, dentry);
> >>  }
> > Seems reasonable.  So this prevents the bad inodes from getting onto the
> > orphan list in the first place?
> 
> make_bad_inode() is called from ext3_read_inode() that is called from iget() 
> only.

Which is an artefact of using the read_inode interface.  Please switch from
iget to iget_locked and you can handle this case without ever inserting the
"bad" inode into the inode hash.



Re: [PATCH] Check for error returned by kthread_create on creating journal thread

2007-04-16 Thread Christoph Hellwig
On Mon, Apr 16, 2007 at 03:10:42PM +0400, Pavel Emelianov wrote:
> Christoph Hellwig wrote:
> > On Mon, Apr 16, 2007 at 11:41:14AM +0400, Pavel Emelianov wrote:
> >> If the thread failed to create the subsequent wait_event
> >> will hang forever.
> >>
> >> This is likely to happen if kernel hits max_threads limit.
> >>
> >> Will be critical for virtualization systems that limit the
> >> number of tasks and kernel memory usage within the container.
> > 
> >> --- ./fs/jbd/journal.c.jbdthreads  2007-04-16 11:17:36.0 +0400
> >> +++ ./fs/jbd/journal.c 2007-04-16 11:30:09.0 +0400
> >> @@ -211,10 +211,16 @@ end_loop:
> >>return 0;
> >>  }
> >>  
> >> -static void journal_start_thread(journal_t *journal)
> >> +static int journal_start_thread(journal_t *journal)
> >>  {
> >> -  kthread_run(kjournald, journal, "kjournald");
> >> +  struct task_struct *t;
> >> +
> >> +  t = kthread_run(kjournald, journal, "kjournald");
> >> +  if (IS_ERR(t))
> >> +  return PTR_ERR(t);
> >> +
> >>wait_event(journal->j_wait_done_commit, journal->j_task != 0);
> > 
> > Note that this wait_event should exist at all, and the return
> 
> Should NOT you mean?

Umm, yes - of course :)

> > value of kthread_run should be assigned to journal->j_task.  Also
> > the code doesn't use the kthread primitives in other places leading
> > to crufty code.
> 
> Well, this could be done with a separate patch, I think.


Re: [PATCH] Check for error returned by kthread_create on creating journal thread

2007-04-16 Thread Christoph Hellwig
On Mon, Apr 16, 2007 at 11:41:14AM +0400, Pavel Emelianov wrote:
> If the thread failed to create the subsequent wait_event
> will hang forever.
> 
> This is likely to happen if kernel hits max_threads limit.
> 
> Will be critical for virtualization systems that limit the
> number of tasks and kernel memory usage within the container.

> --- ./fs/jbd/journal.c.jbdthreads 2007-04-16 11:17:36.0 +0400
> +++ ./fs/jbd/journal.c2007-04-16 11:30:09.0 +0400
> @@ -211,10 +211,16 @@ end_loop:
>   return 0;
>  }
>  
> -static void journal_start_thread(journal_t *journal)
> +static int journal_start_thread(journal_t *journal)
>  {
> - kthread_run(kjournald, journal, "kjournald");
> + struct task_struct *t;
> +
> + t = kthread_run(kjournald, journal, "kjournald");
> + if (IS_ERR(t))
> + return PTR_ERR(t);
> +
>   wait_event(journal->j_wait_done_commit, journal->j_task != 0);

Note that this wait_event should exist at all, and the return
value of kthread_run should be assigned to journal->j_task.  Also
the code doesn't use the kthread primitives in other places leading
to crufty code.
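
The idea of checking the thread creator's return value and storing it
directly in journal->j_task can be mimicked in userspace.  This is a sketch
only: the ex_ helpers are stand-ins for the kernel's ERR_PTR(), IS_ERR() and
PTR_ERR(), and ex_spawn_fn is a hypothetical replacement for kthread_run();
none of it is the jbd code.

```c
#include <errno.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
#define EX_MAX_ERRNO 4095

static void *ex_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static int ex_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-EX_MAX_ERRNO;
}

static long ex_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

struct ex_journal {
	void *task;	/* plays the role of journal->j_task */
};

/* Hypothetical thread constructor: returns a handle or an error pointer. */
typedef void *(*ex_spawn_fn)(struct ex_journal *j);

/* Two toy constructors: one that succeeds, one that hits a task limit. */
static void *ex_spawn_ok(struct ex_journal *j)
{
	return j;	/* any valid address serves as a dummy handle */
}

static void *ex_spawn_fail(struct ex_journal *j)
{
	(void)j;
	return ex_err_ptr(-EAGAIN);
}

static int ex_journal_start_thread(struct ex_journal *j, ex_spawn_fn spawn)
{
	void *t = spawn(j);

	if (ex_is_err(t))
		return (int)ex_ptr_err(t);
	j->task = t;	/* the creator records the handle itself */
	return 0;
}
```

Because the creator stores the handle, nothing has to wait for the new
thread to announce itself, and a failed creation is reported to the caller
instead of hanging.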



Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-13 Thread Christoph Hellwig
On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
> struct fibmap_extent {
>   __u64 fe_start; /* starting offset in bytes */
>   __u64 fe_len;   /* length in bytes */
> }
> 
> struct fibmap {
>   struct fibmap_extent fm_start;  /* offset, length of desired mapping */
>   __u32 fm_extent_count;  /* number of extents in array */
>   __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
>   __u64 unused;
>   struct fibmap_extent fm_extents[0];
> }
> 
> #define FIEMAP_LEN_MASK   0xff
> #define FIEMAP_LEN_HOLE   0x01
> #define FIEMAP_LEN_UNWRITTEN  0x02
> 
> All offsets are in bytes to allow cases where filesystems are not going
> block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
> returned contains the packed list of allocation extents for the file,
> including entries for holes (which have fe_start == 0, and a flag).

> One feature that XFS_IOC_GETBMAPX has that may be desirable is the
> ability to return unwritten extent information.  In order to do this XFS
> required expanding the per-extent struct from 32 to 48 bytes per extent,
> but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship)
> and keep 8 bytes or so for input/output flags per extent (would need to
> be masked before use).

I'd be much happier to have the separate per-extent flags value.
For one thing this allows much nicer representations of unwritten
extents or holes without taking away bits from the len value.  It also
allows us to make interesting use of this in the future, e.g. telling
about an offline extent for use in HSM applications.  Also for
this kernel<->user interface the wasted space shouldn't matter too
much - if you want to pass the above condensed structure over the
wire in lustre that shouldn't be a problem, you'd have to convert
to an endian-neutral on-the-wire format anyway.  Not doing the
masking also makes the interface quite a bit simpler to use.
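
A sketch of what a separate per-extent flags value could look like, as
opposed to stealing bits from the length.  The struct layout, the EX_ flag
values and ex_written_bytes() are assumptions made for this example and are
not a proposed ABI.

```c
#include <stdint.h>

/* Hypothetical flag bits; not taken from any kernel header. */
#define EX_EXTENT_HOLE		0x01u
#define EX_EXTENT_UNWRITTEN	0x02u

/* Extent record with its own flags word: len keeps all 64 bits. */
struct ex_fiemap_extent {
	uint64_t start;		/* starting offset in bytes */
	uint64_t len;		/* length in bytes, no masking needed */
	uint32_t flags;		/* EX_EXTENT_* bits */
	uint32_t reserved;	/* pads the record to 24 bytes */
};

/* Sum the bytes that hold written data (neither hole nor unwritten). */
static uint64_t ex_written_bytes(const struct ex_fiemap_extent *ext,
				 unsigned int n)
{
	uint64_t total = 0;
	unsigned int i;

	for (i = 0; i < n; i++) {
		if (ext[i].flags & (EX_EXTENT_HOLE | EX_EXTENT_UNWRITTEN))
			continue;
		total += ext[i].len;
	}
	return total;
}
```

The cost is 8 extra bytes per extent compared with the packed 16-byte
form; in return a consumer reads len as-is and new flag bits can be added
without shrinking the maximum extent size.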

One additional feature from the XFS getbmapx interface we should
provide is the ability to query layout of xattrs.  While other
filesystems might not have the exact xattr fork XFS has it fits
nicely into the interface.  Especially when we have Anton's suggested
flag for inline data.



Re: [RFC] Heads up on sys_fallocate()

2007-03-06 Thread Christoph Hellwig
On Tue, Mar 06, 2007 at 06:36:09AM -0800, Ulrich Drepper wrote:
> Christoph Hellwig wrote:
> > fallocate with the whence argument and flags is already quite complicated,
> > I'd rather have another call for placement decisions, that would
> > be called on an fd to do placement decisions for any further allocations
> > (prealloc, write, etc)
> 
> Yes, posix_fallocate shouldn't be made more complicated.  But I don't
> understand why requesting linear layout of the blocks should be an
> option.  It's always an advantage if the blocks requested this way are
> linear on disk.  So, the kernel should always do its best to make this
> happen, without needing an additional option.

There are HPC workloads where you have multiple writers on multiple machines
that write to different parts of a file.  You preferably want each
of those regions in separate allocation groups.  (Or tell the customers
to use separate files for the regions..)


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Christoph Hellwig
On Mon, Mar 05, 2007 at 12:02:59PM -0800, Mingming Cao wrote:
> Yep, I think it makes sense to use preallocation for defragmentation.
> After all, both preallocation and defragmentation call the underlying
> filesystem's multiple-block allocator to try to allocate a chunk of
> contiguous blocks on disk. The ext4 online defrag implementation by Takashi
> already supports choosing a "goal" allocation block to guide the ext4
> block allocator to place the defragmented file in a specific location.
> 
> Passing a little bit more of a hint to sys_fallocate() (i.e., the goal block,
> and/or whether the goal block is more important than the size of the prealloc
> extent) might make it more useful for the original goal (getting contiguous
> and guaranteed blocks) and for defragmentation.

fallocate with the whence argument and flags is already quite complicated.
I'd rather have another call for placement decisions, which would
be called on an fd to make placement decisions for any further allocations
(prealloc, write, etc).


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Christoph Hellwig
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
> Theodore Tso wrote:
> > Given that glibc already has to support this for older kernels, I
> > would argue that there's no point putting in generic support for
> > filesystem that can't support a more advanced way of doing things.
> 
> Well, I'm sure the kernel can do better than the code we have in libc
> now.  The kernel has access to the bitmasks which say which blocks have
> already been allocated.

The layer of the kernel where a totally generic fallback would be
implemented does not have access to this information.  We could do
a mostly generic helper for block filesystems that lets them implement
fallocate this way without a lot of code of their own.


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Christoph Hellwig
On Sat, Mar 03, 2007 at 11:45:32PM +0100, Arnd Bergmann wrote:
> > I'd be more happy to have the write out zeroes loop in glibc.  And
> > glibc needs to have it anyway, for older kernels.
> 
> A generic_fallocate makes sense to me iff we can do it in the kernel
> significantly more efficiently than in glibc, e.g. by using only
> a single page in page cache instead of one for each page to be preallocated.

We can't do that with the current page cache interfaces.  But what
might make sense is to have a block_dump_prealloc that takes a get_block
callback to do what you propose.  It still wouldn't be entirely generic,
but would allow block based filesystems to do a not entirely dumb
implementation.
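
To make the shape of such a helper concrete, here is a userspace toy model,
not kernel code.  Every name in it (toy_disk, toy_fs, toy_get_block_t,
toy_block_prealloc) is made up for illustration, and the "disk" is just an
array.  The helper asks a filesystem-supplied callback to map or allocate
each block and zeroes only the newly allocated ones from a single shared
buffer of zeroes.

```c
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE  512
#define TOY_DISK_BLOCKS 64

/* Toy "disk": an in-memory array standing in for a block device. */
struct toy_disk {
	unsigned char data[TOY_DISK_BLOCKS][TOY_BLOCK_SIZE];
};

/*
 * Hypothetical callback type, loosely modelled on the get_block idea:
 * map a logical file block to a physical block, allocating one when
 * create is non-zero.  *is_new reports whether the block was allocated
 * by this call.  Returns 0 on success, -1 on failure.
 */
typedef int (*toy_get_block_t)(void *fs_private, size_t logical, int create,
			       size_t *physical, int *is_new);

/* Toy filesystem state: map[logical] holds physical + 1, 0 means hole. */
struct toy_fs {
	size_t map[TOY_DISK_BLOCKS];
	size_t next_free;
};

/* The toy filesystem's own callback: a trivial bump allocator. */
int toy_fs_get_block(void *fs_private, size_t logical, int create,
		     size_t *physical, int *is_new)
{
	struct toy_fs *fs = fs_private;

	if (logical >= TOY_DISK_BLOCKS)
		return -1;
	*is_new = 0;
	if (fs->map[logical] == 0) {
		if (!create || fs->next_free >= TOY_DISK_BLOCKS)
			return -1;
		fs->map[logical] = ++fs->next_free;
		*is_new = 1;
	}
	*physical = fs->map[logical] - 1;
	return 0;
}

/* Mostly generic helper: the only filesystem-specific part is get_block. */
int toy_block_prealloc(struct toy_disk *disk, void *fs_private,
		       toy_get_block_t get_block, size_t first, size_t count)
{
	/* One shared buffer of zeroes, reused for every new block. */
	static const unsigned char zeroes[TOY_BLOCK_SIZE];
	size_t i;

	for (i = 0; i < count; i++) {
		size_t physical;
		int is_new;

		if (get_block(fs_private, first + i, 1, &physical, &is_new) != 0)
			return -1;
		if (physical >= TOY_DISK_BLOCKS)
			return -1;
		/* Never zero a block that already held file data. */
		if (is_new)
			memcpy(disk->data[physical], zeroes, TOY_BLOCK_SIZE);
	}
	return 0;
}
```

Blocks that were already mapped are left alone, so calling the helper twice
over the same range does not destroy data written in between.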


Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Christoph Hellwig
On Sun, Mar 04, 2007 at 08:11:17PM +, Anton Altaparmakov wrote:
> glibc cannot ever be smart enough because a file system driver will  
> always know better and be able to do things in a much more optimized  
> way.

Please read the thread again.  That is not what anyone proposed.
The issue we're discussing is whether the fallback for a filesystem that
does not support preallocation natively should be done in kernelspace
or in userspace.



Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Christoph Hellwig
On Thu, Mar 01, 2007 at 05:29:15PM -0600, Eric Sandeen wrote:
> Amit K. Arora wrote:
> 
> Might want more error checking in there, something like (rough cut)...
> (or is some of this glibc's job?)

Yeah, we need to have these checks.  We can't rely on userspace not
passing arguments that might corrupt your filesystem or let you
escalate privileges.

> which would keep things in line with posix_fallocate's specified errors, 
> too?

Yes, very good idea.
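
As a rough illustration of the kind of checks meant here (this is not the
code from Eric's mail, which is not quoted above), a range check along
posix_fallocate's documented errors could look like the sketch below.  The
helper name and the max_bytes parameter, which stands in for the
filesystem's maximum file size and is assumed to be non-negative, are
made up for the example.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: validate an (offset, len) request before any
 * filesystem code sees it.  Returns 0 or a negative errno value.
 */
int check_fallocate_range(int64_t offset, int64_t len, int64_t max_bytes)
{
	/* Negative offsets and empty or negative lengths are invalid. */
	if (offset < 0 || len <= 0)
		return -EINVAL;

	/*
	 * Compare against the remaining room instead of computing
	 * offset + len, which could overflow a signed 64-bit value.
	 */
	if (len > max_bytes - offset)
		return -EFBIG;

	return 0;
}
```

EINVAL and EFBIG are the errors POSIX lists for posix_fallocate in these
two cases, so a syscall that returns them needs no translation in glibc.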



Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Christoph Hellwig
On Thu, Mar 01, 2007 at 10:44:16PM +, Dave Kleikamp wrote:
> Would EINVAL (or whatever) make it back to the caller of
> posix_fallocate(), or would glibc fall back to its current
> implementation?
> 
> Forgive me if I haven't put enough thought into it, but would it be
> useful to create a generic_fallocate() that writes zeroed pages for any
> non-existent pages in the range?  I don't know how glibc currently
> implements posix_fallocate(), but maybe the kernel could do it more
> efficiently, even in generic code.  Maybe we don't care, since the major
> file systems can probably do something better in their own code.

I'd be happier to have the write-out-zeroes loop in glibc.  And
glibc needs to have it anyway, for older kernels.
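
For reference, a userspace loop of that kind can be sketched as follows.
This is a simplified illustration and not glibc's actual implementation;
the function name fallback_fallocate is made up, and the sketch assumes
that offset + len does not overflow off_t.  It touches one byte in every
block of the range so the filesystem has to allocate it.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical userspace fallback.  Returns 0 or a positive errno
 * value, the way posix_fallocate does.
 */
int fallback_fallocate(int fd, off_t offset, off_t len)
{
	struct stat st;
	off_t end, pos;
	long blksize;

	if (offset < 0 || len <= 0)
		return EINVAL;
	if (fstat(fd, &st) != 0)
		return errno;

	end = offset + len;	/* assumed not to overflow */
	blksize = st.st_blksize > 0 ? (long)st.st_blksize : 512;

	/* Visit the start of the range, then every block boundary. */
	for (pos = offset; pos < end; pos += blksize - (pos % blksize)) {
		char byte = 0;

		/*
		 * Inside the existing file, read the byte and write it
		 * back so data is preserved.  This races with concurrent
		 * writers, one reason a kernel implementation is nicer.
		 */
		if (pos < st.st_size && pread(fd, &byte, 1, pos) < 0)
			return errno;
		if (pwrite(fd, &byte, 1, pos) != 1)
			return errno ? errno : EIO;
	}

	/* Make sure the file size covers the whole range. */
	if (end > st.st_size) {
		char zero = 0;

		if (pwrite(fd, &zero, 1, end - 1) != 1)
			return errno ? errno : EIO;
	}
	return 0;
}
```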


Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Christoph Hellwig
On Fri, Mar 02, 2007 at 12:04:45AM +0530, Amit K. Arora wrote:
> This is to give a heads up on few patches that we will be soon coming up
> with. These patches implement a new system call sys_fallocate() and a
> new inode operation "fallocate", for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
> 
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> 
> As we are developing and testing the required patches, we decided to
> post a preliminary patch and get inputs from the community to give it
> a right direction and shape. First, a little description on the feature.

Thanks a lot, this has been long overdue.

Please don't forget to Cc the XFS list, to keep the developers of the only
Linux filesystem that has supported persistent allocations for a long time
in the loop :)

Various people will beat you up for the above syscall as lots of
architectures really want 64bit arguments aligned in a proper way,
e.g. you at least need a pad after 'int fd'.  Then again I already
have suggestions for filling up that slot with useful information:

 - you really want a whence argument like lseek's, as it makes a lot
   of sense for applications to allocate from the end of the file
   or the current file position.  The existing XFS ioctl already
   has this, and it's trivial to support this in any preallocation
   implementation I could imagine.
 - we should think about having a flag value for which kind of preallocation
   we want.  XFS currently has two:

	ALLOCSP which updates the inode size and physically zeroes blocks
	RESVSP  which does not update inode size but creates an unwritten
		extent

   the current posix_fallocate semantics are somewhere in the middle, as
   it requires an update to the inode size, but does not specify at
   all what happens if you read from the newly allocated space.
   And yes, as a heads up to developers implementing this feature
   on new filesystems: don't just return new blocks with whatever stale
   data they hold, that's a gaping security hole :)
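
To show how little the whence part costs, here is a small sketch of
turning (whence, offset) into an absolute start offset.  The helper name
resolve_whence and its parameters are made up for illustration; cur_pos
and file_size stand in for the file position and inode size, both
assumed to be non-negative.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/*
 * Hypothetical helper: resolve an lseek-style (whence, offset) pair
 * into an absolute start offset.  Returns 0 or a negative errno value.
 */
int resolve_whence(int whence, int64_t offset, int64_t cur_pos,
		   int64_t file_size, int64_t *start)
{
	int64_t base;

	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = cur_pos;
		break;
	case SEEK_END:
		base = file_size;
		break;
	default:
		return -EINVAL;
	}

	/* base is non-negative, so only a positive offset can overflow. */
	if (offset > 0 && base > INT64_MAX - offset)
		return -EOVERFLOW;
	if (base + offset < 0)
		return -EINVAL;

	*start = base + offset;
	return 0;
}
```

With SEEK_END and an offset of 0 the resolved start is the current file
size, which is the "allocate from the end of the file" case.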

> +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> +{
> +	struct file *file;
> +	struct inode *inode;
> +	long ret = -EINVAL;
> +	file = fget(fd);
> +	if (!file)
> +		goto out;
> +	inode = file->f_path.dentry->d_inode;
> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, offset, len);
> +	else
> +		ret = -ENOTTY;
> +	fput(file);
> +out:
> +	return ret;
> +}

This should use fget_light, and I'm sure the code could be written
in a slightly more readable way:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
{
	struct file *file = fget(fd);
	long ret = -EINVAL;

	if (file) {
		struct inode *inode = file->f_path.dentry->d_inode;

		if (inode->i_op && inode->i_op->fallocate)
			ret = inode->i_op->fallocate(inode, offset, len);
		else
			ret = -ENOTTY;
		fput(file);
	}

	return ret;
}

p.s. you reference ext4_fallocate in the patch but don't actually
introduce it, so it definitely won't compile as-is :)


Re: [RFC] delayed allocation for ext4

2006-12-23 Thread Christoph Hellwig
On Sat, Dec 23, 2006 at 02:31:23PM +1100, David Chinner wrote:
> >  - ext4-delayed-allocation.patch
> >delayed allocation itself, enabled by "delalloc" mount option.
> >extents support is also required. currently it works only
> >with blocksize=pagesize.
> 
> Ah, that's why you can get away with a page flag - you've ignored
> the partial page delay state problem. Any plans to use the
> existing method in the future so we will be able to use ext4 delalloc
> on machines with a page size larger than 4k?

I think fixing this up for blocksize < pagesize is an absolute requirement
to get things merged.  We don't need more filesystems that are crippled
on half of our platforms.

Note that recording delayed alloc state at a page granularity in addition
to just the buffer heads has a lot of advantages as well and would help
xfs, too.  But I think it makes a lot more sense to record it as a radix
tree tag to speed up the gang lookups for delalloc conversion.


Re: Directories > 2GB

2006-10-10 Thread Christoph Hellwig
On Mon, Oct 09, 2006 at 09:15:28PM -0500, Steve Lord wrote:
> Hi Dave,
> 
> My recollection is that it used to default to on, it was disabled
> because it needs to map the buffer into a single contiguous chunk
> of kernel memory. This was placing a lot of pressure on the memory
> remapping code, so we made it not default to on as reworking the
> code to deal with non contig memory was looking like a major
> effort.

Exactly.  The code works but tends to go OOM pretty fast, at least
when the dir blocksize is bigger than the page size.  I should
give the code a spin on my ppc box with 64k pages to see if it works
better there.
