Re: [PATCH 0/5] fallocate system call

2007-04-30 Thread Chris Wedgwood
On Mon, Apr 30, 2007 at 03:56:32PM +1000, David Chinner wrote:
 On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:

 IIRC, the argument for FA_ALLOCATE changing file size is that
 posix_fallocate() is supposed to change the file size.

But it's not posix_fallocate; it's something more generic. glibc can
do posix_fallocate using truncate + fallocate.
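
The split suggested here (allocate without touching the size, then extend separately) can be sketched in user space. This is only a sketch under stated assumptions: it uses the fallocate(2) call and the FALLOC_FL_KEEP_SIZE mode as they later appeared in Linux and glibc, not the FA_* modes under discussion in this thread, and the helper name prealloc_then_extend is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: reserve blocks without changing the file size,
 * then grow the file with ftruncate() only if the range ends past EOF.
 * Returns 0 on success or a positive errno value.
 */
static int prealloc_then_extend(int fd, off_t offset, off_t len)
{
	struct stat st;

	if (offset < 0 || len <= 0)
		return EINVAL;

	/* Step 1: allocate space, leaving the file size alone. */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) != 0)
		return errno;

	/* Step 2: extend the file size only when it is too small. */
	if (fstat(fd, &st) != 0)
		return errno;
	if (st.st_size < offset + len && ftruncate(fd, offset + len) != 0)
		return errno;

	return 0;
}
```

With this split the kernel interface never has to decide whether allocation changes the file size; the caller makes that choice explicitly.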

 Note that the way XFS implements growing the file size after the
 allocation is via a truncate

What's wrong with that?  That seems very reasonable.

 That's what I did, because otherwise you'd use ftruncate64().
 Without documented behaviour or an ext4 implementation, I have to
 ask what it's supposed to do, though ;)

How many *real* users are there for ext4?  Why does 'what ext4 does'
define 'the semantics'?

Surely semantics should be decided either by precedent (if there is an
existing relevant userbase) or sensible thought and some debate?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext2/3 block remapping tool

2007-04-30 Thread Jan Kara
On Fri 27-04-07 12:09:42, Andreas Dilger wrote:
 On Apr 26, 2007  21:29 +0200, Jan Kara wrote:
I've been lately playing with remapping ext2/ext3 blocks (especially how
  much it can give us in terms of speed of things like KDE start). For that
  I've written two simple tools (you can get them from
  ftp.suse.com/pub/people/jack/ext3remapper.tar.gz):
e2block2file to transform (preparsed) output from blktrace into a list
  of accessed files and offsets accessed
e2remapblocks to use output from e2block2file and remap blocks into big
  chunks in the order in which they were accessed.
 
 Does it map the whole file contiguously, or does it interleave blocks of the
 file in the order they are accessed?  I would hope that it maps the whole
 file contiguously, and let readahead work properly to fetch the whole file.
 Also, keeping the file contiguous avoids fragmentation later if that file is
 updated, deleted, etc, and conflicts with allocator/defrag/etc.
  No, it does interleave blocks of different files. Reading the whole file
is exactly what you often don't want. During startup KDE (which was my
benchmark) accesses basically two things: shared libraries and config
files / icons. Config files and icons usually fit into a single block, so
just mapping them
in the right order close together is fine. On the other hand, shared
libraries are large and you usually need just a few blocks scattered all
over them. So here we just remap those few blocks we need...
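
The interleaving described above can be sketched as follows. This is a minimal illustration, not the code of e2remapblocks: the struct layouts, the function name plan_remap and the idea of a single contiguous target chunk are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* One traced access: which block of which file was read. */
struct access {
	uint32_t ino;   /* inode number of the file */
	uint32_t lblk;  /* logical block within the file */
};

/* One planned move: place (ino, lblk) at physical block 'target'. */
struct remap {
	uint32_t ino;
	uint32_t lblk;
	uint64_t target;
};

/*
 * Build a remap plan from a trace: each distinct (ino, lblk) gets the
 * next block of a contiguous chunk starting at 'chunk_start', in order
 * of first access, so blocks of different files end up interleaved.
 * Returns the number of entries written to 'plan' (at most 'max').
 * The duplicate check is O(n^2) to keep the sketch short.
 */
static size_t plan_remap(const struct access *trace, size_t n,
			 uint64_t chunk_start, struct remap *plan, size_t max)
{
	size_t used = 0;

	for (size_t i = 0; i < n && used < max; i++) {
		int seen = 0;

		for (size_t j = 0; j < used; j++) {
			if (plan[j].ino == trace[i].ino &&
			    plan[j].lblk == trace[i].lblk) {
				seen = 1;
				break;
			}
		}
		if (seen)
			continue;

		plan[used].ino = trace[i].ino;
		plan[used].lblk = trace[i].lblk;
		plan[used].target = chunk_start + used;
		used++;
	}
	return used;
}
```

Only the blocks that were actually touched are planned; the untouched blocks of a large shared library stay where they are.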
  I see the downsides of this approach. If the file is rewritten, you
lose the tight packing, but this is not going to happen often. I'm more
seriously concerned about the possibility that this optimization of
startup time may hurt running performance or, more probably, the
performance of other apps...

(see README in the tools archive for more details)
  
So far the tools (especially e2remapblocks ;) work on unmounted
  filesystem. The ultimate goal is to be able to do similar things for
  mounted filesystems but I wanted to see whether block remapping is worth it
  and what kernel interfaces would be useful for achieving the goal.
 
 I'd prefer that such functionality be integrated with Takashi's online
 defrag tool, since it needs virtually the same functionality.  For that
  Yes, definitely these two have quite similar needs and I'd like to have
just one tool in the end.

 matter, this is also very similar to the block-mapped -> extents tool
 from Aneesh.  It doesn't make sense to have so many separate tools for
 users, especially if they start interfering with each other (i.e. defrag
 undoes the remapping done by your tool).
  Agreed.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: Ext2/3 block remapping tool

2007-04-30 Thread Theodore Tso
On Fri, Apr 27, 2007 at 12:09:42PM -0600, Andreas Dilger wrote:
 I'd prefer that such functionality be integrated with Takashi's online
 defrag tool, since it needs virtually the same functionality.  For that
 matter, this is also very similar to the block-mapped -> extents tool
 from Aneesh.  It doesn't make sense to have so many separate tools for
 users, especially if they start interfering with each other (i.e. defrag
 undoes the remapping done by your tool).

Yep, in fact, I'm really glad that Jan is working on the remapping
tool because if the on-line defrag kernel interfaces don't have the
right support for it, then that means we need to fix the on-line
defrag patches.  :-)

While we're at it, someone want to start thinking about on-line
shrinking of ext4 filesystems?  Again, the same block remapping
interfaces for defrag and file access optimizations should also be
useful for shrinking filesystems (even if some of the files that need
to be relocated are being actively used).  If not, that probably means
we got the interface wrong.

- Ted


Re: Ext2/3 block remapping tool

2007-04-30 Thread Jan Kara
On Mon 30-04-07 08:09:30, Theodore Tso wrote:
 On Fri, Apr 27, 2007 at 12:09:42PM -0600, Andreas Dilger wrote:
  I'd prefer that such functionality be integrated with Takashi's online
  defrag tool, since it needs virtually the same functionality.  For that
  matter, this is also very similar to the block-mapped -> extents tool
  from Aneesh.  It doesn't make sense to have so many separate tools for
  users, especially if they start interfering with each other (i.e. defrag
  undoes the remapping done by your tool).
 
 Yep, in fact, I'm really glad that Jan is working on the remapping
 tool because if the on-line defrag kernel interfaces don't have the
 right support for it, then that means we need to fix the on-line
 defrag patches.  :-)
  ;-) That was exactly the reason why I wrote the userspace program - so
that I have something in hand when we start discussing what the kernel
interface will look like.

 While we're at it, someone want to start thinking about on-line
 shrinking of ext4 filesystems?  Again, the same block remapping
 interfaces for defrag and file access optimizations should also be
 useful for shrinking filesystems (even if some of the files that need
 to be relocated are being actively used).  If not, that probably means
 we got the interface wrong.
  Yes, that's a good idea. Currently it seems to me that block+inode
relocation (which we also need for defrag) would be enough to support
filesystem shrinking. Actually, in some ancient times (like 6-7 years ago)
I wrote ext2 online filesystem shrinking. The patch is probably unusably
obsolete by now, but I can still dig it out and look at which functions I
needed at that time.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-30 Thread Theodore Tso
On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
 chunkfs. The other is reverse maps (aka back pointers) for blocks ->
 inodes and inodes -> directories that obviate the need to have large
 amounts of memory to check for collisions.

Yes, I missed the fact that you had back pointers for blocks as well
as inodes.  So the block table in the tile header gets used for
determining if a block is free, much like is done with FAT, right?

That's a clever system; I like it.  It does mean that there is a lot
more metadata updates, but since you're not journaling, that should
counter that effect to some extent.

IMHO, it's definitely worth a try to see how well it works!

- Ted


[PATCH] Implement renaming for debugfs

2007-04-30 Thread Jan Kara
  Hello,

  the attached patch implements renaming for debugfs. I was asked for this
feature by the WLAN guys and I guess it makes sense (they have some debug
info in a directory identified by interface name, and that can change...).
Could someone have a look at what I wrote and check whether it looks
reasonable? Thanks.

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Implement debugfs_rename() to allow renaming files/directories in debugfs.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/debugfs/inode.c linux-2.6.21-rc6-1-debugfs_rename/fs/debugfs/inode.c
--- linux-2.6.21-rc6/fs/debugfs/inode.c	2007-04-10 17:09:55.0 +0200
+++ linux-2.6.21-rc6-1-debugfs_rename/fs/debugfs/inode.c	2007-04-30 19:29:32.0 +0200
@@ -368,6 +368,72 @@ void debugfs_remove(struct dentry *dentr
 }
 EXPORT_SYMBOL_GPL(debugfs_remove);
 
+/**
+ * debugfs_rename - rename a file/directory in the debugfs filesystem
+ * @old_dir: a pointer to the parent dentry for the renamed object. This
+ *  should be a directory dentry.
+ * @old_dentry: dentry of an object to be renamed.
+ * @new_dir: a pointer to the parent dentry where the object should be
+ *  moved. This should be a directory dentry.
+ * @new_name: a pointer to a string containing the target name.
+ *
+ * This function renames a file/directory in debugfs.  The target must not
+ * exist for rename to succeed.
+ *
+ * This function will return a pointer to a dentry if it succeeds. If an
+ * error occurs, %NULL will be returned.
+ *
+ * If debugfs is not enabled in the kernel, the value -%ENODEV will be
+ * returned.
+ */
+struct dentry *debugfs_rename(struct dentry *old_dir, struct dentry *old_dentry,
+		struct dentry *new_dir, char *new_name)
+{
+	int error;
+	struct dentry *dentry = NULL, *trap;
+	const char *old_name;
+
+	error = simple_pin_fs(&debug_fs_type, &debugfs_mount,
+			  &debugfs_mount_count);
+	if (error)
+		return NULL;
+	trap = lock_rename(new_dir, old_dir);
+	/* Source or destination directories don't exist? */
+	if (!old_dir->d_inode || !new_dir->d_inode)
+		goto exit;
+	/* Source does not exist, cyclic rename, or mountpoint? */
+	if (!old_dentry->d_inode || old_dentry == trap ||
+	    d_mountpoint(old_dentry))
+		goto exit;
+	dentry = lookup_one_len(new_name, new_dir, strlen(new_name));
+	/* Lookup failed, cyclic rename or target exists? */
+	if (IS_ERR(dentry) || dentry == trap || dentry->d_inode)
+		goto exit;
+
+	old_name = fsnotify_oldname_init(old_dentry->d_name.name);
+
+	error = simple_rename(old_dir->d_inode, old_dentry, new_dir->d_inode,
+		dentry);
+	if (error) {
+		fsnotify_oldname_free(old_name);
+		goto exit;
+	}
+	d_move(old_dentry, dentry);
+	fsnotify_move(old_dir->d_inode, new_dir->d_inode, old_name,
+		old_dentry->d_name.name, S_ISDIR(dentry->d_inode->i_mode),
+		dentry->d_inode, old_dentry->d_inode);
+	fsnotify_oldname_free(old_name);
+	unlock_rename(new_dir, old_dir);
+	return dentry;
+exit:
+	if (dentry && !IS_ERR(dentry))
+		dput(dentry);
+	unlock_rename(new_dir, old_dir);
+	simple_release_fs(&debugfs_mount, &debugfs_mount_count);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(debugfs_rename);
+
 static decl_subsys(debug, NULL, NULL);
 
 static int __init debugfs_init(void)
diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/include/linux/debugfs.h linux-2.6.21-rc6-1-debugfs_rename/include/linux/debugfs.h
--- linux-2.6.21-rc6/include/linux/debugfs.h	2007-04-10 17:09:58.0 +0200
+++ linux-2.6.21-rc6-1-debugfs_rename/include/linux/debugfs.h	2007-04-30 19:45:54.0 +0200
@@ -38,6 +38,9 @@ struct dentry *debugfs_create_symlink(co
 
 void debugfs_remove(struct dentry *dentry);
 
+struct dentry *debugfs_rename(struct dentry *old_dir, struct dentry *old_dentry,
+struct dentry *new_dir, char *new_name);
+
 struct dentry *debugfs_create_u8(const char *name, mode_t mode,
  struct dentry *parent, u8 *value);
 struct dentry *debugfs_create_u16(const char *name, mode_t mode,
@@ -83,6 +86,12 @@ static inline struct dentry *debugfs_cre
 static inline void debugfs_remove(struct dentry *dentry)
 { }
 
+static inline struct dentry *debugfs_rename(struct dentry *old_dir, struct dentry *old_dentry,
+struct dentry *new_dir, char *new_name)
+{
+	return ERR_PTR(-ENODEV);
+}
+
 static inline struct dentry *debugfs_create_u8(const char *name, mode_t mode,
 	   struct dentry *parent,
 	   u8 *value)
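
One point for callers of the proposed function: the kernel-doc comment says NULL is returned on error, while the stub used when debugfs is disabled returns ERR_PTR(-ENODEV), so a caller has to treat both as failure. The following is a user-space model of that check, not kernel code. ERR_PTR, PTR_ERR and IS_ERR are simplified re-creations of the kernel helpers from include/linux/err.h, rename_status is a hypothetical name, and mapping NULL to -EINVAL is an arbitrary choice because the patch does not report the cause.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified user-space copies of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error pointers live in the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical helper: turn the return value of a debugfs_rename()-like
 * function into 0 or a negative errno, covering both failure styles.
 */
static int rename_status(const void *ret)
{
	if (!ret)
		return -EINVAL;         /* NULL: failed, cause not reported */
	if (IS_ERR(ret))
		return (int)PTR_ERR(ret); /* e.g. -ENODEV from the stub */
	return 0;
}
```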


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-30 Thread Matt Mackall
On Mon, Apr 30, 2007 at 01:26:24PM -0400, Theodore Tso wrote:
 On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
  chunkfs. The other is reverse maps (aka back pointers) for blocks ->
  inodes and inodes -> directories that obviate the need to have large
  amounts of memory to check for collisions.
 
 Yes, I missed the fact that you had back pointers for blocks as well
 as inodes.  So the block table in the tile header gets used for
 determining if a block is free, much like is done with FAT, right?

We could eliminate the block bitmap, but I don't think there's much
reason to. It improves allocator performance with negligible footprint
and improves redundancy.
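
The redundancy argument can be sketched as a per-tile cross-check between the bitmap and the back-pointer table. This is a sketch only: the tile size, the header layout, the "0 means no owner" convention and all names below are assumptions for illustration, not part of the TileFS proposal as quoted here.

```c
#include <stddef.h>
#include <stdint.h>

#define TILE_BLOCKS 64   /* hypothetical tile size, in blocks */
#define OWNER_FREE  0u   /* hypothetical "no owner" marker */

struct tile_header {
	uint32_t owner[TILE_BLOCKS];      /* back pointer: block -> inode */
	uint8_t  bitmap[TILE_BLOCKS / 8]; /* one bit per block, 1 = in use */
};

/* Allocator's view: is the block marked in use in the bitmap? */
static int bitmap_in_use(const struct tile_header *t, size_t blk)
{
	return (t->bitmap[blk / 8] >> (blk % 8)) & 1;
}

/* FAT-style view: a block is free if no inode claims it. */
static int block_is_free(const struct tile_header *t, size_t blk)
{
	return t->owner[blk] == OWNER_FREE;
}

/* Count blocks where the bitmap and the back-pointer table disagree. */
static size_t tile_mismatches(const struct tile_header *t)
{
	size_t bad = 0;

	for (size_t blk = 0; blk < TILE_BLOCKS; blk++) {
		/* "in use" and "free" must never be true, or false, together. */
		if (bitmap_in_use(t, blk) == block_is_free(t, blk))
			bad++;
	}
	return bad;
}
```

A checker that walks one tile at a time needs only this header in memory, which is the property the reverse maps are meant to provide.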
 
 That's a clever system; I like it.  It does mean that there is a lot
 more metadata updates, but since you're not journaling, that should
 counter that effect to some extent.

I had actually envisioned this as working with or without a journal.
I suspect there are ways to keep the performance downside here low.

 IMHO, it's definitely worth a try to see how well it works!

I'm not much of an FS hacker and I've got a lot of other projects in
the air, but I may give it a shot. Any help on this front would be
appreciated.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-30 Thread Andreas Dilger
On Apr 19, 2007  11:54 +1000, David Chinner wrote:
  struct fiemap {
  __u64 fm_start; /* logical start offset of mapping (in/out) */
  __u64 fm_len;   /* logical length of mapping (in/out) */
  __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
  __u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
  __u64 fm_unused;
  struct fiemap_extent fm_extents[0];
  }
  
  /* flags for the fiemap request */
  #define FIEMAP_FLAG_SYNC      0x0001  /* flush delalloc data to disk */
  #define FIEMAP_FLAG_HSM_READ  0x0002  /* retrieve data from HSM */
  #define FIEMAP_FLAG_INCOMPAT  0xff00  /* must understand these flags */
 
 No flags in the INCOMPAT range - shouldn't it be 0x3 at this point?

This is actually for future use.  Any flags that are added into this range
must be understood by both sides or it should be considered an error.  Flags
outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported.
If it turns out that 8 bits is too small a range for INCOMPAT flags, then
we can make 0x0100 an incompat flag that means e.g. 0x00ff are also
incompat flags.
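
The rule described above (unknown bits inside the INCOMPAT range are an error, unknown bits outside it may be ignored) can be sketched as a small check. The constants are copied as they are printed in this archive and may not match the original posting digit for digit; fiemap_check_flags and its supported-mask parameter are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Values as printed in this archive; treat them as illustrative. */
#define FIEMAP_FLAG_SYNC      0x0001u
#define FIEMAP_FLAG_HSM_READ  0x0002u
#define FIEMAP_FLAG_INCOMPAT  0xff00u

/*
 * Hypothetical check a filesystem could run on a request.
 * 'supported' is the set of flags this implementation understands.
 * Returns 0 if the request may proceed, -EINVAL otherwise.
 */
static int fiemap_check_flags(uint32_t requested, uint32_t supported)
{
	uint32_t unknown = requested & ~supported;

	/* Unknown flags in the INCOMPAT range must be rejected. */
	if (unknown & FIEMAP_FLAG_INCOMPAT)
		return -EINVAL;

	/* Unknown flags outside that range are simply ignored. */
	return 0;
}
```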

I'm assuming that all flags that will be in the original FIEMAP proposal
will be understood by the implementations.  Most filesystems can safely
ignore FLAG_HSM_READ, for example, since they don't support HSM, and for
that matter FLAG_SYNC is probably moot for most filesystems also because
they do block allocation at preprw time.

 So, there's an HSM_READ flag above. If we are going to make this interface
 useful for filesystems that have HSMs interacting with their extents, the
 HSM needs to be able to query whether the extent is online (on disk), 
 has been migrated offline (on tape) or in dual-state (i.e. both online and
 offline).

Hmm, I'd thought offline would migrate to EXTENT_UNKNOWN, but I didn't
consider files that are both on disk and on secondary storage (which is
no longer just tape anymore).  I thought I'd call this FIEMAP_EXTENT_OFFLINE,
but that has a confusing connotation that the extent is inaccessible, instead
of just saying it is also on offline storage.  What about
FIEMAP_EXTENT_SECONDARY?  Other proposals welcome.

FIEMAP_EXTENT_SECONDARY could still be set even if the file is mapped.
That would mean an offline-only file would be EXTENT_SECONDARY|EXTENT_UNKNOWN,
while a dual-location file would be EXTENT_SECONDARY only.
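
To make the proposed encoding concrete, here is a sketch of how an application might decode it. The bit values are hypothetical (none are assigned in this thread), and the enum and function names are invented for the example.

```c
#include <stdint.h>

/* Hypothetical bit values; the thread does not assign any. */
#define FIEMAP_EXTENT_UNKNOWN    0x1u
#define FIEMAP_EXTENT_SECONDARY  0x2u

enum residency {
	RES_DISK_ONLY,     /* mapped on disk, no copy on secondary storage */
	RES_DUAL,          /* mapped on disk and also on secondary storage */
	RES_OFFLINE_ONLY,  /* only on secondary storage, location unknown */
	RES_UNMAPPED       /* unknown location for some other reason */
};

/* Decode the SECONDARY/UNKNOWN combination described above. */
static enum residency extent_residency(uint32_t fe_flags)
{
	int secondary = (fe_flags & FIEMAP_EXTENT_SECONDARY) != 0;
	int unknown = (fe_flags & FIEMAP_EXTENT_UNKNOWN) != 0;

	if (secondary)
		return unknown ? RES_OFFLINE_ONLY : RES_DUAL;
	return unknown ? RES_UNMAPPED : RES_DISK_ONLY;
}
```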


  SUMMARY OF CHANGES
  ==
  - add separate fe_flags word with flags from various suggestions:
- FIEMAP_EXTENT_HOLE = extent has no space allocation
- FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data
- FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
  (e.g. HSM, delalloc awaiting sync, etc)
 
 I'd like an explicit delalloc flag, not lumping it in with unknown.
 we *know* the extent is delalloc ;)

Sure, FIEMAP_EXTENT_DELALLOC is fine.  It is mostly redundant with
EXTENT_UNKNOWN (and I think it only makes sense if DELALLOC is given in
addition to UNKNOWN).  I'd like to keep a generic UNKNOWN flag that can
be used by applications that don't really care about why it is unmapped
and in case there are other reasons in the future that an extent might
be unmapped (e.g. fsck or storage layer reporting corruption or loss of
that part of the file).

   chook 681% xfs_bmap -vv fred
   fred:
    EXT: FILE-OFFSET  BLOCK-RANGE          AG AG-OFFSET          TOTAL FLAGS
      0: [0..151]:    288444888..288445039  8 (1696536..1696687)   152 00010
    FLAG Values:
       010000 Unwritten preallocated extent
       001000 Doesn't begin on stripe unit
       000100 Doesn't end   on stripe unit
       000010 Doesn't begin on stripe width
       000001 Doesn't end   on stripe width
  
  Can you clarify the terminology here?  What is a stripe unit and what is
  a stripe width? 
 
 Stripe unit is equivalent of the chunk size in an MD RAID. It's the amount
 of data that is written to each lun in a stripe before moving onto the
 next stripe element.
 
  Are there N * stripe_unit = stripe_width in e.g. a
  RAID 5 (N+1) array, or N-disk RAID 0?  Maybe vice versa?
 
 Yes, on simple configurations. In more complex HW RAID
 configurations, we'll typically set the stripe unit to the width of
 the RAID5 lun (N * segment size) and the stripe width to the number
 of luns we've striped across.
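
For the simple configurations mentioned above, the alignment tests behind those four flags can be sketched as follows. This is an illustration under assumptions: the octal bit values are the ones xfs_bmap -vv prints in its "FLAG Values" legend, all quantities are taken to be in the same unit (for example 512-byte blocks), and the function name align_flags is hypothetical.

```c
#include <stdint.h>

/* Octal bit values as listed in the xfs_bmap -vv legend. */
#define FLG_BSU 001000  /* doesn't begin on stripe unit */
#define FLG_ESU 000100  /* doesn't end on stripe unit */
#define FLG_BSW 000010  /* doesn't begin on stripe width */
#define FLG_ESW 000001  /* doesn't end on stripe width */

/*
 * Compute the alignment flags for an extent of 'len' blocks starting at
 * physical block 'start'. 'sunit' and 'swidth' are the stripe unit and
 * stripe width; in the simple case swidth == N * sunit.
 * Returns 0 when no stripe geometry is configured.
 */
static unsigned int align_flags(uint64_t start, uint64_t len,
				uint64_t sunit, uint64_t swidth)
{
	unsigned int flags = 0;
	uint64_t end = start + len;  /* first block past the extent */

	if (sunit == 0 || swidth == 0)
		return 0;

	if (start % sunit)
		flags |= FLG_BSU;
	if (end % sunit)
		flags |= FLG_ESU;
	if (start % swidth)
		flags |= FLG_BSW;
	if (end % swidth)
		flags |= FLG_ESW;
	return flags;
}
```

For example, with a stripe unit of 16 and a stripe width of 64 (N = 4), an extent covering blocks 64..127 is fully aligned and gets no flags.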

Can you propose reasonable flag names for these (I can't think of anything
very good) and a clear explanation of what they mean.  I suspect it will
only be XFS that uses them initially.  In mke2fs and ext4+mballoc there is
the concept of stripe unit and stripe width, but as yet they are not
communicated between the two very well.  I'd be much happier if this info
could be queried in a standard way from the block layer instead of the
user having to specify it and the filesystem having to track it.

   Ok, so the only way you can determine where you are 

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-30 Thread David Chinner
On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
 On Apr 19, 2007  11:54 +1000, David Chinner wrote:
   struct fiemap {
 __u64 fm_start; /* logical start offset of mapping (in/out) */
 __u64 fm_len;   /* logical length of mapping (in/out) */
 __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
 __u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
 __u64 fm_unused;
 struct fiemap_extent fm_extents[0];
   }
   
   /* flags for the fiemap request */
   #define FIEMAP_FLAG_SYNC      0x0001  /* flush delalloc data to disk */
   #define FIEMAP_FLAG_HSM_READ  0x0002  /* retrieve data from HSM */
   #define FIEMAP_FLAG_INCOMPAT  0xff00  /* must understand these flags */
  
  No flags in the INCOMPAT range - shouldn't it be 0x3 at this point?
 
 This is actually for future use.  Any flags that are added into this range
 must be understood by both sides or it should be considered an error.  Flags
 outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported.
 If it turns out that 8 bits is too small a range for INCOMPAT flags, then
 we can make 0x0100 an incompat flag that means e.g. 0x00ff are also
 incompat flags also.

Ah, ok. So it's not really a set of compatibility flags,
it's more a compulsory set. Under those terms, I don't really
see why this is necessary - either the filesystem will understand
the flags or it will return EINVAL or ignore them...

 I'm assuming that all flags that will be in the original FIEMAP proposal
 will be understood by the implementations.  Most filesystems can safely
 ignore FLAG_HSM_READ, for example, since they don't support HSM, and for
 that matter FLAG_SYNC is probably moot for most filesystems also because
 they do block allocation at preprw time.

Exactly my point - so why do we really need to encode a compulsory set
of flags in the API? 

  So, there's an HSM_READ flag above. If we are going to make this interface
  useful for filesystems that have HSMs interacting with their extents, the
  HSM needs to be able to query whether the extent is online (on disk), 
  has been migrated offline (on tape) or in dual-state (i.e. both online and
  offline).
 
 Hmm, I'd thought offline would migrate to EXTENT_UNKNOWN, but I didn't

I disagree - why would you want to indicate the state is unknown when we know
very well that it is offline?

 consider files that are both on disk and on secondary storage (which is
 no longer just tape anymore).  I thought I'd call this FIEMAP_EXTENT_OFFLINE,
 but that has a confusing connotation that the extent is inaccessible, instead
 of just saying it is also on offline storage.  What about
 FIEMAP_EXTENT_SECONDARY?  Other proposals welcome.

Effectively, when your extent is offline in the HSM, it is inaccessible, and
you have to bring it back from tape so it becomes accessible again. i.e. some
action is necessary on behalf of the user to make it accessible. So I think
that OFFLINE is a good name for this state because it really is inaccessible.

Also, I don't think secondary is a good term because most large systems
have more than one tier of storage. One possibility is HSM_RESIDENT,
which indicates the extent is current and resident with a HSM's archive.

 FIEMAP_EXTENT_SECONDARY could still be set even if the file is mapped.
 That would mean an offline-only file would be EXTENT_SECONDARY|EXTENT_UNKNOWN,
 while a dual-location file would be EXTENT_SECONDARY only.

I much prefer OFFLINE|HSM_RESIDENT and HSM_RESIDENT as it is far more
descriptive as to what the state is (which certainly isn't unknown).

   SUMMARY OF CHANGES
   ==
   - add separate fe_flags word with flags from various suggestions:
 - FIEMAP_EXTENT_HOLE = extent has no space allocation
 - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data
 - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
   (e.g. HSM, delalloc awaiting sync, etc)
  
  I'd like an explicit delalloc flag, not lumping it in with unknown.
  we *know* the extent is delalloc ;)
 
 Sure, FIEMAP_EXTENT_DELALLOC is fine.  It is mostly redundant with
 EXTENT_UNKNOWN (and I think it only makes sense if DELALLOC is given in
 addition to UNKNOWN).

I disagree that it is redundant - in case you hadn't already noticed I
dislike the idea of unknown meaning one of several possible known
states ;)

 I'd like to keep a generic UNKNOWN flag that can
 be used by applications that don't really care about why it is unmapped
 and in case there are other reasons in the future that an extent might
 be unmapped (e.g. fsck or storage layer reporting corruption or loss of
 that part of the file).

Sure.

chook 681% xfs_bmap -vv fred
fred:
 EXT: FILE-OFFSET  BLOCK-RANGE          AG AG-OFFSET          TOTAL FLAGS
   0: [0..151]:    288444888..288445039  8 (1696536..1696687)   152 00010
 FLAG