Re: [PATCH 0/5] fallocate system call
On Mon, Apr 30, 2007 at 03:56:32PM +1000, David Chinner wrote:
On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:

IIRC, the argument for FA_ALLOCATE changing file size is that posix_fallocate() is supposed to change the file size.

But it's not posix_fallocate; it's something more generic. glibc can do posix_fallocate using truncate + fallocate.

Note that the way XFS implements growing the file size after the allocation is via a truncate

What's wrong with that? That seems very reasonable.

That's would what I did because otherwise you'd use ftruncate64(). Without documented behaviour or an ext4 implementation, I have to ask what it's supposed to do, though ;)

How many *real* users are there for ext4? Why does 'what ext4 does' define 'the semantics'? Surely semantics should be decided either by precedent (if there is an existing relevant userbase) or sensible thought and some debate?

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
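For readers following the "truncate + fallocate" argument, here is a minimal userspace sketch of how a posix_fallocate()-like call could be layered on a generic allocation call that does not change the file size. It uses the fallocate(2) wrapper and FALLOC_FL_KEEP_SIZE flag found in current glibc, which is an assumption relative to the FA_ALLOCATE interface under discussion in this thread; the helper name emulate_posix_fallocate() is hypothetical.

```c
#define _GNU_SOURCE          /* exposes fallocate() and FALLOC_FL_* in <fcntl.h> */
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: posix_fallocate()-like behaviour built from
 * "allocate without touching the file size" plus an explicit truncate.
 * Returns 0 on success or an errno value, as posix_fallocate() does.
 */
static int emulate_posix_fallocate(int fd, off_t offset, off_t len)
{
	struct stat st;

	/* Step 1: reserve blocks; FALLOC_FL_KEEP_SIZE leaves the size alone. */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) != 0)
		return errno;

	/* Step 2: grow the file only if the range extends past EOF. */
	if (fstat(fd, &st) != 0)
		return errno;
	if (st.st_size < offset + len && ftruncate(fd, offset + len) != 0)
		return errno;

	return 0;
}
```

The point of the split is that the size change stays a userspace policy decision: a caller that wants preallocation beyond EOF simply skips step 2.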
Re: Ext2/3 block remapping tool
On Fri 27-04-07 12:09:42, Andreas Dilger wrote:
On Apr 26, 2007 21:29 +0200, Jan Kara wrote:

I've been lately playing with remapping ext2/ext3 blocks (especially how much it can give us in terms of speed of things like KDE start). For that I've written two simple tools (you can get them from ftp.suse.com/pub/people/jack/ext3remapper.tar.gz):

  e2block2file   to transform (preparsed) output from blktrace into a list of accessed files and offsets accessed
  e2remapblocks  to use output from e2block2file and remap blocks into big chunks in the order in which they were accessed.

Does it map the whole file contiguously, or does it interleave blocks of the file in the order they are accessed? I would hope that it maps the whole file contiguously, and let readahead work properly to fetch the whole file. Also, keeping the file contiguous avoids fragmentation later if that file is updated, deleted, etc, and conflicts with allocator/defrag/etc.

No, it does interleave blocks of different files. Reading the whole file is exactly what you often don't want. During startup KDE (which was my benchmark) accesses basically two things: shared libraries and config files / icons. Config files and icons usually fit into a single block so just mapping them in the right order close together is fine. On the other hand, shared libraries are large and you usually need just a few blocks scattered all over them. So here we just remap those few blocks we need...

I see the downsides of this approach. If the file is rewritten, you lose the tight packing, but this is not going to happen often. I'm more seriously concerned about the possibility that this optimization of startup time may hurt running performance or, more probably, the performance of other apps...

(see README in the tools archive for more details)

So far the tools (especially e2remapblocks ;) work on unmounted filesystem. The ultimate goal is to be able to do similar things for mounted filesystems but I wanted to see whether block remapping is worth it and what kernel interfaces would be useful for achieving the goal.

I'd prefer that such functionality be integrated with Takashi's online defrag tool, since it needs virtually the same functionality. For that

Yes, definitely these two have quite similar needs and I'd like to have just one tool in the end.

matter, this is also very similar to the block-mapped -> extents tool from Aneesh. It doesn't make sense to have so many separate tools for users, especially if they start interfering with each other (i.e. defrag undoes the remapping done by your tool).

Agreed.

Honza
--
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
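The "remap blocks into big chunks in the order in which they were accessed" idea can be sketched as a planning step over an access trace. This is only an illustration of the ordering policy described above, not the e2remapblocks code: the struct names, plan_remap(), and the chunk_start parameter are all hypothetical, and real tools must also find free space and update block pointers.

```c
#include <stddef.h>
#include <stdint.h>

struct access {            /* one trace record: which file block was read */
	uint32_t ino;      /* inode number */
	uint32_t lblk;     /* logical block within the file */
};

struct remap {             /* planned placement for one file block */
	struct access what;
	uint64_t new_pblk; /* target physical block */
};

/*
 * Assign consecutive physical blocks, starting at chunk_start, to file
 * blocks in first-access order. Blocks of different files are interleaved
 * exactly as the trace interleaves them; repeat accesses get no new slot.
 * Returns the number of plan entries written (at most cap).
 */
static size_t plan_remap(const struct access *trace, size_t n,
			 uint64_t chunk_start, struct remap *plan, size_t cap)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		/* Skip blocks that already have a slot (repeat accesses). */
		for (j = 0; j < used; j++)
			if (plan[j].what.ino == trace[i].ino &&
			    plan[j].what.lblk == trace[i].lblk)
				break;
		if (j < used)
			continue;
		if (used == cap)
			break;          /* plan buffer full */
		plan[used].what = trace[i];
		plan[used].new_pblk = chunk_start + used;
		used++;
	}
	return used;
}
```

With a trace that touches a few scattered blocks of a shared library between two config files, the plan places exactly those blocks next to each other, which is the trade-off against whole-file contiguity raised in the quoted question.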
Re: Ext2/3 block remapping tool
On Fri, Apr 27, 2007 at 12:09:42PM -0600, Andreas Dilger wrote:

I'd prefer that such functionality be integrated with Takashi's online defrag tool, since it needs virtually the same functionality. For that matter, this is also very similar to the block-mapped -> extents tool from Aneesh. It doesn't make sense to have so many separate tools for users, especially if they start interfering with each other (i.e. defrag undoes the remapping done by your tool).

Yep, in fact, I'm really glad that Jan is working on the remapping tool because if the on-line defrag kernel interfaces don't have the right support for it, then that means we need to fix the on-line defrag patches. :-)

While we're at it, someone want to start thinking about on-line shrinking of ext4 filesystems? Again, the same block remapping interfaces for defrag and file access optimizations should also be useful for shrinking filesystems (even if some of the files that need to be relocated are being actively used). If not, that probably means we got the interface wrong.

- Ted
Re: Ext2/3 block remapping tool
On Mon 30-04-07 08:09:30, Theodore Tso wrote:
On Fri, Apr 27, 2007 at 12:09:42PM -0600, Andreas Dilger wrote:

I'd prefer that such functionality be integrated with Takashi's online defrag tool, since it needs virtually the same functionality. For that matter, this is also very similar to the block-mapped -> extents tool from Aneesh. It doesn't make sense to have so many separate tools for users, especially if they start interfering with each other (i.e. defrag undoes the remapping done by your tool).

Yep, in fact, I'm really glad that Jan is working on the remapping tool because if the on-line defrag kernel interfaces don't have the right support for it, then that means we need to fix the on-line defrag patches. :-)

;-) Exactly that was the reason why I wrote the userspace program - so that I have something in hand when we start discussing what the kernel interface will look like.

While we're at it, someone want to start thinking about on-line shrinking of ext4 filesystems? Again, the same block remapping interfaces for defrag and file access optimizations should also be useful for shrinking filesystems (even if some of the files that need to be relocated are being actively used). If not, that probably means we got the interface wrong.

Yes, that's a good idea. Currently it seems to me that block+inode relocation (which we also need for defrag) would be enough to support filesystem shrinking. Actually, in some ancient times (like 6-7 years ago) I had written ext2 online filesystem shrinking. Currently, the patch is probably unusably obsolete but I can still dig it out and see what functions I needed at that time.

Honza
--
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
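To illustrate why block relocation is the core primitive for shrinking, here is a toy planning sketch: every in-use block beyond the new end of the filesystem needs a free target block below it. This is a simplified model under stated assumptions (a flat in-memory "used" array, no inodes, no block groups); struct move and plan_shrink() are hypothetical names and not taken from any patch in this thread.

```c
#include <stddef.h>

struct move {
	size_t from;   /* block beyond the new end of the filesystem */
	size_t to;     /* free block inside the surviving region */
};

/*
 * used[b] is nonzero when block b is allocated. Plans one move for every
 * used block in [new_nblocks, nblocks). Returns the number of moves, or
 * -1 if the surviving region has too few free blocks or cap is too small.
 */
static long plan_shrink(const unsigned char *used, size_t nblocks,
			size_t new_nblocks, struct move *moves, size_t cap)
{
	size_t free_cursor = 0;
	long count = 0;

	for (size_t b = new_nblocks; b < nblocks; b++) {
		if (!used[b])
			continue;
		/* Find the next free block inside the surviving region. */
		while (free_cursor < new_nblocks && used[free_cursor])
			free_cursor++;
		if (free_cursor == new_nblocks || (size_t)count == cap)
			return -1;
		moves[count].from = b;
		moves[count].to = free_cursor++;   /* slot is now taken */
		count++;
	}
	return count;
}
```

Executing each planned move is the same "relocate this block of this file" operation that defrag and access-order remapping need, which is the interface-sharing argument made above.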
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:

chunkfs. The other is reverse maps (aka back pointers) for blocks -> inodes and inodes -> directories that obviate the need to have large amounts of memory to check for collisions.

Yes, I missed the fact that you had back pointers for blocks as well as inodes. So the block table in the tile header gets used for determining if a block is free, much like is done with FAT, right?

That's a clever system; I like it. It does mean that there are a lot more metadata updates, but since you're not journaling, that should counter that effect to some extent.

IMHO, it's definitely worth a try to see how well it works!

- Ted
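A rough sketch of the block-table idea as described here: each tile header carries one back pointer per block, so "is this block free?" and "does this block really belong to that inode?" are both local lookups. The layout below is purely hypothetical (TileFS has no published on-disk format in this thread); the tile size, struct tile_header, and the convention that inode 0 means "free" are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define TILE_BLOCKS 8          /* hypothetical; a real tile would be far larger */

struct tile_header {
	uint32_t owner[TILE_BLOCKS];  /* back pointer: owning inode, 0 = free */
};

/* FAT-style free check: a block with no owner is free. */
static bool tile_block_is_free(const struct tile_header *t, unsigned int blk)
{
	return blk < TILE_BLOCKS && t->owner[blk] == 0;
}

/*
 * fsck-style check: does the block agree that inode ino owns it?
 * A mismatch flags a collision without any global in-memory map.
 */
static bool tile_block_owned_by(const struct tile_header *t,
				unsigned int blk, uint32_t ino)
{
	return blk < TILE_BLOCKS && ino != 0 && t->owner[blk] == ino;
}
```

The extra metadata updates mentioned above show up here as a write to owner[] on every allocation and free.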
[PATCH] Implement renaming for debugfs
Hello,

attached patch implements renaming for debugfs. I was asked for this feature by WLAN guys and I guess it makes sense (they have some debug info in the directory identified by interface name and that can change...). Could someone have a look at what I wrote whether it looks reasonable? Thanks.

Honza
--
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs

Implement debugfs_rename() to allow renaming files/directories in debugfs.

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/fs/debugfs/inode.c linux-2.6.21-rc6-1-debugfs_rename/fs/debugfs/inode.c
--- linux-2.6.21-rc6/fs/debugfs/inode.c 2007-04-10 17:09:55.0 +0200
+++ linux-2.6.21-rc6-1-debugfs_rename/fs/debugfs/inode.c 2007-04-30 19:29:32.0 +0200
@@ -368,6 +368,72 @@ void debugfs_remove(struct dentry *dentr
 }
 EXPORT_SYMBOL_GPL(debugfs_remove);
 
+/**
+ * debugfs_rename - rename a file/directory in the debugfs filesystem
+ * @old_dir: a pointer to the parent dentry for the renamed object. This
+ *          should be a directory dentry.
+ * @old_dentry: dentry of an object to be renamed.
+ * @new_dir: a pointer to the parent dentry where the object should be
+ *          moved. This should be a directory dentry.
+ * @new_name: a pointer to a string containing the target name.
+ *
+ * This function renames a file/directory in debugfs. The target must not
+ * exist for rename to succeed.
+ *
+ * This function will return a pointer to a dentry if it succeeds. If an
+ * error occurs, %NULL will be returned.
+ *
+ * If debugfs is not enabled in the kernel, the value -%ENODEV will be
+ * returned.
+ */
+struct dentry *debugfs_rename(struct dentry *old_dir, struct dentry *old_dentry,
+		struct dentry *new_dir, char *new_name)
+{
+	int error;
+	struct dentry *dentry = NULL, *trap;
+	const char *old_name;
+
+	error = simple_pin_fs(&debug_fs_type, &debugfs_mount,
+			      &debugfs_mount_count);
+	if (error)
+		return NULL;
+	trap = lock_rename(new_dir, old_dir);
+	/* Source or destination directories don't exist? */
+	if (!old_dir->d_inode || !new_dir->d_inode)
+		goto exit;
+	/* Source does not exist, cyclic rename, or mountpoint? */
+	if (!old_dentry->d_inode || old_dentry == trap ||
+	    d_mountpoint(old_dentry))
+		goto exit;
+	dentry = lookup_one_len(new_name, new_dir, strlen(new_name));
+	/* Lookup failed, cyclic rename or target exists? */
+	if (IS_ERR(dentry) || dentry == trap || dentry->d_inode)
+		goto exit;
+
+	old_name = fsnotify_oldname_init(old_dentry->d_name.name);
+
+	error = simple_rename(old_dir->d_inode, old_dentry, new_dir->d_inode,
+		dentry);
+	if (error) {
+		fsnotify_oldname_free(old_name);
+		goto exit;
+	}
+	d_move(old_dentry, dentry);
+	fsnotify_move(old_dir->d_inode, new_dir->d_inode, old_name,
+		old_dentry->d_name.name, S_ISDIR(dentry->d_inode->i_mode),
+		dentry->d_inode, old_dentry->d_inode);
+	fsnotify_oldname_free(old_name);
+	unlock_rename(new_dir, old_dir);
+	return dentry;
+exit:
+	if (dentry && !IS_ERR(dentry))
+		dput(dentry);
+	unlock_rename(new_dir, old_dir);
+	simple_release_fs(&debugfs_mount, &debugfs_mount_count);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(debugfs_rename);
+
 static decl_subsys(debug, NULL, NULL);
 
 static int __init debugfs_init(void)
diff -rupX /home/jack/.kerndiffexclude linux-2.6.21-rc6/include/linux/debugfs.h linux-2.6.21-rc6-1-debugfs_rename/include/linux/debugfs.h
--- linux-2.6.21-rc6/include/linux/debugfs.h 2007-04-10 17:09:58.0 +0200
+++ linux-2.6.21-rc6-1-debugfs_rename/include/linux/debugfs.h 2007-04-30 19:45:54.0 +0200
@@ -38,6 +38,9 @@ struct dentry *debugfs_create_symlink(co
 
 void debugfs_remove(struct dentry *dentry);
 
+struct dentry *debugfs_rename(struct dentry *old_dir, struct dentry *old_dentry,
+		struct dentry *new_dir, char *new_name);
+
 struct dentry *debugfs_create_u8(const char *name, mode_t mode,
 				 struct dentry *parent, u8 *value);
 struct dentry *debugfs_create_u16(const char *name, mode_t mode,
@@ -83,6 +86,12 @@ static inline struct dentry *debugfs_cre
 static inline void debugfs_remove(struct dentry *dentry)
 { }
 
+static inline struct dentry *debugfs_rename(struct dentry *old_dir, struct dentry *old_dentry,
+		struct dentry *new_dir, char *new_name)
+{
+	return ERR_PTR(-ENODEV);
+}
+
 static inline struct dentry *debugfs_create_u8(const char *name, mode_t mode,
 					       struct dentry *parent,
 					       u8 *value)
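One consequence of the return convention in this patch is that a caller sees NULL from the real implementation but ERR_PTR(-ENODEV) from the stub used when debugfs is disabled, so both have to be checked. The sketch below models that caller-side check in userspace; ERR_PTR(), PTR_ERR() and IS_ERR() are stand-ins written to mirror the kernel's <linux/err.h> helpers, and rename_result_to_errno() and its choice of -EINVAL for the NULL case are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins modelled on the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/*
 * Hypothetical caller-side helper: fold both failure conventions of
 * debugfs_rename() as posted (NULL, or ERR_PTR from the stub) into one
 * negative errno. Returns 0 for a usable dentry pointer.
 */
static int rename_result_to_errno(const void *dentry)
{
	if (!dentry)
		return -EINVAL;   /* assumption: NULL carries no errno in the patch */
	if (IS_ERR(dentry))
		return (int)PTR_ERR(dentry);
	return 0;
}
```

A driver renaming its per-interface directory would pass the result of debugfs_rename() through a check like this before storing the new dentry.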
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Mon, Apr 30, 2007 at 01:26:24PM -0400, Theodore Tso wrote:
On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:

chunkfs. The other is reverse maps (aka back pointers) for blocks -> inodes and inodes -> directories that obviate the need to have large amounts of memory to check for collisions.

Yes, I missed the fact that you had back pointers for blocks as well as inodes. So the block table in the tile header gets used for determining if a block is free, much like is done with FAT, right?

We could eliminate the block bitmap, but I don't think there's much reason to. It improves allocator performance with negligible footprint and improves redundancy.

That's a clever system; I like it. It does mean that there are a lot more metadata updates, but since you're not journaling, that should counter that effect to some extent.

I had actually envisioned this as working with or without a journal. I suspect there are ways to keep the performance downside here low.

IMHO, it's definitely worth a try to see how well it works!

I'm not much of an FS hacker and I've got a lot of other projects in the air, but I may give it a shot. Any help on this front would be appreciated.

--
Mathematics is the supreme nostalgia of our time.
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Apr 19, 2007 11:54 +1000, David Chinner wrote:

struct fiemap {
	__u64 fm_start;        /* logical start offset of mapping (in/out) */
	__u64 fm_len;          /* logical length of mapping (in/out) */
	__u32 fm_flags;        /* FIEMAP_FLAG_* flags for request (in/out) */
	__u32 fm_extent_count; /* number of extents in fm_extents (in/out) */
	__u64 fm_unused;
	struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC     0x0001 /* flush delalloc data to disk */
#define FIEMAP_FLAG_HSM_READ 0x0002 /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT 0xff00 /* must understand these flags */

No flags in the INCOMPAT range - shouldn't it be 0x3 at this point?

This is actually for future use. Any flags that are added into this range must be understood by both sides or it should be considered an error. Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported. If it turns out that 8 bits is too small a range for INCOMPAT flags, then we can make 0x0100 an incompat flag that means e.g. 0x00ff are also incompat flags.

I'm assuming that all flags that will be in the original FIEMAP proposal will be understood by the implementations. Most filesystems can safely ignore FLAG_HSM_READ, for example, since they don't support HSM, and for that matter FLAG_SYNC is probably moot for most filesystems also because they do block allocation at preprw time.

So, there's a HSM_READ flag above. If we are going to make this interface useful for filesystems that have HSMs interacting with their extents, the HSM needs to be able to query whether the extent is online (on disk), has been migrated offline (on tape) or in dual-state (i.e. both online and offline).

Hmm, I'd thought offline would migrate to EXTENT_UNKNOWN, but I didn't consider files that are both on disk and on secondary storage (which is no longer just tape anymore). I thought I'd call this FIEMAP_EXTENT_OFFLINE, but that has a confusing connotation that the extent is inaccessible, instead of just saying it is also on offline storage. What about FIEMAP_EXTENT_SECONDARY? Other proposals welcome.

FIEMAP_EXTENT_SECONDARY could still be set even if the file is mapped. That would mean an offline-only file would be EXTENT_SECONDARY|EXTENT_UNKNOWN, while a dual-location file would be EXTENT_SECONDARY only.

SUMMARY OF CHANGES
==================
- add separate fe_flags word with flags from various suggestions:
  - FIEMAP_EXTENT_HOLE = extent has no space allocation
  - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data
  - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown (e.g. HSM, delalloc awaiting sync, etc)

I'd like an explicit delalloc flag, not lumping it in with unknown. we *know* the extent is delalloc ;)

Sure, FIEMAP_EXTENT_DELALLOC is fine. It is mostly redundant with EXTENT_UNKNOWN (and I think it only makes sense if DELALLOC is given in addition to UNKNOWN).

I'd like to keep a generic UNKNOWN flag that can be used by applications that don't really care about why it is unmapped and in case there are other reasons in the future that an extent might be unmapped (e.g. fsck or storage layer reporting corruption or loss of that part of the file).

chook 681% xfs_bmap -vv fred
fred:
 EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
   0: [0..151]: 288444888..288445039 8 (1696536..1696687) 152 00010
 FLAG Values:
    01 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end on stripe unit
    10 Doesn't begin on stripe width
    01 Doesn't end on stripe width

Can you clarify the terminology here? What is a stripe unit and what is a stripe width?

Stripe unit is equivalent of the chunk size in an MD RAID. It's the amount of data that is written to each lun in a stripe before moving onto the next stripe element.

Are there N * stripe_unit = stripe_width in e.g. a RAID 5 (N+1) array, or N-disk RAID 0? Maybe vice versa?

Yes, on simple configurations. In more complex HW RAID configurations, we'll typically set the stripe unit to the width of the RAID5 lun (N * segment size) and the stripe width to the number of luns we've striped across.

Can you propose reasonable flag names for these (I can't think of anything very good) and a clear explanation of what they mean. I suspect it will only be XFS that uses them initially. In mke2fs and ext4+mballoc there is the concept of stripe unit and stripe width, but as yet they are not communicated between the two very well. I'd be much happier if this info could be queried in a standard way from the block layer instead of the user having to specify it and the filesystem having to track it.

Ok, so the only way you can determine where you are
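The INCOMPAT rule described in this message (unknown flags inside the mask are an error, unknown flags outside it may be ignored) can be sketched as a small validation function. The flag values are copied as they appear in the quoted text, which the archive may have truncated, so treat them as placeholders; fiemap_check_flags() and its supported parameter are hypothetical names.

```c
#include <errno.h>
#include <stdint.h>

/* Values as printed in the quoted message; possibly truncated by the archive. */
#define FIEMAP_FLAG_SYNC      0x0001u /* flush delalloc data to disk */
#define FIEMAP_FLAG_HSM_READ  0x0002u /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT  0xff00u /* must understand these flags */

/*
 * Hypothetical filesystem-side check. supported is the set of flags this
 * implementation understands. A request fails only when it carries a flag
 * that is both unknown and inside the INCOMPAT range.
 */
static int fiemap_check_flags(uint32_t requested, uint32_t supported)
{
	uint32_t unknown = requested & ~supported;

	if (unknown & FIEMAP_FLAG_INCOMPAT)
		return -EINVAL;
	return 0;   /* remaining unknown flags are safe to ignore */
}
```

Under this rule a filesystem without HSM support accepts a request carrying FIEMAP_FLAG_HSM_READ and simply ignores it, matching the "can safely ignore" remark above.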
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
On Apr 19, 2007 11:54 +1000, David Chinner wrote:

struct fiemap {
	__u64 fm_start;        /* logical start offset of mapping (in/out) */
	__u64 fm_len;          /* logical length of mapping (in/out) */
	__u32 fm_flags;        /* FIEMAP_FLAG_* flags for request (in/out) */
	__u32 fm_extent_count; /* number of extents in fm_extents (in/out) */
	__u64 fm_unused;
	struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC     0x0001 /* flush delalloc data to disk */
#define FIEMAP_FLAG_HSM_READ 0x0002 /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT 0xff00 /* must understand these flags */

No flags in the INCOMPAT range - shouldn't it be 0x3 at this point?

This is actually for future use. Any flags that are added into this range must be understood by both sides or it should be considered an error. Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported. If it turns out that 8 bits is too small a range for INCOMPAT flags, then we can make 0x0100 an incompat flag that means e.g. 0x00ff are also incompat flags.

Ah, ok. So it's not really a set of compatibility flags, it's more a compulsory set. Under those terms, I don't really see why this is necessary - either the filesystem will understand the flags or it will return EINVAL or ignore them...

I'm assuming that all flags that will be in the original FIEMAP proposal will be understood by the implementations. Most filesystems can safely ignore FLAG_HSM_READ, for example, since they don't support HSM, and for that matter FLAG_SYNC is probably moot for most filesystems also because they do block allocation at preprw time.

Exactly my point - so why do we really need to encode a compulsory set of flags in the API?

So, there's a HSM_READ flag above. If we are going to make this interface useful for filesystems that have HSMs interacting with their extents, the HSM needs to be able to query whether the extent is online (on disk), has been migrated offline (on tape) or in dual-state (i.e. both online and offline).

Hmm, I'd thought offline would migrate to EXTENT_UNKNOWN, but I didn't

I disagree - why would you want to indicate the state is unknown when we know very well that it is offline?

consider files that are both on disk and on secondary storage (which is no longer just tape anymore). I thought I'd call this FIEMAP_EXTENT_OFFLINE, but that has a confusing connotation that the extent is inaccessible, instead of just saying it is also on offline storage. What about FIEMAP_EXTENT_SECONDARY? Other proposals welcome.

Effectively, when your extent is offline in the HSM, it is inaccessible, and you have to bring it back from tape so it becomes accessible again. i.e. some action is necessary on behalf of the user to make it accessible. So I think that OFFLINE is a good name for this state because it really is inaccessible.

Also, I don't think secondary is a good term because most large systems have more than one tier of storage. One possibility is HSM_RESIDENT which indicates the extent is current and resident with a HSM's archive.

FIEMAP_EXTENT_SECONDARY could still be set even if the file is mapped. That would mean an offline-only file would be EXTENT_SECONDARY|EXTENT_UNKNOWN, while a dual-location file would be EXTENT_SECONDARY only.

I much prefer OFFLINE|HSM_RESIDENT and HSM_RESIDENT as it is far more descriptive as to what the state is (which certainly isn't unknown).

SUMMARY OF CHANGES
==================
- add separate fe_flags word with flags from various suggestions:
  - FIEMAP_EXTENT_HOLE = extent has no space allocation
  - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data
  - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown (e.g. HSM, delalloc awaiting sync, etc)

I'd like an explicit delalloc flag, not lumping it in with unknown. we *know* the extent is delalloc ;)

Sure, FIEMAP_EXTENT_DELALLOC is fine. It is mostly redundant with EXTENT_UNKNOWN (and I think it only makes sense if DELALLOC is given in addition to UNKNOWN).

I disagree that it is redundant - in case you hadn't already noticed I dislike the idea of unknown meaning one of several possible known states ;)

I'd like to keep a generic UNKNOWN flag that can be used by applications that don't really care about why it is unmapped and in case there are other reasons in the future that an extent might be unmapped (e.g. fsck or storage layer reporting corruption or loss of that part of the file).

Sure.

chook 681% xfs_bmap -vv fred
fred:
 EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
   0: [0..151]: 288444888..288445039 8 (1696536..1696687) 152 00010
 FLAG
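The OFFLINE / HSM_RESIDENT naming proposed in this reply maps onto three extent states, which a consumer could decode roughly as below. The flag names follow the proposal in the message, but the bit values, enum hsm_state, and classify_extent() are hypothetical, and treating OFFLINE without HSM_RESIDENT as inconsistent is an assumption not stated in the thread.

```c
#include <stdint.h>

/* Hypothetical bit assignments; the thread only discusses the names. */
#define FIEMAP_EXTENT_OFFLINE       0x1u /* data not accessible on disk */
#define FIEMAP_EXTENT_HSM_RESIDENT  0x2u /* current copy held in the HSM archive */

enum hsm_state {
	HSM_ONLINE_ONLY,   /* on disk, no archive copy */
	HSM_DUAL_STATE,    /* on disk and in the archive */
	HSM_OFFLINE_ONLY,  /* archive copy only; must be recalled before use */
	HSM_INVALID        /* assumption: OFFLINE without an archive copy */
};

static enum hsm_state classify_extent(uint32_t fe_flags)
{
	int offline = (fe_flags & FIEMAP_EXTENT_OFFLINE) != 0;
	int archived = (fe_flags & FIEMAP_EXTENT_HSM_RESIDENT) != 0;

	if (offline && archived)
		return HSM_OFFLINE_ONLY;
	if (archived)
		return HSM_DUAL_STATE;
	if (offline)
		return HSM_INVALID;
	return HSM_ONLINE_ONLY;
}
```

Compared with SECONDARY|UNKNOWN, each combination here names a known state, which is the objection to UNKNOWN raised above.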