Re: [RFC] ext3 freeze feature
On Sat, Jan 26, 2008 at 04:35:26PM +1100, David Chinner wrote: > On Fri, Jan 25, 2008 at 07:59:38PM +0900, Takashi Sato wrote: > > The points of the implementation are followings. > > - Add calls of the freeze function (freeze_bdev) and > > the unfreeze function (thaw_bdev) in ext3_ioctl(). > > > > - ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev) > > is registered to the delayed work queue to unfreeze the filesystem > > automatically after the lapse of the specified time. > > Seems like pointless complexity to me - what happens if a > timeout occurs while the filsystem is still freezing? > > It's not uncommon for a freeze to take minutes if memory > is full of dirty data that needs to be flushed out, esp. if > dm-snap is doing COWs for every write issued Sorry, ignore this bit - I just realised the timer is set up after the freeze has occurred Still, that makes it potentially dangerous to whatever is being done while the filesystem is frozen Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] ext3 freeze feature
On Fri, Jan 25, 2008 at 07:59:38PM +0900, Takashi Sato wrote: > The points of the implementation are followings. > - Add calls of the freeze function (freeze_bdev) and > the unfreeze function (thaw_bdev) in ext3_ioctl(). > > - ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev) > is registered to the delayed work queue to unfreeze the filesystem > automatically after the lapse of the specified time. Seems like pointless complexity to me - what happens if a timeout occurs while the filsystem is still freezing? It's not uncommon for a freeze to take minutes if memory is full of dirty data that needs to be flushed out, esp. if dm-snap is doing COWs for every write issued > + case EXT3_IOC_FREEZE: { > + if (inode->i_sb->s_frozen != SB_UNFROZEN) > + return -EINVAL; > + freeze_bdev(inode->i_sb->s_bdev); > + case EXT3_IOC_THAW: { > + if (!capable(CAP_SYS_ADMIN)) > + return -EPERM; > + if (inode->i_sb->s_frozen == SB_UNFROZEN) > + return -EINVAL; . > + /* Unfreeze */ > + thaw_bdev(inode->i_sb->s_bdev, inode->i_sb); That's inherently unsafe - you can have multiple unfreezes running in parallel which seriously screws with the bdev semaphore count that is used to lock the device due to doing multiple up()s for every down. Your timeout thingy guarantee that at some point you will get multiple up()s occuring due to the timer firing racing with a thaw ioctl. If this interface is to be more widely exported, then it needs a complete revamp of the bdev is locked while it is frozen so that there is no chance of a double up() ever occuring on the bd_mount_sem due to racing thaws. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] ext3 freeze feature
On Fri, Jan 25, 2008 at 09:42:30PM +0900, Takashi Sato wrote: > >I am also wondering whether we should have system call(s) for these: > > > >On Jan 25, 2008 12:59 PM, Takashi Sato <[EMAIL PROTECTED]> wrote: > >>+ case EXT3_IOC_FREEZE: { > > > >>+ case EXT3_IOC_THAW: { > > > >And just convert XFS to use them too? > > I think it is reasonable to implement it as the generic system call, as you > said. Does XFS folks think so? Sure. Note that we can't immediately remove the XFS ioctls otherwise we'd break userspace utilities that use them Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Parallelize IO for e2fsck
On Tue, Jan 22, 2008 at 12:05:11AM -0700, Andreas Dilger wrote: > On Jan 22, 2008 14:38 +1100, David Chinner wrote: > > On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote: > > > I discussed this with Ted at one point also. This is a generic problem, > > > not just for readahead, because "fsck" can run multiple e2fsck in parallel > > > and in case of many large filesystems on a single node this can cause > > > memory usage problems also. > > > > > > What I was proposing is that "fsck.{fstype}" be modified to return an > > > estimated minimum amount of memory needed, and some "desired" amount of > > > memory (i.e. readahead) to fsck the filesystem, using some parameter like > > > "fsck.{fstype} --report-memory-needed /dev/XXX". If this does not > > > return the output in the expected format, or returns an error then fsck > > > will assume some amount of memory based on the device size and continue > > > as it does today. > > > > And while fsck is running, some other program runs that uses > > memory and blows your carefully calculated paramters to smithereens? > > Well, fsck has a rather restricted working environment, because it is > run before most other processes start (i.e. single-user mode). For fsck > initiated by an admin in other runlevels the admin would need to specify > the upper limit of memory usage. My proposal was only for the single-user > fsck at boot time. The simple case. ;) Because XFS has shutdown features, it's not uncommon to hear about people running xfs_repair on an otherwise live system. e.g. XFS detects a corrupted block, shuts down the filesystem, the admin unmounts it, runs xfs_repair, puts it back online. meanwhile, all the other filesystems and users continue unaffected. In this use case, getting feedback about memory usage is, IMO, very worthwhile. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Parallelize IO for e2fsck
On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote: > On Jan 16, 2008 13:30 -0800, Valerie Henson wrote: > > I have a partial solution that sort of blindly manages the buffer > > cache. First, the user passes e2fsck a parameter saying how much > > memory is available as buffer cache. The readahead thread reads > > things in and immediately throws them away so they are only in buffer > > cache (no double-caching). Then readahead and e2fsck work together so > > that readahead only reads in new blocks when the main thread is done > > with earlier blocks. The already-used blocks get kicked out of buffer > > cache to make room for the new ones. > > > > What would be nice is to take into account the current total memory > > usage of the whole fsck process and factor that in. I don't think it > > would be hard to add to the existing cache management framework. > > Thoughts? > > I discussed this with Ted at one point also. This is a generic problem, > not just for readahead, because "fsck" can run multiple e2fsck in parallel > and in case of many large filesystems on a single node this can cause > memory usage problems also. > > What I was proposing is that "fsck.{fstype}" be modified to return an > estimated minimum amount of memory needed, and some "desired" amount of > memory (i.e. readahead) to fsck the filesystem, using some parameter like > "fsck.{fstype} --report-memory-needed /dev/XXX". If this does not > return the output in the expected format, or returns an error then fsck > will assume some amount of memory based on the device size and continue > as it does today. And while fsck is running, some other program runs that uses memory and blows your carefully calculated paramters to smithereens? I think there is a clear need for applications to be able to register a callback from the kernel to indicate that the machine as a whole is running out of memory and that the application should trim it's caches to reduce memory utilisation. Perhaps instead of swapping immediately, a SIGLOWMEM could be sent to a processes that aren't masking the signal followed by a short grace period to allow the processes to free up some memory before swapping out pages from that process? With this sort of feedback, the fsck process can scale back it's readahead and remove cached info that is not critical to what it is currently doing and thereby prevent readahead thrashing as memory usage of the fsck process itself grows. Another example where this could be useful is to tell browsers to release some of their cache rather than having the VM swap it out. IMO, a scheme like this will be far more reliable than trying to guess what the optimal settings are going to be over the whole lifetime of a process Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Parallelize IO for e2fsck
On Wed, Jan 16, 2008 at 01:30:43PM -0800, Valerie Henson wrote: > Hi y'all, > > This is a request for comments on the rewrite of the e2fsck IO > parallelization patches I sent out a few months ago. The mechanism is > totally different. Previously IO was parallelized by issuing IOs from > multiple threads; now a single thread issues fadvise(WILLNEED) and > then uses read() to complete the IO. Interesting. We ultimately rejected a similar patch to xfs_repair (pre-population the kernel block device cache) mainly because of low memory performance issues and it doesn't really enable you to do anything particularly smart with optimising I/O patterns for larger, high performance RAID arrays. The low memory problems were particularly bad; the readahead thrashing cause a slowdown of 2-3x compared to the baseline and often it was due to the repair process requiring all of memory to cache stuff it would need later. IIRC, multi-terabyte ext3 filesystems have similar memory usage problems to XFS, so there's a good chance that this patch will see the same sorts of issues. > Single disk performance doesn't change, but elapsed time drops by > about 50% on a big RAID-5 box. Passes 1 and 2 are parallelized. Pass > 5 is left as an exercise for the reader. Promising results, though Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] FIEMAP ioctl for ext4
On Tue, Nov 13, 2007 at 02:30:06AM +0530, Kalpak Shah wrote: > Recently there was discussion about an "FIle Extent MAP"(FIEMAP) ioctl for > efficiently mapping the extents and holes of a file. This will be many times > more efficient than FIBMAP by cutting down the number of ioctls. > > This patch adds the FIEMAP ioctl for ext4. The spec for the FIEMAP ioctl was > posted earlier by Andreas Dilger and can be found at: > http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg03944.html > } > + case EXT4_IOC_FIEMAP: { > + return ext4_fiemap(inode, filp, cmd, arg); > + } > > default: > return -ENOTTY; > Index: linux-2.6.23.1/include/linux/ext4_fs.h > === > --- linux-2.6.23.1.orig/include/linux/ext4_fs.h > +++ linux-2.6.23.1/include/linux/ext4_fs.h > @@ -228,15 +228,20 @@ struct ext4_new_group_data { > #define EXT4_IOC_SETFLAGS FS_IOC_SETFLAGS > #define EXT4_IOC_GETVERSION _IOR('f', 3, long) > #define EXT4_IOC_SETVERSION _IOW('f', 4, long) > +#define EXT4_IOC_GETRSVSZ_IOR('f', 5, long) > +#define EXT4_IOC_SETRSVSZ_IOW('f', 6, long) > #define EXT4_IOC_GROUP_EXTEND_IOW('f', 7, unsigned long) > #define EXT4_IOC_GROUP_ADD _IOW('f', 8,struct ext4_new_group_input) > +#define EXT4_IOC_FIEMAP _IOWR('f', 10, struct fiemap) Please make this common - we dont want a new ioctl for every filesystem; we want a single common to all filesystems. > +int ext4_fiemap(struct inode *inode, struct file *filp, unsigned int cmd, > + unsigned long arg) > +{ Most of this function will be common to all IOC_FIEMAP implementations. > + struct fiemap *fiemap_s; > + struct fiemap_internal fiemap_i; > + struct fiemap_extent *last_extent; > + ext4_fsblk_t start_blk; > + int fm_extent_size = sizeof(struct fiemap_extent); > + int err = 0; > + > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) > + return -EOPNOTSUPP; struct address_space *mapping = filp->f_mapping; if (!mapping->a_ops->fiemap) return -EOPNOTSUPP; > + > + fiemap_s = kmalloc(sizeof(*fiemap_s), GFP_KERNEL); > + if (fiemap_s == NULL) > + return -ENOMEM; > + if (copy_from_user(fiemap_s, (struct fiemap __user *)arg, > +sizeof(*fiemap_s))) > + return -EFAULT; This is common > + > + if (fiemap_s->fm_flags & EXT4_FIEMAP_FLAG_INCOMPAT_UNSUPP) > + return -EOPNOTSUPP; > + > + if (fiemap_s->fm_flags & FIEMAP_FLAG_SYNC) > + ext4_sync_file(filp, filp->f_dentry, 1); The common form is: if (fiemap_s->fm_flags & FIEMAP_FLAG_SYNC) filemap_write_and_wait(mapping); > + start_blk = fiemap_s->fm_start >> inode->i_sb->s_blocksize_bits; > + fiemap_i.fiemap_s = fiemap_s; > + fiemap_i.tot_mapping_len = 0; > + fiemap_i.cur_ext_ptr = (char *)(arg + sizeof(*fiemap_s)); > + fiemap_i.current_extent = 0; > + fiemap_i.err = 0; Seems common. > + > + /* > + * Walk the extent tree gathering extent information > + */ > + mutex_lock(&EXT4_I(inode)->truncate_mutex); > + err = ext4_ext_walk_space(inode, start_blk , EXT_MAX_BLOCK - start_blk, > + ext4_ext_fiemap_cb, &fiemap_i); > + mutex_unlock(&EXT4_I(inode)->truncate_mutex); This becomes: error = mapping->a_ops->fiemap(inode, ); and the lock, extent walk, etc becomes ext4_fiemap() which is set up in the a_ops for the filesystem. Any filesystems specific checks go there as well. > + if (err) > + return err; > + > + fiemap_s->fm_extent_count = fiemap_i.current_extent; > + fiemap_s->fm_length = fiemap_i.tot_mapping_len; > + /* > + * Mark last extent as EXTENT_LAST and copy the extent to userspace. > + */ > + if (fiemap_i.current_extent != 0 && > + fiemap_i.current_extent < fiemap_s->fm_extent_count && > + !(fiemap_s->fm_flags & FIEMAP_FLAG_NUM_EXTENTS)) { > + char *dest; > + > + last_extent = &fiemap_i.fm_extent; > + last_extent->fe_flags |= FIEMAP_EXTENT_LAST; > + dest = (char *)arg + sizeof(*fiemap_s) + fm_extent_size * > + (fiemap_s->fm_extent_count - 1); > + err = copy_to_user(dest, last_extent, fm_extent_size); > + if (err) > + return err; > + } > + err = copy_to_user((void *)arg, fiemap_s, sizeof(*fiemap_s)); > + > + return err; That's common, too. I don't want to see this implemented over and over again with minute variations and bugs. The common implementation should be called from in do_file_ioctl() like FIBMAP Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsub
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote: > By request on #linuxfs, here is the FIEMAP spec that we used to implement > the FIEMAP support for ext4. There was an ext4 patch posted on August 29 > to linux-ext4 entitled "[PATCH] FIEMAP ioctl". Link: http://marc.info/?l=linux-ext4&m=118838241209683&w=2 That's a very ext4 specific ioctl interface. Can we get this made generic like the FIBMAP interface so we don't have to replicate all the copyin/copyout handling and interface definitions everywhere? i.e. a ->extent_map aops callout to the filesystem in generic code just like ->bmap? > I've asked Kalpak to post > an updated version of that patch along with the changes to the "filefrag" > tool to use FIEMAP. Where can I find the test program that validates the implementation? Also, following the fallocate model, can we get the interface definition turned into a man page before anything is submitted upstream? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [xfs-masters] [RFC: 2.6 patch] make the *FS_SECURITY options no longer user visible
On Mon, Jul 30, 2007 at 08:27:47AM -0400, Stephen Smalley wrote: > On Mon, 2007-07-30 at 09:29 +1000, David Chinner wrote: > > On Sun, Jul 29, 2007 at 05:02:09PM +0200, Adrian Bunk wrote: > > > Please correct me if any of the following assumptions is wrong: > > > - SELinux is currently the only user of filesystem security labels > > > shipped with the Linux kernel > > > - if a user has SELinux enabled he wants his filesystems to support > > > security labels > > > > > > Based on these assumption, it doesn't make sense to have the > > > *FS_SECURITY user visible since we can perfectly determine automatically > > > when turning them on makes sense. > > > > Hmmm. The code in XFS is not dependent on selinux, but this change > > would mean that testing the security xattr namespace would require a > > selinux enabled kernel. > > > > I agree that the default for these should be "y" and selected if > > selinux is enabled, but forcing us to use selinux enabled kernels > > (on distro's that may not support selinux) just to test the > > security xattr namespace is a bit of a pain. > > You can enable SECURITY_SELINUX in the kernel config but still have it > boot disabled by default via SECURITY_SELINUX_BOOTPARAM_VALUE=0. Ok, that shouldn't cause a problem then. Objection withdrawn. ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] basic delayed allocation in VFS
On Sun, Jul 29, 2007 at 04:09:20PM +0400, Alex Tomas wrote: > David Chinner wrote: > >On Fri, Jul 27, 2007 at 11:51:56AM +0400, Alex Tomas wrote: > >But this is really irrelevant - the issue at hand is what we want > >for VFS level delalloc support. IMO, that mechanism needs to support > >both XFS and ext4, and I'd prefer if it doesn't perpetuate the > >bufferhead abuses of the past (i.e. define an iomap structure > >instead of overloading bufferheads yet again). > > I'm not sure I understand very well. ->get_blocks abuses bufferheads to provide an offset/length/state mapping. That's all it needs. That what the iomap structure is used for. It's smaller than a bufferhead, it's descriptive of it's use and you don't get it confused with the other 10 ways bufferheads are used and abused. > where would you track uptodate, dirty and other states then? > do you propose to separate block states from block mapping? No. They still get tracked in the bufferheads attached to the page. That's what bufferheads were originally intended for(*). Cheers, Dave. (*) I recently proposed a separate block map tree for this rather than using buffer heads for this because of the memory footprint of N bufferheads per page on contiguous mappings. That's future work, not something we really need to consider here. Chris Mason's extent map tree patches are a start on this concept. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [xfs-masters] [RFC: 2.6 patch] make the *FS_SECURITY options no longer user visible
On Sun, Jul 29, 2007 at 05:02:09PM +0200, Adrian Bunk wrote: > Please correct me if any of the following assumptions is wrong: > - SELinux is currently the only user of filesystem security labels > shipped with the Linux kernel > - if a user has SELinux enabled he wants his filesystems to support > security labels > > Based on these assumption, it doesn't make sense to have the > *FS_SECURITY user visible since we can perfectly determine automatically > when turning them on makes sense. Hmmm. The code in XFS is not dependent on selinux, but this change would mean that testing the security xattr namespace would require a selinux enabled kernel. I agree that the default for these should be "y" and selected if selinux is enabled, but forcing us to use selinux enabled kernels (on distro's that may not support selinux) just to test the security xattr namespace is a bit of a pain. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] basic delayed allocation in VFS
On Fri, Jul 27, 2007 at 11:51:56AM +0400, Alex Tomas wrote: > David Chinner wrote: > >Using a new API for new functionality is a bad thing? > > if existing API can be used ... Sure, but using the existing APIs is no good if the only filesystem in the kernel that supports delalloc cannot use the new code > >Also, looking at the way mpage_da_map_blocks() is done - if we have > >an 128MB delalloc extent - ext4 will allocate that will allocate it > >in one go, right? What happens if we then crash after only writing a > >few megabytes of that extent? stale data exposure? XFS can allocate > >multiple gigabytes in a single get_blocks call so even if ext4 can't > >do this, it's a problem for XFS. > > what happens if IO to 2nd MB is completed, while IO to 1st MB is not > (probably sitting in queue) ? do you update on-disk size in this case? > how do you track this? We're updating the in-memory on-disk inode here, not the actual inode on disk. That means that if we crashed right here, the file size on disk would not be changed at all and the filesystem would behave as if both writes did not ever occur and we simply end up with empty "preallocated" blocks beyond EOF But this is really irrelevant - the issue at hand is what we want for VFS level delalloc support. IMO, that mechanism needs to support both XFS and ext4, and I'd prefer if it doesn't perpetuate the bufferhead abuses of the past (i.e. define an iomap structure instead of overloading bufferheads yet again). > >So without the ability to attach specific I/O completions to bios > >or support for unwritten extents directly in __mpage_writepage, > >there is no way XFS can use this "generic" delayed allocation code. > > I didn't say "generic", see Subject: :) No, you didn't, but VFS level functionality implies that functionality is both generic and able to be used by all filesystems. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] basic delayed allocation in VFS
[please don't top post!] On Thu, Jul 26, 2007 at 05:33:08PM +0400, Alex Tomas wrote: > Jeff Garzik wrote: > >The XFS one is proven and the work was already completed. > > > >What were the specific technical issues that made it unsuitable for ext4? > > > >I would rather not reinvent the wheel, particularly if the reinvention > >is less capable than the existing work. > > It duplicates fs/mpage.c in bio building and introduces new generic API > (iomap, map_blocks_t, etc). Using a new API for new functionality is a bad thing? > In contrast, my trivial implementation re-use > existing code in fs/mpage.c, doesn't introduce new API and I tend to think > provides quite the same functionality. I can be wrong, of course ... No, it doesn't provide the same functionality. Firstly, XFS attaches a different I/O completion to delalloc writes to allow us to update the file size when the write is beyond the current on disk EOF. This code cannot do that as all it does is allocation and present "normal looking" buffers to the generic code path. Secondly, apart from delalloc, XFS cannot use the generic code paths for writeback because unwritten extent conversion also requires custom I/O completion handlers. Given that __mpage_writepage() only calls ->writepage when it is confused, XFS simply cannot use this API. Also, looking at the way mpage_da_map_blocks() is done - if we have an 128MB delalloc extent - ext4 will allocate that will allocate it in one go, right? What happens if we then crash after only writing a few megabytes of that extent? stale data exposure? XFS can allocate multiple gigabytes in a single get_blocks call so even if ext4 can't do this, it's a problem for XFS. So without the ability to attach specific I/O completions to bios or support for unwritten extents directly in __mpage_writepage, there is no way XFS can use this "generic" delayed allocation code. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5][TAKE8] manpage for fallocate
On Wed, Jul 18, 2007 at 08:41:55PM -0700, Mark Fasheh wrote: > On Sat, Jul 14, 2007 at 12:16:25AM +0530, Amit K. Arora wrote: > > After a successful call, subsequent writes are guaranteed not to fail > > because > > of lack of disk space. > > If a write to an unwritten region requires a node split, that could result > in the allocation of new meta data which obviously could fail if the disk is > truly full. % git-log 84e1e99f112dead8f9ba036c02d24a9f5ce7f544 |head -10 commit 84e1e99f112dead8f9ba036c02d24a9f5ce7f544 Author: David Chinner <[EMAIL PROTECTED]> Date: Mon Jun 18 16:50:27 2007 +1000 [XFS] Prevent ENOSPC from aborting transactions that need to succeed During delayed allocation extent conversion or unwritten extent conversion, we need to reserve some blocks for transactions reservations. We need to reserve these blocks in case a btree split occurs and we need to allocate some blocks. -- IOWs, XFS didn't provide this guarantee until about a month ago > Granted that's unlikely to happen but maybe we should be conservative and > say something like: > > "After a successful call, subsequent writes are guaranteed to never require > allocation of file data." ? Well, the above phrasing is taken directly from the posix_fallocate() man page, and it is intended that sys_fallocate() is used to implement posix_fallocate(). In that case, the semantics we have to provide are "writes are guaranteed not to fail due to lack of disk space". Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] introduce fallocate support into xfs_io
FYI. Initial support for fallocate-based pre-allocation in xfs_io for testing. This currently only works on ia64 because of the hard coded syscall number and will require autoconf magic to conditionally compile in this support. This allows simple command-line based testing of fallocate based allocation such as: # ~/xfs_io -f -c "falloc_resvsp 0 1024k" -c "bmap -vp" -c stat /mnt/scratch/fred /mnt/scratch/fred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSETTOTAL FLAGS 0: [0..2047]: 96..2143 0 (96..2143)2048 1 fd.path = "/mnt/scratch/fred" fd.flags = non-sync,non-direct,read-write stat.ino = 131 stat.type = regular file stat.size = 0 stat.blocks = 2048 fsxattr.xflags = 0x2 [-p] fsxattr.projid = 0 fsxattr.extsize = 0 fsxattr.nextents = 1 fsxattr.naextents = 0 dioattr.mem = 0x200 dioattr.miniosz = 512 dioattr.maxiosz = 2147483136 Or more complex cases: # ~/xfs_io -f \ > -c "falloc_allocsp 0 1024k" \ > -c "unresvsp 32k 32k" \ > -c "unresvsp 128k 64k" \ > -c "unresvsp 512k 256k" \ > -c"pwrite 0 16k" \ > -c "pwrite 96k 128k" \ > -c "pwrite 640k 384k" \ > -c "bmap -vp" \ > -c "falloc_resvsp 0 1024k" \ > -c "bmap -vvp" /mnt/scratch/fred wrote 16384/16384 bytes at offset 0 16 KiB, 4 ops; 0. sec (274.123 MiB/sec and 70175.4386 ops/sec) wrote 131072/131072 bytes at offset 98304 128 KiB, 32 ops; 0. sec (338.753 MiB/sec and 86720.8672 ops/sec) wrote 393216/393216 bytes at offset 655360 384 KiB, 96 ops; 0. sec (386.200 MiB/sec and 98867.1473 ops/sec) /mnt/scratch/fred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSETTOTAL FLAGS 0: [0..31]: 96..127 0 (96..127) 32 1: [32..63]:128..159 0 (128..159) 32 1 2: [64..127]: hole64 3: [128..191]: 224..287 0 (224..287) 64 1 4: [192..447]: 288..543 0 (288..543) 256 5: [448..1023]: 544..1119 0 (544..1119)576 1 6: [1024..1279]:hole 256 7: [1280..2047]:1376..21430 (1376..2143) 768 /mnt/scratch/fred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSETTOTAL FLAGS 0: [0..31]: 96..127 0 (96..127) 32 1: [32..191]: 128..287 0 (128..287) 160 1 2: [192..447]: 288..543 0 (288..543) 256 3: [448..1279]: 544..1375 0 (544..1375)832 1 4: [1280..2047]:1376..21430 (1376..2143) 768 FLAG Values: 01 Unwritten preallocated extent 001000 Doesn't begin on stripe unit 000100 Doesn't end on stripe unit 10 Doesn't begin on stripe width 01 Doesn't end on stripe width Yes, that looks like it filled all the holes properly, and the allocator allocated the right holes on disk to merge adjacent extents when hole filling. ;) --- xfsprogs/io/prealloc.c | 72 + 1 file changed, 72 insertions(+) Index: xfs-cmds/xfsprogs/io/prealloc.c === --- xfs-cmds.orig/xfsprogs/io/prealloc.c2006-11-15 19:00:31.0 +1100 +++ xfs-cmds/xfsprogs/io/prealloc.c 2007-07-16 15:25:44.041513574 +1000 @@ -26,6 +26,8 @@ static cmdinfo_t allocsp_cmd; static cmdinfo_t freesp_cmd; static cmdinfo_t resvsp_cmd; static cmdinfo_t unresvsp_cmd; +static cmdinfo_t falloc_allocsp_cmd; +static cmdinfo_t falloc_resvsp_cmd; static int offset_length( @@ -119,6 +121,56 @@ unresvsp_f( return 0; } +/* + * Hack, hack, hackety-hack-hack. + * + * This only works for ia64... + */ +#define __NR_fallocate1303 + +/* + * someday there'll be a real header file + */ +#define FALLOC_FL_KEEP_SIZE 0x01 +#define FALLOC_ALLOCATE 0x0 +#define FALLOC_RESV_SPACE FALLOC_FL_KEEP_SIZE + +static int +fallocate_allocsp_f( + int argc, + char**argv) +{ + xfs_flock64_t segment; + + if (!offset_length(argv[1], argv[2], &segment)) + return 0; + + if (syscall(__NR_fallocate, file->fd, FALLOC_ALLOCATE, + segment.l_start, segment.l_len)) { + perror("FALLOC_ALLOCATE"); + return 0; + } + return 0; +} + +static int +fallocate_resvsp_f( + int argc, + char**argv) +{ + xfs_flock64_t segment; + + if (!offset_length(argv[1], argv[2], &segment)) + return 0; + + if (syscall(__NR_fallocate, file->fd, FALLOC_RESV_SPACE, + segment.l_start, segment.l_len)) { + perror("FALLOC_ALLOCATE"); + return 0; + } + return 0; +} + void prealloc_init(void) { @@ -156,8 +208,28 @@ prealloc_init(void) unresvsp_cmd.oneline = _("frees reserved space
[PATCH] xfs: implement fallocate V2
Initial implementation of ->fallocate for XFS. Version 2: o Make allocation and setting the file size atomic. o Drop deallocate/punch functionality o use mode field appropriately to determine if size needs changing. --- fs/xfs/linux-2.6/xfs_iops.c | 47 1 file changed, 47 insertions(+) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c === --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c 2007-07-16 14:16:02.090255611 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c 2007-07-16 14:50:07.087885337 +1000 @@ -51,6 +51,7 @@ #include #include #include +#include /* * Get a XFS inode from a given vnode. @@ -812,6 +813,51 @@ xfs_vn_removexattr( return namesp->attr_remove(vp, attr, xflags); } +/* + * generic space allocation vector. + * + * This should really through a bhv_vop before stuffing around + * with xfs_inodes and such. + */ +STATIC long +xfs_vn_fallocate( + struct inode*inode, + int mode, + loff_t offset, + loff_t len) +{ + longerror = -EOPNOTSUPP; + bhv_vnode_t *vp = vn_from_inode(inode); + bhv_desc_t *bdp; + loff_t new_size = 0; + xfs_flock64_t bf; + + bf.l_whence = 0; + bf.l_start = offset; + bf.l_len = len; + + bdp = bhv_lookup_range(VN_BHV_HEAD(vp), VNODE_POSITION_XFS, + VNODE_POSITION_XFS); + + xfs_ilock(xfs_vtoi(vp), XFS_IOLOCK_EXCL); + error = xfs_change_file_space(bdp, XFS_IOC_RESVSP, &bf, 0, NULL, + ATTR_NOLOCK); + if (!error && !(mode & FALLOC_FL_KEEP_SIZE) && + offset + len > i_size_read(inode)) + new_size = offset + len; + + /* Change file size if needed */ + if (new_size) { + bhv_vattr_t va; + + va.va_mask = XFS_AT_SIZE; + va.va_size = new_size; + error = bhv_vop_setattr(vp, &va, ATTR_NOLOCK, NULL); + } + + xfs_iunlock(xfs_vtoi(vp), XFS_IOLOCK_EXCL); + return error; +} const struct inode_operations xfs_inode_operations = { .permission = xfs_vn_permission, @@ -822,6 +868,7 @@ const struct inode_operations xfs_inode_ .getxattr = xfs_vn_getxattr, .listxattr = xfs_vn_listxattr, .removexattr= xfs_vn_removexattr, + .fallocate = xfs_vn_fallocate, }; const struct inode_operations xfs_dir_inode_operations = { -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ia64 fallocate system call
sys_fallocate for ia64. This uses the empty slot originally reserved for move_pages. Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]> --- arch/ia64/kernel/entry.S |2 +- include/asm-ia64/unistd.h |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Index: 2.6.x-xfs-new/arch/ia64/kernel/entry.S === --- 2.6.x-xfs-new.orig/arch/ia64/kernel/entry.S 2007-07-16 14:18:51.432168485 +1000 +++ 2.6.x-xfs-new/arch/ia64/kernel/entry.S 2007-07-16 14:22:08.582454284 +1000 @@ -1581,7 +1581,7 @@ sys_call_table: data8 sys_sync_file_range // 1300 data8 sys_tee data8 sys_vmsplice - data8 sys_ni_syscall// reserved for move_pages + data8 sys_fallocate data8 sys_getcpu data8 sys_epoll_pwait // 1305 data8 sys_utimensat Index: 2.6.x-xfs-new/include/asm-ia64/unistd.h === --- 2.6.x-xfs-new.orig/include/asm-ia64/unistd.h2007-06-08 21:36:31.0 +1000 +++ 2.6.x-xfs-new/include/asm-ia64/unistd.h 2007-07-16 14:22:41.166204402 +1000 @@ -292,7 +292,7 @@ #define __NR_sync_file_range 1300 #define __NR_tee 1301 #define __NR_vmsplice 1302 -/* 1303 reserved for move_pages */ +#define __NR_fallocate 1303 #define __NR_getcpu1304 #define __NR_epoll_pwait 1305 #define __NR_utimensat 1306 - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6][TAKE7] manpage for fallocate
On Fri, Jul 13, 2007 at 06:16:01PM +0530, Amit K. Arora wrote: > Following is the modified version of the manpage originally submitted by > David Chinner. Please use `nroff -man fallocate.2 | less` to view. > > This includes changes suggested by Heikki Orsila and Barry Naujok. Can we get itemised change logs for all these patches from now on? > .TH fallocate 2 > .SH NAME > fallocate \- allocate or remove file space If fallocate is just being used for allocating space this is wrong. maybe - "manipulate file space" instead? dd> .TP > .B FALLOC_RESV_SPACE > provides the same functionality as > .B FALLOC_ALLOCATE > except it does not ever change the file size. This allows allocation > of zero blocks beyond the end of file and is useful for optimising "of zeroed blocks" -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote: > > Why don't we just merge the interface for preallocation (essentially > enough to satisfy posix_fallocate() and the simple XFS requirement for > space reservation without changing file size), which there is clear agreement > on (I hope :)). After all, this was all that we set out to do when we > started. > > And leave all the dealloc/punch/hsm type features for separate future patches/ > debates, those really shouldn't hold up the basic fallocate interface. > I agree with Christoph that we are just diverging too much in trying to > club those decisions here. > > Dave, Andreas, Ted ? Sure. I'll just make XFS work with whatever it is that gets merged. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Teach do_mpage_readpage() about unwritten buffers
On Mon, Jul 02, 2007 at 08:28:27PM -0700, Andrew Morton wrote: > On Tue, 3 Jul 2007 11:10:19 +1000 David Chinner <[EMAIL PROTECTED]> wrote: > > Seems sane, although one does wonder whether it's a worthy tradeoff. We > add additional overhead to readpage[s]() just to avoid some IO during > mkswap? It removes the hidden magic from XFS that hides unwritten extents behind holes so that they get zero filled by readpages. With other filesystems starting to use unwritten extents, these sorts of tricksy hacks should be avoided and we should be explicitly handling unwritten buffers just like we explicitly handle holes. The side effect of this change is that other things (like sys_swapon) behave sanely when faced with unwritten extents (i.e. they are not detected incorrectly as holes). FWIW, using unwritten extents for swap files means you can allocate gigabytes of swap space on demand, even under severe memory pressure because you don't need to write those gigabytes of data to disk. Also, the only way to read the contents of the swap file is to go directly to the underlying device so even if you screw up the permissions on the swap file it will still always read as zeros. So there are some benefits here Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Sat, Jun 30, 2007 at 11:21:11AM +0100, Christoph Hellwig wrote: > On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote: > > > Can you clarify - what is the current behaviour when ENOSPC (or some other > > > error) is hit? Does it keep the current fallocate() or does it free it? > > > > Currently it is left on the file system implementation. In ext4, we do > > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may > > end up with partial (pre)allocation. This is inline with dd and > > posix_fallocate, which also do not free the partially allocated space. > > I can't find anything in the specification of posix_fallocate > (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html) > that tells what should happen to allocate blocks on error. Yeah, and AFAICT glibc leaves them behind ATM. > But common sense would be to not leak disk space on failure of this > syscall, and this definitively should not be left up to the filesystem, > either we always leak it or always free it, and I'd strongly favour > the latter variant. We can't simply walk the range an remove unwritten extents, as some of them may have been present before the fallocate() call. That makes it extremely difficult to undo a failed call and not remove more pre-existing pre-allocations. Given the current behaviour for posix_fallocate() in glibc, I think that retaining the same error semantic and punting the cleanup to userspace (where the app will fail with ENOSPC anyway) is the only sane thing we can do here. Trying to undo this in the kernel leads to lots of extra rarely used code in error handling paths... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, Jun 28, 2007 at 11:49:13PM +0530, Amit K. Arora wrote: > On Wed, Jun 27, 2007 at 09:18:04AM +1000, David Chinner wrote: > > On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote: > > > On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote: > > > > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote: > > > > > Can you clarify - what is the current behaviour when ENOSPC (or some > > > > > other > > > > > error) is hit? Does it keep the current fallocate() or does it free > > > > > it? > > > > > > > > Currently it is left on the file system implementation. In ext4, we do > > > > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may > > > > end up with partial (pre)allocation. This is inline with dd and > > > > posix_fallocate, which also do not free the partially allocated space. > > > > > > Since I believe the XFS allocation ioctls do it the opposite way (free > > > preallocated space on error) this should be encoded into the flags. > > > Having it "filesystem dependent" just means that nobody will be happy. > > > > No, XFs does not free preallocated space on error. it is up to the > > application to clean up. > > Since XFS also does not free preallocated space on error and this > behavior is inline with dd, posix_fallocate() and the current ext4 > implementation, do we still need FA_FL_FREE_ENOSPC flag ? Not at the moment. > > > What I mean is that any data read from the file should have the > > > "appearance" > > > of being zeroed (whether zeroes are actually written to disk or not). > > > What > > > I _think_ David is proposing is to allow fallocate() to return without > > > marking the blocks even "uninitialized" and subsequent reads would return > > > the old data from the disk. > > > > Correct, but for swap files that's not an issue - no user should be able > > too read them, and FA_MKSWAP would really need root privileges to execute. > > Will the FA_MKSWAP mode still be required with your suggested change of > teaching do_mpage_readpage() about unwritten extents being in place ? > Or, will you still like to have FA_MKSWAP mode ? budgie:/mnt/test # xfs_io -f -c "resvsp 0 1048576" -c "truncate 1048576" swap_file budgie:/mnt/test # mkswap swap_file Setting up swapspace version 1, size = 1032 kB budgie:/mnt/test # swapon -v swap_file swapon on swap_file budgie:/mnt/test # swapon -s FilenameTypeSizeUsedPriority /dev/sda2 partition 9437152 0 -1 /mnt/test/swap_file file992 0 -2 budgie:/mnt/test # xfs_bmap -vvp swap_file swap_file: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSETTOTAL FLAGS 0: [0..31]: 96..127 0 (96..127) 32 1: [32..2047]: 128..2143 0 (128..2143) 2016 1 FLAG Values: 01 Unwritten preallocated extent 001000 Doesn't begin on stripe unit 000100 Doesn't end on stripe unit 10 Doesn't begin on stripe width 01 Doesn't end on stripe width Looks like the changes work, so FA_MKSWAP is not necessary for XFS. We can drop that for the moment unless anyone else sees a need for it. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Thu, Jun 28, 2007 at 09:28:36AM +1000, Nathan Scott wrote: > On Wed, 2007-06-27 at 23:36 +1000, David Chinner wrote: > > Allows setup_swap_extents() to use preallocated files on XFS > > filesystems for swap files without ever needing to convert them. > > Using unwritten extents (as opposed to the MKSWAP flag mentioned > earlier) has the unfortunate down side of requiring transactions, > possibly additional IO, and memory allocation during swap. (but, > this patch should probably go in regardless, as teaching generic > code about unwritten extents is not a bad idea). I don't think it does - swapfile I/O looks like it goes direct to bio without passing through the filesystem. When the swapfile is mapped, it scans and records the extent map of the entire swapfile in a separate structure and AFAICT the swap code uses that built map without touching the filesystem at all. If that is true then the written/unwritten state of the extents is irrelevant; all we need is allocated disk space for the file and swapping should work. And it's not like anyone should be reading the contents of that swapfile through the filesystem, either. ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jun 26, 2007 at 11:49:15PM -0400, Andreas Dilger wrote: > On Jun 27, 2007 09:14 +1000, David Chinner wrote: > > Someone on the XFs list had an interesting request - preallocated > > swap files. You can't use unwritten extents for this because > > of sys_swapon()s use of bmap() (XFS returns holes for reading > > unwritten extents), so we need a method of preallocating that does > > not zero or mark the extent unread. i.e. FA_MKSWAP. > > Is there a reason why unwritten extents return 0 to bmap()? It's a fallout of xfs_get_blocks not mapping unwritten extents on read because we want do_mpage_readpage() to treat them as a hole. i.e. zero fill them instead of doing I/O. This is the way XFS was shoehorned into the generic read path :/ > This > would seem to be the only impediment from using fallocated files > for swap files. Maybe if FIEMAP was used by mkswap to get an > "UNWRITTEN" flag back instead of "HOLE" it wouldn't be a problem. Probably. If we taught do_mpage_readpage() about unwritten mappings, then would could map them on read if and then sys_swapon can remain blissfully unaware of unwritten extents. I think this is pretty much all I need to do to acheive that is (untested): --- Teach do_mpage_readpage() about unwritten extents so we can always map them in get_blocks rather than they are are holes on read. Allows setup_swap_extents() to use preallocated files on XFS filesystems for swap files without ever needing to convert them. Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]> --- fs/mpage.c |5 +++-- fs/xfs/linux-2.6/xfs_aops.c | 13 +++-- 2 files changed, 6 insertions(+), 12 deletions(-) Index: 2.6.x-xfs-new/fs/mpage.c === --- 2.6.x-xfs-new.orig/fs/mpage.c 2007-05-29 16:17:59.0 +1000 +++ 2.6.x-xfs-new/fs/mpage.c2007-06-27 22:39:35.568852270 +1000 @@ -207,7 +207,8 @@ do_mpage_readpage(struct bio *bio, struc * Map blocks using the result from the previous get_blocks call first. */ nblocks = map_bh->b_size >> blkbits; - if (buffer_mapped(map_bh) && block_in_file > *first_logical_block && + if (buffer_mapped(map_bh) && !buffer_unwritten(map_bh) && + block_in_file > *first_logical_block && block_in_file < (*first_logical_block + nblocks)) { unsigned map_offset = block_in_file - *first_logical_block; unsigned last = nblocks - map_offset; @@ -242,7 +243,7 @@ do_mpage_readpage(struct bio *bio, struc *first_logical_block = block_in_file; } - if (!buffer_mapped(map_bh)) { + if (!buffer_mapped(map_bh) || buffer_unwritten(map_bh)) { fully_mapped = 0; if (first_hole == blocks_per_page) first_hole = page_block; Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c === --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-06-05 22:14:39.0 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2007-06-27 22:39:29.545636749 +1000 @@ -1340,16 +1340,9 @@ __xfs_get_blocks( return 0; if (iomap.iomap_bn != IOMAP_DADDR_NULL) { - /* -* For unwritten extents do not report a disk address on -* the read case (treat as if we're reading into a hole). -*/ - if (create || !(iomap.iomap_flags & IOMAP_UNWRITTEN)) { - xfs_map_buffer(bh_result, &iomap, offset, - inode->i_blkbits); - } - if (create && (iomap.iomap_flags & IOMAP_UNWRITTEN)) { - if (direct) + xfs_map_buffer(bh_result, &iomap, offset, inode->i_blkbits); + if (iomap.iomap_flags & IOMAP_UNWRITTEN) { + if (create && direct) bh_result->b_private = inode; set_buffer_unwritten(bh_result); } Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/7][TAKE5] ext4: support new modes
On Wed, Jun 27, 2007 at 12:59:08AM +0530, Amit K. Arora wrote: > On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote: > > On Jun 26, 2007 17:37 +0530, Amit K. Arora wrote: > > > > I also thought another proposed flag was to determine whether mtime (and > > > > maybe ctime) is changed when doing prealloc/dealloc space? Default > > > > should > > > > probably be to change mtime/ctime, and have FA_FL_NO_MTIME. Someone > > > > else > > > > should decide if we want to allow changing the file w/o changing ctime, > > > > if > > > > that is required even though the file is not visibly changing. Maybe > > > > the > > > > ctime update should be implicit if the size or mtime are changing? > > > > > > Is it really required ? I mean, why should we allow users not to update > > > ctime/mtime even if the file metadata/data gets updated ? It sounds > > > a bit "unnatural" to me. > > > Is there any application scenario in your mind, when you suggest of > > > giving this flexibility to userspace ? > > > > One reason is that XFS does NOT update the mtime/ctime when doing the > > XFS_IOC_* allocation ioctls. Not totally correct. XFS_IOC_ALLOCSP/FREESP change timestamps if they change the file size (via the truncate call made to change the file size). If they don't change the file size, then they are a no-op and should not change the file size. XFS_IOC_RESVSP/UNRESVSP don't change timestamps just like they don't change file size. That is by design AFAICT so these calls can be used by HSM-type applications that don't want to change timestamps when punching out data blocks or preallocating new ones. > Hmm.. I personally will call it a bug in XFS code then. :) No, I'd call it useful. :) > > > I think, modifying ctime/mtime should be dependent on the other flags. > > > E.g., if we do not zero out data blocks on allocation/deallocation, > > > update only ctime. Otherwise, update ctime and mtime both. > > > > I'm only being the advocate for requirements David Chinner has put > > forward due to existing behaviour in XFS. This is one of the reasons > > why I think the "flags" mechanism we now have - we can encode the > > various different behaviours in any way we want and leave it to the > > caller. > > I understand. May be we can confirm once more with David Chinner if this > is really required. Will it really be a compatibility issue if new XFS > preallocations (ie. via fallocate) update mtime/ctime? It should be left up to the filesystem to decide. Only the filesystem knows whether something changed and the timestamp should or should not be updated. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jun 26, 2007 at 11:42:50AM -0400, Andreas Dilger wrote: > On Jun 26, 2007 16:15 +0530, Amit K. Arora wrote: > > On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote: > > > In XFS one of the (many) ALLOC modes is to zero existing data on allocate. > > > For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on > > > each extent. For some workloads this would be much faster than truncate > > > and reallocate of all the blocks in a file. > > > > In ext4, we already mark each extent having preallocated blocks as > > uninitialized. This is done as part of following code (which is part of > > patch 5/7) in ext4_ext_get_blocks() : > > What I meant is that with XFS_IOC_ALLOCSP the previously-written data > is ZEROED OUT, unlike with fallocate() which leaves previously-written > data alone and only allocates in holes. > > So, if you had a sparse file with some data in it: > > A BB > > fallocate() would allocate the holes: > > 0A0BB > > XFS_IOC_ALLOCSP would overwrite everything: > > 0 No, it wouldn't. XFS_IOC_ALLOCSP would give you: A BB because it only allocates the space between the old EOF and the new EOF. Graphic demonstration - write 4k @ 4k, 4k @ 16k, allocsp out to 32k: budgie:~ # xfs_io -f \ > -c "pwrite 4096 4096" \ > -c "pwrite 16384 4096" \ > -c "bmap -vvp" \ > -c "allocsp 32768 0" \ > -c "bmap -vvp" \ > /mnt/test/alfred wrote 4096/4096 bytes at offset 4096 4 KiB, 1 ops; 0. sec (108.507 MiB/sec and 2.7778 ops/sec) wrote 4096/4096 bytes at offset 16384 4 KiB, 1 ops; 0. sec (260.417 MiB/sec and 6.6667 ops/sec) /mnt/test/alfred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..7]: hole 8 1: [8..15]: 5226864..5226871 4 (1022160..1022167) 8 2: [16..31]:hole 16 3: [32..39]:5226888..5226895 4 (1022184..1022191) 8 /mnt/test/alfred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..7]: hole 8 1: [8..15]: 5226864..5226871 4 (1022160..1022167) 8 2: [16..31]:hole 16 3: [32..63]:5226888..5226919 4 (1022184..1022215)32 budgie:~ # Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote: > On Jun 25, 2007 19:15 +0530, Amit K. Arora wrote: > > +#define FA_FL_DEALLOC 0x01 /* default is allocate */ > > +#define FA_FL_KEEP_SIZE0x02 /* default is extend/shrink size */ > > +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC > > */ > > In XFS one of the (many) ALLOC modes is to zero existing data on allocate. No, none of the XFS allocation modes do that. XFS_IOC_ALLOCSP, which does write zeros to disk, only allocates and writes zeros in the range between the old file size and the new file size. XFS_IOC_RESVSP, which alocates unwritten extents, only allocates where extents do not currently exist. It does not zero existing extents. IOWs, you can't overwrite existing data with XFS preallocation. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote: > On Jun 26, 2007 16:02 +0530, Amit K. Arora wrote: > > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote: > > > Can you clarify - what is the current behaviour when ENOSPC (or some other > > > error) is hit? Does it keep the current fallocate() or does it free it? > > > > Currently it is left on the file system implementation. In ext4, we do > > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may > > end up with partial (pre)allocation. This is inline with dd and > > posix_fallocate, which also do not free the partially allocated space. > > Since I believe the XFS allocation ioctls do it the opposite way (free > preallocated space on error) this should be encoded into the flags. > Having it "filesystem dependent" just means that nobody will be happy. No, XFs does not free preallocated space on error. it is up to the application to clean up. > What I mean is that any data read from the file should have the "appearance" > of being zeroed (whether zeroes are actually written to disk or not). What > I _think_ David is proposing is to allow fallocate() to return without > marking the blocks even "uninitialized" and subsequent reads would return > the old data from the disk. Correct, but for swap files that's not an issue - no user should be able too read them, and FA_MKSWAP would really need root privileges to execute. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6][TAKE5] fallocate system call
On Mon, Jun 25, 2007 at 06:58:10PM +0530, Amit K. Arora wrote: > 2) The above new patches (4/7 and 7/7) are based on the dicussion >between Andreas Dilger and David Chinner on the mode argument, >when later posted a man page on fallocate. Can you include the man page in this patch set, please? That way it can be kept up to date with the rest of the patch set. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote: > On Jun 25, 2007 20:33 +0530, Amit K. Arora wrote: > > I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as > > *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323 post. > > If it is decided that these flags are also needed, I will update this > > patch. Thanks! > > Can you clarify - what is the current behaviour when ENOSPC (or some other > error) is hit? Does it keep the current fallocate() or does it free it? > > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we > don't want to expose uninitialized disk blocks to userspace. I'm not > sure if this makes sense at all. Someone on the XFs list had an interesting request - preallocated swap files. You can't use unwritten extents for this because of sys_swapon()s use of bmap() (XFS returns holes for reading unwritten extents), so we need a method of preallocating that does not zero or mark the extent unread. i.e. FA_MKSWAP. I thinkthis would be: #define FA_FL_NO_ZERO_SPACE 0x08/* default is to zero space */ #define FA_MKSWAP (FA_ALLOCATE | FA_FL_NO_ZERO_SPACE) That way we can allocate large swap files that don't need zeroing in a single, fast operation, and hence potentially bring new swap space online without needed very much memory at all (i.e. should succeed in most near-OOM conditions). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iov_iter_fault_in_readable fix
On Thu, Jun 14, 2007 at 06:31:53PM +0100, Christoph Hellwig wrote: > On Wed, Jun 13, 2007 at 05:57:59PM +0400, Dmitriy Monakhov wrote: > > Function prerform check for signgle region, with out respect to > > segment nature of iovec, For example writev no longer works :) > > Btw, could someone please start to collect all sniplets like this in > a nice simple regression test suite? If no one wants to start a new > one we should probably just put it into xfsqa (which should be useable > for other filesystems aswell despite the name) Yeah, it can run a subset of the tests on NFS and UDF filesystems as well and there are some specific UDF-only tests in it too. I think the NFS test group is mostly generic tests that don't use or test specific XFS features. I'd be happy to accumulate tests like these in xfsqa... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote: > On Jun 14, 2007 09:52 +1000, David Chinner wrote: > > B FA_PREALLOCATE > > provides the same functionality as > > B FA_ALLOCATE > > except it does not ever change the file size. This allows allocation > > of zero blocks beyond the end of file and is useful for optimising > > append workloads. > > TP > > B FA_DEALLOCATE > > removes the underlying disk space with the given range. The disk space > > shall be removed regardless of it's contents so both allocated space > > from > > B FA_ALLOCATE > > and > > B FA_PREALLOCATE > > as well as from > > B write(3) > > will be removed. > > B FA_DEALLOCATE > > shall never remove disk blocks outside the range specified. > > So this is essentially the same as "punch". Depends on your definition of "punch". > There doesn't seem to be > a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the > end. ftruncate() > > B FA_DEALLOCATE > > shall never change the file size. If changing the file size > > is required when deallocating blocks from an offset to end > > of file (or beyond end of file) is required, > > B ftuncate64(3) > > should be used. > > This also seems to be a bit of a wart, since it isn't a natural converse > of either of the above functions. How about having two modes, > similar to FA_ALLOCATE and FA_PREALLOCATE? whatever. > Say, FA_PUNCH (which > would be as you describe here - deletes all data in the specified > range changing the file size if it overlaps EOF, Punch means different things to different people. To me (and probably most XFS aware ppl) punch implies no change to the file size. i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching holes to leave the file size unchanged. This is the behaviour I described for FA_DEALLOCATE. > and FA_DEALLOCATE, > which only deallocates unused FA_{PRE,}ALLOCATE space? That's an "unwritten-to-hole" extent conversion. Is that really useful for anything? That's easily implemented with FIEMAP and FA_DEALLOCATE. Anyway, because we can't agree on a single pair of flags: FA_ALLOCATE== posix_fallocate() FA_DEALLOCATE == unwritten-to-hole ??? FA_RESV_SPACE == XFS_IOC_RESVSP64 FA_UNRESV_SPACE== XFS_IOC_UNRESVSP64 > We might also consider making @mode be a mask instead of an enumeration: > > FA_FL_DEALLOC 0x01 (default allocate) > FA_FL_KEEP_SIZE 0x02 (default extend/shrink size) > FA_FL_DEL_DATA0x04 (default keep written data on DEALLOC) i.e: #define FA_ALLOCATE 0 #define FA_DEALLOCATE FA_FL_DEALLOC #define FA_RESV_SPACE FA_FL_KEEP_SIZE #define FA_UNRESV_SPACE FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA > I suppose it might be a bit late in the game to add a "goal" > parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make > the API more suitable for XFS? It would suffice for the simpler operations, I think, but we'll rapidly run out of flags and we'll still need another interface for doing complex stuff. > The goal could be a single __u64, or > a struct with e.g. __u64 byte offset (possibly also __u32 lun like > in FIEMAP). I guess the one potential limitation here is the > number of function parameters on some architectures. To be useful it needs to __u64. > > B ENOSPC > > There is not enough space left on the device containing the file > > referred to by > > IR fd. > > Should probably say whether space is removed on failure or not. In Right. I'd say on error you need to FA_DEALLOCATE to ensure any space allocated was freed back up. That way the error handling in the allocate functions is much simpler (i.e. no need to undo there). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote: > Did you get time to write the above man page ? It will help to push > further patches in time (eg. for FA_PREALLOCATE mode). First pass is attached. `nroff -man fallocate.2 | less` to view. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group .TH fallocate 2 .SH NAME fallocate \- allocate or remove file space .SH SYNOPSIS .nf .B #include .PP .BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len); .Op .SH DESCRIPTION The .BR fallocate syscall allows a user to directly manipulate the allocated disk space for the file referred to by .I fd for the byte range starting at .IR offset and continuing for .IR len bytes. The .I mode parameter determines the operation to be performed on the given range. Currently there are three modes: .TP .B FA_ALLOCATE allocates and initialises to zero the disk space within the given range. After a successful call, subsequent writes are guaranteed not to fail because of lack of disk space. If the size of the file is less than IR offset + len , then the file is increased to this size; otherwise the file size is left unchanged. B FA_ALLOCATE closely resembles B posix_fallocate(3) and is intended as a method of optimally implementing this function. B FA_ALLOCATE may allocate a larger range that was specified. TP B FA_PREALLOCATE provides the same functionality as B FA_ALLOCATE except it does not ever change the file size. This allows allocation of zero blocks beyond the end of file and is useful for optimising append workloads. TP B FA_DEALLOCATE removes the underlying disk space with the given range. The disk space shall be removed regardless of it's contents so both allocated space from B FA_ALLOCATE and B FA_PREALLOCATE as well as from B write(3) will be removed. B FA_DEALLOCATE shall never remove disk blocks outside the range specified. B FA_DEALLOCATE shall never change the file size. If changing the file size is required when deallocating blocks from an offset to end of file (or beyond end of file) is required, B ftuncate64(3) should be used. SH "RETURN VALUE" BR fallocate() returns zero on success, or an error number on failure. Note that IR errno is not set. SH "ERRORS" TP B EBADF I fd is not a valid file descriptor, or is not opened for writing. TP B EFBIG I offset+len exceeds the maximum file size. TP B EINVAL I offset or I len was less than 0. TP B ENODEV I fd does not refer to a regular file or a directory. TP B ENOSPC There is not enough space left on the device containing the file referred to by IR fd. TP B ESPIPE I fd refers to a pipe of file descriptor. B ENOSYS The filesystem underlying the file descriptor does not support this operation. SH AVAILABILITY The BR fallocate () system call is available since 2.6.XX SH "SEE ALSO" BR syscall (2), BR posix_fadvise (3) BR ftruncate (3)
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote: > On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote: > > Minimal definition to replace what applicaitons use on XFS and to > > support poasix_fallocate are the thre that have been mentioned so > > far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them > > all in a man page... > > Hi Dave, > > Did you get time to write the above man page ? It will help to push > further patches in time (eg. for FA_PREALLOCATE mode). No, I didn't. Instead of working on new preallocation stuff, I've been spending all my time fixing bugs found by new and interesting (ab)uses of preallocation and hole punching. > The idea I had was to push the patch with bare minimum functionality > (FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other > new mode(s) based on the man page you planned to provide. Push them. I'll just make XFS work with whatever is provided. Is there a test harness for the syscall yet? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] i_version update
On Wed, May 30, 2007 at 04:32:57PM -0700, Mingming Cao wrote: > On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote: > > On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote: > > > Hi, > > > > > > This is an update of the i_version patch. > > > The i_version field is a 64bit counter that is set on every inode > > > creation and that is incremented every time the inode data is modified > > > (similarly to the "ctime" time-stamp). > > > > My understanding (please correct me if I'm wrong) is that the > > requirements are much more rigourous than simply incrementing an in > > memory counter on every change. i.e. the this counter has to > > survive server crashes intact so clients never see the counter go > > backwards. That means version number changes need to be journalled > > along with the operation that caused the change of the version > > number. > > > Yeah, the i_version is the in memeory counter. From the patch it looks > like the counter is being updated inside ext4_mark_iloc_dirty(), so it > is being journalled and being flush to on-disk ext4 inode structure > immediately (On-disk ext4 inode structure is being modified/expanded to > store the counter in the first patch). Ok, that catches most things (I missed that), but the version number still needs to change on file data changes, right? So if we are overwriting the file, we're calling __mark_inode_dirty(I_DIRTY_PAGES) which means you don't get the callout and so the version number doesn't change or get logged. In that case, the version number is not doing what it needs to do, right? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] i_version update
On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote: > Hi, > > This is an update of the i_version patch. > The i_version field is a 64bit counter that is set on every inode > creation and that is incremented every time the inode data is modified > (similarly to the "ctime" time-stamp). My understanding (please correct me if I'm wrong) is that the requirements are much more rigourous than simply incrementing an in memory counter on every change. i.e. the this counter has to survive server crashes intact so clients never see the counter go backwards. That means version number changes need to be journalled along with the operation that caused the change of the version number. > The aim is to fulfill a NFSv4 requirement for rfc3530: > "5.5. Mandatory Attributes - Definitions > Name # DataType Access Description > ___ > change3 uint64 READ A value created by the > server that the client can use to determine if file > data, directory contents or attributes of the object File data writes are included in this list of things that need to increment the version field. Hence to fulfill the crash requirement, that implies server data writes either need to be synchronous or journalled... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote: > On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote: > > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote: > > > > Following changes were made to the previous version: > > > 1) Added description before sys_fallocate() definition. > > > 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, > > > posix_fallocate should return EINVAL for len <= 0. > > > 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE > > > 4) Do not return ENODEV for dirs (let individual file systems decide if > > > they want to support preallocation to directories or not. > > > 5) Check for wrap through zero. > > > 6) Update c/mtime if fallocate() succeeds. > > > > Please don't make this always happen. c/mtime updates should be dependent > > on the mode being used and whether there is visible change to the file. If > > no > > userspace visible changes to the file occurred, then timestamps should not > > be changed. > > i_blocks will be updated, so it seems reasonable to update ctime. mtime > shouldn't be changed, though, since the contents of the file will be > unchanged. That's assuming blocks were actually allocated - if the prealloc range already has underlying blocks there is no change and so we should not be changing mtime either. Only the filesystem will know if it has changed the file, so I think that timestamp updates need to be driven down to that level, not done blindy at the highest layer Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote: > This patch implements sys_fallocate() and adds support on i386, x86_64 > and powerpc platforms. Can you please pick up the ia64 support patch I posted as well? > Changelog: > - > Note: The changes below are from the initial post (dated 26th April, > 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel > version on which this patch is based. TAKE2 was based on 2.6.21 and this > is based on 2.6.22-rc1. > > Following changes were made to the previous version: > 1) Added description before sys_fallocate() definition. > 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, > posix_fallocate should return EINVAL for len <= 0. > 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE > 4) Do not return ENODEV for dirs (let individual file systems decide if > they want to support preallocation to directories or not. > 5) Check for wrap through zero. > 6) Update c/mtime if fallocate() succeeds. Please don't make this always happen. c/mtime updates should be dependent on the mode being used and whether there is visible change to the file. If no userspace visible changes to the file occurred, then timestamps should not be changed. e.g. FA_ALLOCATE that changes file size requires same semantics of ftruncate() extending the file, otherwise no change in timestamps should occur. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote: > On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote: > > All I'm really interested in right now is that the fallocate > > _interface_ can be used as a *complete replacement* for the > > pre-existing XFS-specific ioctls that are already used by > > applications. What ext4 can or can't do right now is irrelevant to > > this discussion - the interface definition needs to take priority > > over implementation > > Would you like to write up an interface definition description (likely > man page) and post it for review, possibly with a mention of apps using > it today ? Yeah, I started doing that yesterday as i figured it was the only way to cut the discussion short > One reason for introducing the mode parameter was to allow the interface to > evolve incrementally as more options / semantic questions are proposed, so > that we don't have to make all the decisions right now. > So it would be good to start with a *minimal* definition, even just one mode. > The rest could follow as subsequent patches, each being reviewed and debated > separately. Otherwise this discussion can drag on for a long time. Minimal definition to replace what applicaitons use on XFS and to support poasix_fallocate are the thre that have been mentioned so far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them all in a man page... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote: > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote: > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > > > I have the updated patches ready which take care of Andrew's comments. > > > Will run some tests and post them soon. > > > > > > But, before submitting these patches, I think it will be better to > > > finalize on certain things which might be worth some discussion here: > > > > > > 1) Should the file size change when preallocation is done beyond EOF ? > > > - Andreas and Chris Wedgwood are in favor of not changing the file size > > > in this case. I also tend to agree with them. Does anyone has an > > > argument in favor of changing the filesize ? If not, I will remove the > > > code which changes the filesize, before I resubmit the concerned ext4 > > > patch. > > > > I think there needs to be both. If we don't have a mechanism to atomically > > change the file size with the preallocation, then applications that use > > stat() to work out if they need to preallocate more space will end up > > racing. > > By "both" above, do you mean we should give user the flexibility if it wants > the filesize changed or not ? It can be done by having *two* modes for > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() > will change the filesize if required (i.e. when allocation is beyond EOF) > and also update [cm]time. This way, the application can decide what it > wants. Yes, that's right. > This will be helpfull for the partial allocation scenario also. Think of the > case when we do not change the filesize in fallocate() and expect > applications/posix_fallocate() to do ftruncate() after fallocate() for this. > Now if fallocate() results in a partial allocation with -ENOSPC error > returned, applications/posix_fallocate() will not know for what length > ftruncate() has to be called. :( Well, posix_fallocate() either gets all the space or it fails. If you truncate to extend the file size after an ENOSPC, then that is a buggy implementation. The same could be said for any application, or even the fallocate() call itself if it changes the filesize without having completely preallocated the space asked > Hence it may be a good idea to give user the flexibility if it wants to > atomically change the file size with preallocation or not. But, with more > flexibility there comes inconsistency in behavior, which is worth > considering. We've got different modes to specify different behaviour. That's what the mode field was put there for in the first place - the interface is *designed* to support different preallocation behaviours > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of > > > normal (non-preallocated) blocks (blocks allocated via regular > > > write/truncate operations) also (i.e. work as punch()) ? > > > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what > > i did for FA_UNALLOCATE as well. > > Ok. But, some people may not expect/like this. I think, we can keep it on > the backburner for a while, till other issues are sorted out. How can it be a "backburner" issue when it defines the implementation? I've already implemented some thing in XFS that sort of does what I think that the interface is supposed to do, but I need that interface to be nailed down before proceeding any further. All I'm really interested in right now is that the fallocate _interface_ can be used as a *complete replacement* for the pre-existing XFS-specific ioctls that are already used by applications. What ext4 can or can't do right now is irrelevant to this discussion - the interface definition needs to take priority over implementation Cheers, Dave, -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote: > I have the updated patches ready which take care of Andrew's comments. > Will run some tests and post them soon. > > But, before submitting these patches, I think it will be better to finalize > on certain things which might be worth some discussion here: > > 1) Should the file size change when preallocation is done beyond EOF ? >- Andreas and Chris Wedgwood are in favor of not changing the > file size in this case. I also tend to agree with them. Does anyone > has an argument in favor of changing the filesize ? > If not, I will remove the code which changes the filesize, before I > resubmit the concerned ext4 patch. I think there needs to be both. If we don't have a mechanism to atomically change the file size with the preallocation, then applications that use stat() to work out if they need to preallocate more space will end up racing. > 2) For FA_UNALLOCATE mode, should the file system allow unallocation >of normal (non-preallocated) blocks (blocks allocated via >regular write/truncate operations) also (i.e. work as punch()) ? Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what i did for FA_UNALLOCATE as well. >- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still > we need to finalize on the convention here as a general guideline > to all the filesystems that implement fallocate. > > 3) If above is true, the file size will need to be changed >for "unallocation" when block holding the EOF gets unallocated. No - we punch a hole. If you want the filesize to change, then you use ftruncate() to remove the blocks at EOF and change the file size atomically. > 4) Should we update mtime & ctime on a successfull allocation/ >unallocation ? >- David Chinner raised this question in following post: > http://lkml.org/lkml/2007/4/29/407 > I think it makes sense to update the [mc]time for a successfull > preallocation/unallocation. Does anyone feel otherwise ? > It will be interesting to know how XFS behaves currently. Does XFS > update [mc]time for preallocation ? No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size changes. If the filesize changes, it behaves exactly the same way that ftruncate() behaves. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.21-ext4-1
On Tue, May 08, 2007 at 03:05:56PM -0700, Mingming Cao wrote: > On Tue, 2007-05-08 at 12:50 +1000, David Chinner wrote: > > On Mon, May 07, 2007 at 01:56:23PM -0700, Mingming Cao wrote: > > > In any case, it would be useful to add a new set of testsuites for the > > > new fallocate() syscall and fsstress in LTP testsuites to automatically > > > the preallocation code in ext4/XFS. > > > > I hacked an existing XFS test prog to do manual testing of the fallocate() > > syscall. In the XFSQA suite we have various pre-alloc enhanced utils (e.g. > > fsstress, fsx, etc) that we should probably update to be able to use both > > fallocate and xfsctl so we can test both. > > > > Here's all the programs we use that have preallocation awareness: > > > > chook 982% grep RESVSP ltp/*.c | awk '/^ltp/ { split($1,a,":"); print > > a[1] ;}' | uniq > > ltp/doio.c > > ltp/fsstress.c > > ltp/fsx.c > > ltp/growfiles.c > > ltp/iogen.c > > chook 983% grep RESVSP src/* | awk '/^src/ { split($1,a,":"); print a[1] > > ;}' | uniq > > src/alloc.c > > src/fstest.c > > src/iopat.c > > src/randholes.c > > src/resvtest.c > > src/unwritten_mmap.c > > src/unwritten_sync.c > > > Thanks for sharing the info. I think Amit used a fsx version with > preallocation awareness test, but I don't know if we have other ltp > tests that aware preallocation. I would very appreciate if you could > share your preallocation test with us. All the tests are in CVS: http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/ I took this one: http://oss.sgi.com/cgi-bin/cvsweb.cgi/~checkout~/xfs-cmds/xfstests/src/alloc.c?rev=1.9;content-type=text%2Fplain and modified it. Patch is below - it's full of conditional XFS functionality still - you shoul dbe able to get it to run easily on ext4, though. > > BTW, have you guys tested mmap writes into unwritten extents? ;) > > > I am not sure, Amit, have you done some mmap write test into > uninitialized extents? > > Sorry, I still not quite clear what's the mapped problem you are worry > about. Could you explain to me a bit more? thanks! XFS needs a ->page_mkwrite() callout to correctly map pages that have been dirtied by mmap that span unwritten extents. mmap reads (i.e. when the fault first occurred) treat unwritten extents like holes and so we need to remap them when they are dirtied to set all the unwritten state in the bufferheads correctly for writeback. See test 166 here: http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/ http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfstests/src/unwritten_mmap.c The same behaviour is needed for delalloc extents to prevent ENOSPC errors on writeback - the mmap write needs to do the freespace accounting at the time the page is dirtied and that can only be done through the ->page_mkwrite callout. Otherwise ENOSPC will occur in the writeback path and that is a major pain This may not be a problem for ext4, but I thought I better point out a couple of the more subtle problems mmap can introduce Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --- xfstests/src/Makefile |7 xfstests/src/falloc.c | 376 ++ 2 files changed, 381 insertions(+), 2 deletions(-) Index: xfs-cmds/xfstests/src/falloc.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ xfs-cmds/xfstests/src/falloc.c 2007-04-30 12:41:13.862302450 +1000 @@ -0,0 +1,376 @@ +/* + * Copyright (c) 2000-2003,2007 Silicon Graphics, Inc. + * All Rights Reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "global.h" + +/* should end up include somewhere */ +#define FA_ALLOCATE0x1 +#define FA_DEALLOCATE 0x2 +#define FA_PREALLOCATE 0x3 + +/* ia64 */ +#define __NR_fallocate 1305 +/* + * Block I/O parameterization. A basic block (BB) is the lowest size of + * filesystem allocation, and must equal 512. Length units given to bio + * routines are in BB's. + */ + +/* Assume that if we have BTOBB, then we have the re
Re: 2.6.21-ext4-1
On Mon, May 07, 2007 at 01:56:23PM -0700, Mingming Cao wrote: > In any case, it would be useful to add a new set of testsuites for the > new fallocate() syscall and fsstress in LTP testsuites to automatically > the preallocation code in ext4/XFS. I hacked an existing XFS test prog to do manual testing of the fallocate() syscall. In the XFSQA suite we have various pre-alloc enhanced utils (e.g. fsstress, fsx, etc) that we should probably update to be able to use both fallocate and xfsctl so we can test both. Here's all the programs we use that have preallocation awareness: chook 982% grep RESVSP ltp/*.c | awk '/^ltp/ { split($1,a,":"); print a[1] ;}' | uniq ltp/doio.c ltp/fsstress.c ltp/fsx.c ltp/growfiles.c ltp/iogen.c chook 983% grep RESVSP src/* | awk '/^src/ { split($1,a,":"); print a[1] ;}' | uniq src/alloc.c src/fstest.c src/iopat.c src/randholes.c src/resvtest.c src/unwritten_mmap.c src/unwritten_sync.c BTW, have you guys tested mmap writes into unwritten extents? ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote: > On Fri, 4 May 2007 16:07:31 +1000 David Chinner <[EMAIL PROTECTED]> wrote: > > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> > > > wrote: > > > > > > > This patch implements the fallocate() system call and adds support for > > > > i386, x86_64 and powerpc. > > > > > > > > ... > > > > +{ > > > > + struct file *file; > > > > + struct inode *inode; > > > > + long ret = -EINVAL; > > > > + > > > > + if (len == 0 || offset < 0) > > > > + goto out; > > > > > > The posix spec implies that negative `len' is permitted - presumably > > > "allocate > > > ahead of `offset'". How peculiar. > > > > I just checked the man page for posix_fallocate() and it says: > > > > EINVAL offset or len was less than zero. > > > > We should probably follow this lead. > > Yes, I think so. I'm suspecting that > http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html > is just buggy. Or I can't read. > > I mean, if we're going to support negative `len' then is the byte at > `offset' inside or outside the segment? Head spins. I don't think we should care. If we provide a syscall with the semantics of "allocate from offset to offset+len" then glibc's implementation can turn negative length into two separate fallocate syscalls > > > > + ret = -ENODEV; > > > > + if (!S_ISREG(inode->i_mode)) > > > > + goto out_fput; > > > > > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > > > seems a bit silly of them. > > > > H - I thought that the intention of sys_fallocate() was to > > be generic enough to eventually allow preallocation on directories. > > If that is the case, then this check will prevent that > > The above opengroup page only permits S_ISREG. Preallocating directories > sounds quite useful to me, although it's something which would be pretty > hard to emulate if the FS doesn't support it. And there's a decent case to > be made for emulating it - run-anywhere reasons. Does glibc emulation support > directories? Quite unlikely. > > But yes, sounds like a desirable thing. Would XFS support it easily if the > above > check was relaxed? No - right now empty blocks are pruned from the directory immediately so I don't think we really have a concept of empty blocks in the btree structure. dir2 is bloody complex, so adding preallocation is probably not going to be simple to do. It's not high on my list to add, either, because we can typically avoid the worst case directory fragmentation by using larger directory block sizes (e.g. 16k instead of the default 4k on a 4k block size fs). IIRC directory preallocation has been talked about more for ext3/4 Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote: > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote: > > > This patch implements the fallocate() system call and adds support for > > i386, x86_64 and powerpc. > > > > ... > > +{ > > + struct file *file; > > + struct inode *inode; > > + long ret = -EINVAL; > > + > > + if (len == 0 || offset < 0) > > + goto out; > > The posix spec implies that negative `len' is permitted - presumably "allocate > ahead of `offset'". How peculiar. I just checked the man page for posix_fallocate() and it says: EINVAL offset or len was less than zero. We should probably follow this lead. > > + > > + ret = -ENODEV; > > + if (!S_ISREG(inode->i_mode)) > > + goto out_fput; > > So we return ENODEV against an S_ISBLK fd, as per the posix spec. That > seems a bit silly of them. H - I thought that the intention of sys_fallocate() was to be generic enough to eventually allow preallocation on directories. If that is the case, then this check will prevent that Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote: > On 2 May 2007, at 10:15, David Chinner wrote: > >On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: > >>And all applications will run against a multitude of > >>kernels. So version X of the application will run on kernel 2.4.*, > >>2.6.*, a.b.*, etc... For future expandability of the interface I > >>think it is important to have both compulsory and non-compulsory > >>flags. > > > >Ah, so that's what you want - a mutable interface. i.e. versioning. > > > >So how does compusory flags help here? What happens if a voluntary > >flag now becomes compulsory? Or vice versa? How is the application > >supposed to deal with this dynamically? > > > >I suggested a version number for this right back at the start of > >this discussion and got told that we don't want versioned interfaces > >because we should make the effort to get it right the first time. > >I don't think this can be called "getting it right". > > Look at ext2/3/4. They do it that way and it works well. No > versioning just compatible and incompatible flags... The proposal is > to do the same here. Just because it works for extN doesn't make it right for this interface. > >>For example there is no reason why FIEMAP_HSM_READ needs to be > >>compulsory. Most filesystems do not support HSM so can safely ignore > >>it. > > > >They might be able to safely ignore it, but in reality it should > >be saying "I don't understand this". If the application *needs* to > >use a flag like this, then it should be told that the filesystem is > >not capable of doing what it was asked! > > That is where you are completely wrong! (-: Or rather you are wrong > for my example, i.e. you are wrong/right depending on the type of > flag in question. And that is the crux of the argument. My point is that *any* flag returns an error if the filesystem does not support it. > HSM_READ is definitely _NOT_ required because all > it means is "if the file is OFFLINE, bring it ONLINE and then return > the extent map". You've got the definition of HSM_READ wrong. If the flag is *not* set, then we bring everything back online and return the full extent map. Specifying the flag indicates that we do *not* want the offline extents brought back online. i.e. it is a HSM or a datamover (e.g. backup program) that is querying the extents and we want to known *exactly* what the current state of the file is right now. So, if the HSM_READ flag is set, then the application is expecting the filesytem to be part of a HSM. Hence if it's not, it should return an error because somebody has done something wrong. > >OTOH if the application does not need to use the flag, then it > >shouldn't be using it and we shouldn't be silently ignoring > >incorrect usage of the provided API. > > > >What you are effectively saying about these "voluntary" flags > >is that their behaviour is _undefined_. That is, if you use > >these flags what you get on a successful call is undefined; > >it may or may not contain what you asked for but you can't > >tell if it really did what you want or returned the information > >you asked for. > > > >This is a really bad semantic to encode into an API. > > That is your opinion. There is nothing undefined in the API at all. > You just fail to understand it... FIEMAP returned success. Did it do what I asked? I don't know because it's allowed to return success when it did ignored me. This is as silly an interface definition as saying you can implement fsync() with { return 0; }. So, when fsync() succeeded did it write my data to disk? I don't know; it's allowed to return success when it ignored me. It's crazy, isn't it? It makes writing applications portable across operating systems a real PITA (ask the MySQL folk ;) because POSIX really does allow fsync() to be implemented like this. I use this example because the "allow some filesystems to silently ignore flags they don't understand" is a portability problem for applications - rather than a cross-OS issue it is a cross-filesystem issue. That is, if different filesystems behave differently to the same request they will have to be handled specifically by the application. Every filesystem should behave in *exactly* the same way to the FIEMAP ioctls - if they don't support something they throw an error, if they do then they return the correct data. > >>And vice versa, an application might specify some weird and funky yet > >>to be developed feature that it expects the FS to p
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Wed, May 02, 2007 at 09:23:38AM +0100, Anton Altaparmakov wrote: > On a different issue, do you think it would be worth adding an option > flags like FIEMAP_DONT_RELOCATE or something similar that would be a > compulsory flag and if set the FS is not allowed to move the file > around/change the block allocation of the file. We already have an inode flag in XFS to say this - the defrag tool checks it and ignores the file if it is set. > Or alternatively a flag like FIEMAP_MAKE_DIRECT or something to tell > the FS we want to access the actual raw blocks so the FS can make > sure the data is on block aligned boundaries and if the FS does not > support this (e.g. ZFS or a compressed or encrypted NTFS file) then > it can return -ENOTSUP. > > Perhaps this is totally the wrong interface and such a "prepare file > for direct access" API should be a different ioctl() or syscall or > whatever. It just seems very simple and appropriate to combine it > here as people who use FIEMAP are at least sometimes going to be > wanting to access those blocks directly as well and it feels right to > be able to communicate this to the FS in the same call, kind of like > an "open intent" of "I want to use the data directly on disk"... I think this is wrong interface for this. Sure, use it to get the mappings (that's what it's for) but what you do with the mappings after that is not part of FIEMAP Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Tue, May 01, 2007 at 07:46:53PM +0100, Anton Altaparmakov wrote: > On 1 May 2007, at 15:20, David Chinner wrote: > >> > >>So, either the filesystem will understand the flag or iff the > >>unknown flag > >>is in the incompat set, it will return EINVAL or else the unknown > >>flag will > >>be safely ignored. > > > >My point was that there is a difference between specification and > >implementation - if the specification says something is compulsory, > >then they must be implemented in the filesystem. This is easy > >enough to ensure by code review - we don't need additional interface > >complexity for this > > You are wrong about this because you are missing the point that you > have no code to review. The users that will use those flags are > going to be applications that run in user space. Chances are you > will never see their code. Heck, they might not even be open source > applications... Ummm - the specification defines what is compulsory for *filesystems* to implement, not what applications can use. We don't need to see what the applications do - what we care about is that all filesystems implement the compulsory part of the specification. That's the code we review, and that's what I was referring to. > And all applications will run against a multitude of > kernels. So version X of the application will run on kernel 2.4.*, > 2.6.*, a.b.*, etc... For future expandability of the interface I > think it is important to have both compulsory and non-compulsory flags. Ah, so that's what you want - a mutable interface. i.e. versioning. So how does compusory flags help here? What happens if a voluntary flag now becomes compulsory? Or vice versa? How is the application supposed to deal with this dynamically? I suggested a version number for this right back at the start of this discussion and got told that we don't want versioned interfaces because we should make the effort to get it right the first time. I don't think this can be called "getting it right". > For example there is no reason why FIEMAP_HSM_READ needs to be > compulsory. Most filesystems do not support HSM so can safely ignore > it. They might be able to safely ignore it, but in reality it should be saying "I don't understand this". If the application *needs* to use a flag like this, then it should be told that the filesystem is not capable of doing what it was asked! OTOH if the application does not need to use the flag, then it shouldn't be using it and we shouldn't be silently ignoring incorrect usage of the provided API. What you are effectively saying about these "voluntary" flags is that their behaviour is _undefined_. That is, if you use these flags what you get on a successful call is undefined; it may or may not contain what you asked for but you can't tell if it really did what you want or returned the information you asked for. This is a really bad semantic to encode into an API. > And vice versa, an application might specify some weird and funky yet > to be developed feature that it expects the FS to perform and if the > FS cannot do it (either because it does not support it or because it > failed to perform the operation) the application expects the FS to > return an error and not to ignore the flag. An example could be the > asked for FIEMAP_XATTR_FORK flag. If that is implemented, and the FS > ignores it it will return the extent map for the file data instead of > the XATTR_FORK! Not what the application wanted at all. Ouch! So > this is definitely a compulsory flag if I ever saw one. Yes, the correct answer is -EOPNOTSUPP or -EINVAL in this case. But we don't need a flag defined in the user visible API to tell us that we need to return an error here. > So as you see you must support both voluntary and compulsory flags... No, you've managed to convince me that they are not necessary and they are in fact a Bad Idea... ;) > Also consider what I said above about different kernels. A new > feature is implemented in kernel 2.8.13 say that was not there before > and an application is updated to use that feature. There will be > lots of instances where that application will still be run on older > kernels where this feature does not exist. This is *exactly* where silently ignoring flags really falls down. On 2.8.13, the flag is silently ignored. On 2.8.14, the flag does something and it returns different structure contents for the same state. Now how does the application writer know which is correct or how to tell the difference? They have to guess or write detection code which is exactly what we want to avoid. I objected to the UNKNOWN flag because it wasn't explicit in it'
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Tue, May 01, 2007 at 03:30:40PM -0700, Andreas Dilger wrote: > On May 01, 2007 14:22 +1000, David Chinner wrote: > > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't > > > > I disagree - why would you want to indicate the state is unknown when we > > know > > very well that it is offline? > > If you don't like "UNKNOWN", what about "UNMAPPED"? I just want a > catch-all flag that indicates "this extent contains data but there is > nothing sensible to be returned for the extent mapping." Yes, I like that much more. Good suggestion. ;) > > Effectively, when your extent is offline in the HSM, it is inaccessable, and > > you have to bring it back from tape so it becomes accessible again. i.e. > > some > > action is necessary on behalf of the user to make it accessible. So I think > > that OFFLINE is a good name for this state because it really is > > inaccessible. > > What you are calling OFFLINE I would prefer to call UNMAPPED, since that > can be used by applications as a catch-all for "no mapping". There can > be further flags that give refinements to UNMAPPED that some applications > might care about them (e.g. HSM_RESIDENT), but many users/apps will not > if they just want the number of fragments in a given file. Agreed - UNMAPPED does make a lot more sense in this case. > > > Can you propose reasonable flag names for these (I can't think of anything > > > very good) and a clear explanation of what they mean. I suspect it will > > > only be XFS that uses them initially. In mke2fs and ext4+mballoc there is > > > the concept of stripe unit and stripe width, but as yet they are not > > > communicated between the two very well. I'd be much happier if this info > > > could be queried in a standard way from the block layer instead of the > > > user having to specify it and the filesystem having to track it. > > > > My preference is definitely for a separate ioctl to grab the > > filesystem geometry so this stuff can be calculated in userspace. > > i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't > > bother trying to define names until we decide which appraoch we take > > to implement this. > > Hmm, previously you wrote "This information could be easily passed up in the > flags fields if the filesystem has geometry information". So, I _think_ > what you are saying is that you want 4 flags to convey this start/end > alignment information, but the exact semantics of what a "stripe unit" and > a "stripe width" is filesystem specific? Right. > I definitely do NOT want to get into any issues of querying the block > device geometry here. I was just making a passing comment that ext4+mballoc > can already do RAID-specific allocation alignment, but it depends on the > admin to specify this information and it would be nice if there was some > easy way to get this from userspace/kernel interfaces. > > Having an API that can request "tell me the number of blocks from this > offset until the next physical disk boundary" or similar would be useful > to any allocator, and the block layer already needs to know this when > submitting IO. The block layer knows this once you get inside the volume manager. I think the issue is that there is no common export interface for this information. > > In XFS, mkfs.xfs does the work of getting this information > > to see in the filesystem superblock. Here's the code for getting > > sunit/swidth from the underlying block device: > > > > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/ > > > > Not much in common there ;) > > It looks like this might be just what e2fsprogs needs also. More than likely. > > > It does make sense to specify zero for the fm_extent_count array and a > > > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the > > > extent data itself, for the non-verbose mode of filefrag, and for > > > pre-allocating a buffer large enough to hold the file if that is > > > important. > > > > Rather than rely on implicit behaviour of "pass in extent count of > > zero and a don't try to return any extents" to return the number of > > extents on the file, why not just explicitly define this as a valid > > input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS > > That's what I said, isn't it? FIEMAP_FLAG_NO_EXTENTS. I wonder if my > clever-clever for "return
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote: > On 1 May 2007, at 05:22, David Chinner wrote: > >On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > >> The FIBMAP ioctl is for privileged users > >> only, and I wonder if FIEMAP should be the same, or at least > >>disallow > >> mapping files that the user can't access especially with > >>FLAG_SYNC and/or > >> FLAG_HSM_READ. > > > >I see little reason for restricting FI[BE]MAP to privileged users - > >anyone should be able to determine if files they have permission to > >access are fragmented. > > Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the > machine. Perhaps for non-privileged users FIEMAP has to be read- > only? As soon as any of the FLAG_* flags come into play you make it > privileged. For example fancy any user being able to fill up your > file system by calling FIEMAP with FLAG_HSM_READ on all files > recursively? By that reasoning, users should not be allowed to recall any files without root privileges. HSMs don't work that way, though - any user is allowed to recall any files they have permission to access either by manual command or by trying to read the file daata. If that runs the filesytem out of space, then the HSM either hasn't been configured properly or it's failed to manage the space correctly. Either way, that's not the fault of the user for recalling their own files. Hence allowing FIEMAP to be executed by the user does not open up any DOS conditions that don't already exist in normal HSM-managed filesystem. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote: > On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote: > > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > > > This is actually for future use. Any flags that are added into this > > > range must be understood by both sides or it should be considered an > > > error. Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need > > > to be supported. If it turns out that 8 bits is too small a range for > > > INCOMPAT flags, then we can make 0x0100 an incompat flag that means > > > e.g. 0x00ff are also incompat flags also. > > > > Ah, ok. So it's not really a set of "compatibility" flags, it's more a > > "compulsory" set. Under those terms, i don't really see why this is > > necessary - either the filesystem will understand the flags or it will > > return EINVAL or ignore them... > > > > > I'm assuming that all flags that will be in the original FIEMAP proposal > > > will be understood by the implementations. Most filesystems can safely > > > ignore FLAG_HSM_READ, for example, since they don't support HSM, and for > > > that matter FLAG_SYNC is probably moot for most filesystems also because > > > they do block allocation at preprw time. > > > > Exactly my point - so why do we really need to encode a compulsory set of > > > > Because flags have meaning, independent of whether or not the filesystem > understands them. And if the filesystem chooses to ignore critically > important flags (instead of returning EINVAL), bad things may happen. > > So, either the filesystem will understand the flag or iff the unknown flag > is in the incompat set, it will return EINVAL or else the unknown flag will > be safely ignored. My point was that there is a difference between specification and implementation - if the specification says something is compulsory, then they must be implemented in the filesystem. This is easy enough to ensure by code review - we don't need additional interface complexity for this Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote: > On Apr 19, 2007 11:54 +1000, David Chinner wrote: > > > struct fiemap { > > > __u64 fm_start; /* logical start offset of mapping (in/out) */ > > > __u64 fm_len; /* logical length of mapping (in/out) */ > > > __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ > > > __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ > > > __u64 fm_unused; > > > struct fiemap_extent fm_extents[0]; > > > } > > > > > > /* flags for the fiemap request */ > > > #define FIEMAP_FLAG_SYNC 0x0001 /* flush delalloc data to disk*/ > > > #define FIEMAP_FLAG_HSM_READ 0x0002 /* retrieve data from > > > HSM */ > > > #define FIEMAP_FLAG_INCOMPAT0xff00/* must understand > > > these flags*/ > > > > No flags in the INCOMPAT range - shouldn't it be 0x3 at this point? > > This is actually for future use. Any flags that are added into this range > must be understood by both sides or it should be considered an error. Flags > outside the FIEMAP_FLAG_INCOMPAT do not necessarily need to be supported. > If it turns out that 8 bits is too small a range for INCOMPAT flags, then > we can make 0x0100 an incompat flag that means e.g. 0x00ff are also > incompat flags also. Ah, ok. So it's not really a set of "compatibility" flags, it's more a "compulsory" set. Under those terms, i don't really see why this is necessary - either the filesystem will understand the flags or it will return EINVAL or ignore them... > I'm assuming that all flags that will be in the original FIEMAP proposal > will be understood by the implementations. Most filesystems can safely > ignore FLAG_HSM_READ, for example, since they don't support HSM, and for > that matter FLAG_SYNC is probably moot for most filesystems also because > they do block allocation at preprw time. Exactly my point - so why do we really need to encode a compulsory set of flags in the API? > > SO, there's a HSM_READ flag above. If we are going to make this interface > > useful for filesystems that have HSMs interacting with their extents, the > > HSM needs to be able to query whether the extent is online (on disk), > > has been migrated offline (on tape) or in dual-state (i.e. both online and > > offline). > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't I disagree - why would you want to indicate the state is unknown when we know very well that it is offline? > consider files that are both on disk and on secondary storage (which is > no longer just tape anymore). I thought I'd call this FIEMAP_EXTENT_OFFLINE, > but that has a confusing connotation that the extent is inaccessible, instead > of just saying it is also on offline storage. What about > FIEMAP_EXTENT_SECONDARY? Other proposals welcome. Effectively, when your extent is offline in the HSM, it is inaccessable, and you have to bring it back from tape so it becomes accessible again. i.e. some action is necessary on behalf of the user to make it accessible. So I think that OFFLINE is a good name for this state because it really is inaccessible. Also, I don't think "secondary" is a good term because most large systems have more than one tier of storage. One possibility is "HSM_RESIDENT" which indicates the extent is current and resident with a HSM's archive > FIEMAP_EXTENT_SECONDARY could still be set even if the file is mapped. > That would mean an offline-only file would be EXTENT_SECONDARY|EXTENT_UNKNOWN, > while a dual-location file would be EXTENT_SECONDARY only. I much prefer OFFLINE|HSM_RESIDENT and HSM_RESIDENT as it is far more descriptive as to what the state is (which certainly isn't unknown). > > > SUMMARY OF CHANGES > > > == > > > - add separate fe_flags word with flags from various suggestions: > > > - FIEMAP_EXTENT_HOLE = extent has no space allocation > > > - FIEMAP_EXTENT_UNWRITTEN = extent space allocation but contains no data > > > - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown > > > (e.g. HSM, delalloc awaiting sync, etc) > > > > I'd like an explicit delalloc flag, not lumping it in with "unknown". > > we *know* the extent is delalloc ;) > > Sure, FIEMAP_EXTENT_DELALLOC is fine. It is mostly redundant with > EXTENT_UNKNOWN (and I think it only makes sense if DELALLOC is given in > addition to UNKNOWN). I disagree that it is redundant - in case you hadn't already noticed I dislik
Re: [PATCH 0/5] fallocate system call
On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote: > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote: > > > For FA_ALLOCATE, it's supposed to change the file size if we > > allocate past EOF, right? > > I would argue no. Use truncate for that. I'm going from the ext4 implementation because the semantics have not been documented yet. IIRC, the argument for FA_ALLOCATE changing file size is that posix_fallocate() is supposed to change the file size. I think that having a mode for real preallocation and another for posix_fallocate is a valid thing to do... Note that the way XFS implements growing the file size after the allocation is via a truncate > > For FA_DEALLOCATE, does it change the filesize at all? > > Same as above. > > > Or does > > it just punch a hole in the file? > > Yes. That's would what I did because otherwise you'd use ftruncate64(). Without documented behaviour or an ext4 implementation, I have to ask what it's supposed to do, though ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Add preallocation beyond EOF to fallocate
Add new mode to ->fallocate() to allow allocation to occur beyond the current EOF without changing the file size. Implement in XFS ->fallocate() vector. Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]> --- fs/xfs/linux-2.6/xfs_iops.c |8 +--- include/linux/fs.h |1 + 2 files changed, 6 insertions(+), 3 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c === --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c 2007-04-30 11:02:16.0 +1000 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c 2007-04-30 11:09:48.233375382 +1000 @@ -833,11 +833,13 @@ xfs_vn_fallocate( VNODE_POSITION_XFS); switch (mode) { - case FA_ALLOCATE: /* changes file size */ - error = xfs_change_file_space(bdp, XFS_IOC_RESVSP, - &bf, 0, NULL, 0); + case FA_ALLOCATE:/* changes file size */ if (offset + len > i_size_read(inode)) do_setattr = offset + len; + /* FALL THROUGH */ + case FA_PREALLOCATE:/* no filesize change */ + error = xfs_change_file_space(bdp, XFS_IOC_RESVSP, + &bf, 0, NULL, 0); break; case FA_DEALLOCATE: /* XXX: changes file size? this just punches a hole */ Index: 2.6.x-xfs-new/include/linux/fs.h === --- 2.6.x-xfs-new.orig/include/linux/fs.h 2007-04-27 18:48:01.0 +1000 +++ 2.6.x-xfs-new/include/linux/fs.h2007-04-30 11:08:05.790903661 +1000 @@ -269,6 +269,7 @@ extern int dir_notify_enable; */ #define FA_ALLOCATE0x1 #define FA_DEALLOCATE 0x2 +#define FA_PREALLOCATE 0x3 #ifdef __KERNEL__ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] XFS ->fallocate() support
Add XFS support for ->fallocate() vector. Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]> --- fs/xfs/linux-2.6/xfs_iops.c | 48 1 file changed, 48 insertions(+) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c === --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c 2007-02-07 13:24:32.0 +1100 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c 2007-04-30 11:02:16.225095992 +1000 @@ -812,6 +812,53 @@ xfs_vn_removexattr( return namesp->attr_remove(vp, attr, xflags); } +STATIC long +xfs_vn_fallocate( + struct inode*inode, + int mode, + loff_t offset, + loff_t len) +{ + longerror = -EOPNOTSUPP; + bhv_vnode_t *vp = vn_from_inode(inode); + bhv_desc_t *bdp; + int do_setattr = 0; + xfs_flock64_t bf; + + bf.l_whence = 0; + bf.l_start = offset; + bf.l_len = len; + + bdp = bhv_lookup_range(VN_BHV_HEAD(vp), VNODE_POSITION_XFS, + VNODE_POSITION_XFS); + + switch (mode) { + case FA_ALLOCATE: /* changes file size */ + error = xfs_change_file_space(bdp, XFS_IOC_RESVSP, + &bf, 0, NULL, 0); + if (offset + len > i_size_read(inode)) + do_setattr = offset + len; + break; + case FA_DEALLOCATE: + /* XXX: changes file size? this just punches a hole */ + error = xfs_change_file_space(bdp, XFS_IOC_UNRESVSP, + &bf, 0, NULL, 0); + break; + default: + break; + } + + /* Change file size if needed */ + if (!error && do_setattr) { + bhv_vattr_t va; + + va.va_mask = XFS_AT_SIZE; + va.va_size = do_setattr; + error = bhv_vop_setattr(vp, &va, 0, NULL); + } + + return error; +} struct inode_operations xfs_inode_operations = { .permission = xfs_vn_permission, @@ -822,6 +869,7 @@ struct inode_operations xfs_inode_operat .getxattr = xfs_vn_getxattr, .listxattr = xfs_vn_listxattr, .removexattr= xfs_vn_removexattr, + .fallocate = xfs_vn_fallocate, }; struct inode_operations xfs_dir_inode_operations = { - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ia64 fallocate syscall
ia64 fallocate syscall support. Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]> --- arch/ia64/kernel/entry.S |1 + include/asm-ia64/unistd.h |3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) Index: 2.6.x-xfs-new/arch/ia64/kernel/entry.S === --- 2.6.x-xfs-new.orig/arch/ia64/kernel/entry.S 2007-03-29 19:01:41.0 +1000 +++ 2.6.x-xfs-new/arch/ia64/kernel/entry.S 2007-04-27 19:12:56.829396661 +1000 @@ -1612,5 +1612,6 @@ sys_call_table: data8 sys_vmsplice data8 sys_ni_syscall// reserved for move_pages data8 sys_getcpu + data8 sys_fallocate .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls Index: 2.6.x-xfs-new/include/asm-ia64/unistd.h === --- 2.6.x-xfs-new.orig/include/asm-ia64/unistd.h2007-03-29 19:03:37.0 +1000 +++ 2.6.x-xfs-new/include/asm-ia64/unistd.h 2007-04-27 19:18:18.215568425 +1000 @@ -293,11 +293,12 @@ #define __NR_vmsplice 1302 /* 1303 reserved for move_pages */ #define __NR_getcpu1304 +#define __NR_fallocate 1305 #ifdef __KERNEL__ -#define NR_syscalls281 /* length of syscall table */ +#define NR_syscalls282 /* length of syscall table */ #define __ARCH_WANT_SYS_RT_SIGACTION - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] fallocate system call
On Thu, Apr 26, 2007 at 11:20:56PM +0530, Amit K. Arora wrote: > Based on the discussion, this new patchset uses following as the > interface for fallocate() system call: > > asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) Ok, so now for the hard questions - what are the semantics of FA_ALLOCATE and FA_DEALLOCATE? For FA_ALLOCATE, it's supposed to change the file size if we allocate past EOF, right? What's the return value supposed to be? Zero for success, error otherwise? Does this update a/m/ctime at all? How persistent is this preallocation? Should it be there "forever" or for the lifetime of the currently open fd that it was preallocated on? For FA_DEALLOCATE, does it change the filesize at all? Or does it just punch a hole in the file? If it does change file size, what happens when you punch out preallocation beyond EOF? What's the return value supposed to be? > Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for > preallocation and deallocation of preallocated blocks respectively. More > modes can be added, when required. FWIW, we definitely need a FA_PREALLOCATE mode (FA_ALLOCATE but does not change file size) so we can preallocate beyond EOF for apps which use O_APPEND (i.e. changing file size would cause problems for them). > ToDos: > = > 1> Implementation on other architectures (other than i386, x86_64, > ppc64 and s390(x)) I'll have ia64 soon. > 2> A generic file system operation to handle fallocate > (generic_fallocate), for filesystems that do _not_ have the fallocate > inode operation implemented. > 3> Changes to glibc, > a) to support fallocate() system call > b) so that posix_fallocate() and posix_fallocate64() call > fallocate() system call > 4> Changes to XFS to implement the fallocate inode operation And that's what I'm doing now, hence all the questions ;) BTW, do you have a test program for this, or will I need to write one myself? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Wed, Apr 18, 2007 at 06:21:39PM -0600, Andreas Dilger wrote: > On Apr 16, 2007 21:22 +1000, David Chinner wrote: > > On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: > > > struct fiemap_extent { > > > __u64 fe_start; /* starting offset in bytes */ > > > __u64 fe_len; /* length in bytes */ > > > } > > > > > > struct fiemap { > > > struct fiemap_extent fm_start; /* offset, length of desired mapping */ > > > __u32 fm_extent_count; /* number of extents in array */ > > > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > > > __u64 unused; > > > struct fiemap_extent fm_extents[0]; > > > } > > > > > > #define FIEMAP_LEN_MASK 0xff > > > #define FIEMAP_LEN_HOLE 0x01 > > > #define FIEMAP_LEN_UNWRITTEN 0x02 > > > > I'm not sure I like stealing bits from the length to use a flags - > > I'd prefer an explicit field per fiemap_extent for this. > > Christoph expressed the same concern. I'm not dead set against having an > extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may > mean the need for 50% more ioctls if the file is large. I don't think this overhead is a huge problem - just pass in a larger buffer (e.g. xfs_bmap can ask for thousands of extents in a single ioctl call as we can extract the number of extents in an inode via XFS_IOC_FSGETXATTRA). > Below is an aggregation of the comments in this thread: > > struct fiemap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > __u32 fe_lun; /* logical storage device number in array */ > } Oh, I missed the bit about the fe_lun - I was thinking something like that might be useful in future > struct fiemap { > __u64 fm_start; /* logical start offset of mapping (in/out) */ > __u64 fm_len; /* logical length of mapping (in/out) */ > __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ > __u32 fm_extent_count; /* number of extents in fm_extents (in/out) */ > __u64 fm_unused; > struct fiemap_extent fm_extents[0]; > } > > /* flags for the fiemap request */ > #define FIEMAP_FLAG_SYNC 0x0001 /* flush delalloc data to disk*/ > #define FIEMAP_FLAG_HSM_READ 0x0002 /* retrieve data from HSM */ > #define FIEMAP_FLAG_INCOMPAT0xff00/* must understand these flags*/ No flags in the INCOMPAT range - shouldn't it be 0x3 at this point? > /* flags for the returned extents */ > #define FIEMAP_EXTENT_HOLE0x0001 /* no space allocated */ > #define FIEMAP_EXTENT_UNWRITTEN 0x0002 /* uninitialized space > */ > #define FIEMAP_EXTENT_UNKNOWN 0x0004 /* in use, location unknown */ > #define FIEMAP_EXTENT_ERROR 0x0008 /* error mapping space */ > #define FIEMAP_EXTENT_NO_DIRECT 0x0010 /* no direct data > access */ SO, there's a HSM_READ flag above. If we are going to make this interface useful for filesystems that have HSMs interacting with their extents, the HSM needs to be able to query whether the extent is online (on disk), has been migrated offline (on tape) or in dual-state (i.e. both online and offline). > SUMMARY OF CHANGES > == > - use fm_* fields directly in request instead of making it a fiemap_extent > (though they are layed out identically) > > - separate flags word for fm_flags: > - FIEMAP_FLAG_SYNC = range should be synced to disk before returning > mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise > - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified > (this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag) > - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future > if there is agreement on whether that is desirable to have or if it is > better to call ioctl(FIEMAP) on an XATTR fd. > - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel > must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we > don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it > > - __u64 fm_unused does not take up an extra space on all power-of-two buffer > sizes (would otherwise be at end of buffer), and may be handy in the future. > > - add separate fe_flags word with flags from various suggestions: > - FIEMAP_EXTENT_HOLE = extent has no space alloca
Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote: > I'm interested in getting input for implementing an ioctl to efficiently > map file extents & holes (FIEMAP) instead of looping over FIBMAP a billion > times. We already have customers with single files in the 10TB range and > we additionally need to get the mapping over the network so it needs to > be efficient in terms of how data is passed, and how easily it can be > extracted from the filesystem. > > I had come up with a plan independently and was also steered toward > XFS_IOC_GETBMAP* ioctls which are in fact very similar to my original > plan, though I think the XFS structs used there are a bit bloated. Yeah, they were designed with having a long term stable ABI that limited expandability. Hence the "future" fields that never got used ;) > There was also recent discussion about SEEK_HOLE and SEEK_DATA as > implemented by Sun, but even if we could skip the holes we still might > need to do millions of FIBMAPs to see how large files are allocated > on disk. Conversely, having filesystems implement an efficient FIBMAP > ioctl (or ->fiemap() method) could in turn be leveraged for SEEK_HOLE > and SEEK_DATA instead of doing looping over ->bmap() inside the kernel > as I saw one patch. Yup. > struct fibmap_extent { > __u64 fe_start; /* starting offset in bytes */ > __u64 fe_len; /* length in bytes */ > } > > struct fibmap { > struct fibmap_extent fm_start; /* offset, length of desired mapping */ > __u32 fm_extent_count; /* number of extents in array */ > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */ > __u64 unused; > struct fibmap_extent fm_extents[0]; > } > > #define FIEMAP_LEN_MASK 0xff > #define FIEMAP_LEN_HOLE 0x01 > #define FIEMAP_LEN_UNWRITTEN 0x02 I'm not sure I like stealing bits from the length to use a flags - I'd prefer an explicit field per fibmap_extent for this. Given that xfs_bmap uses extra information from the filesystem (geometry) to display extra (and frequently used) information about the alignment of extents. ie: chook 681% xfs_bmap -vv fred fred: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..151]:288444888..288445039 8 (1696536..1696687) 152 00010 FLAG Values: 01 Unwritten preallocated extent 001000 Doesn't begin on stripe unit 000100 Doesn't end on stripe unit 10 Doesn't begin on stripe width 01 Doesn't end on stripe width This information could be easily passed up in the flags fields if the filesystem has geometry information (there go 4 more flags ;). Also - what are the explicit sync semantics of this ioctl? The XFS ioctl causes a fsync of the file first to convert delalloc extents to real extents before returning the bmap. Is this functionality going to be the same? If not, then we need a DELALLOC flag to indicate extents that haven't been allocated yet. This might be handy to have, anyway > All offsets are in bytes to allow cases where filesystems are not going > block-aligned/sized allocations (e.g. tail packing). So it'll be ok for a few years yet ;) > The fm_extents array > returned contains the packed list of allocation extents for the file, > including entries for holes (which have fe_start == 0, and a flag). Internalling in XFS, we pass these around as: #define DELAYSTARTBLOCK ((xfs_fsblock_t)-1LL) #define HOLESTARTBLOCK ((xfs_fsblock_t)-2LL) And the offset passed out through XFS_IOC_GETBMAP[X] is a block number of -1 for the start of a hole. Hence we don't need a flag for this. We can expose delalloc extents like this as well without needing flags... > The ->fm_extents[] array includes all of the holes in addition to > allocated extents because this avoids the need to return both the logical > and physical address for every extent and does not make processing any > harder. Doesn't really make it any easier to map to disk, either. > One feature that XFS_IOC_GETBMAPX has that may be desirable is the > ability to return unwritten extent information. You got that with the unwritten flag above. > required expanding the per-extent struct from 32 to 48 bytes per extent, not sure I follow your maths here? > but I'd rather limit a single extent to e.g. 2^56 bytes (oh, what hardship) > and keep 8 bytes or so for input/output flags per extent (would need to ^ bits? > be masked before use). > > > Caller works something like: > > char buf[4096]; > struct fibmap *fm = (struct fibmap *)buf; > int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent); > > fm->fm_extent.fe_start = 0; /* start of file */ > fm->fm_extent.fe_len = -1; /* end of file */ > fm->fm_extent_count = count; /* max extents in fm_extents[] array */ > fm->fm_flags = 0;
Re: [RFC] Heads up on sys_fallocate()
On Tue, Mar 06, 2007 at 10:46:56AM -0600, Eric Sandeen wrote: > Ulrich Drepper wrote: > > Christoph Hellwig wrote: > >> fallocate with the whence argument and flags is already quite complicated, > >> I'd rather have another call for placement decisions, that would > >> be called on an fd to do placement decissions for any further allocations > >> (prealloc, write, etc) > > > > Yes, posix_fallocate shouldn't be made more complicated. But I don't > > understand why requesting linear layout of the blocks should be an > > option. It's always an advantage if the blocks requested this way are > > linear on disk. So, the kernel should always do its best to make this > > happen, without needing an additional option. > > > > Agreed on both points. The hints would be for things like start block, > or speculative EOF preallocation, not contiguity, which I think should > always be the goal. ISTR having had this discussion before ;) About guided preallocation for defrag: http://marc.info/?t=11624785951&r=1&w=2 e.g.: The sorts of policies we need for effective use of preallocation: http://marc.info/?l=linux-fsdevel&m=116184475308164&w=2 http://marc.info/?l=linux-fsdevel&m=116278169519095&w=2 Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] delayed allocation for ext4
On Sat, Dec 23, 2006 at 10:09:57PM +0300, Alex Tomas wrote: > > Good day, > > >>>>> David Chinner (DC) writes: > > DC> So that mean's we'll have 2 separate mechanisms for marking > DC> pages as delalloc. XFS uses the BH_delay flag to indicate > DC> that a buffer (block) attached to the page is using delalloc. > > well, for blocksize=pagesize we can save 56 bytes on every page. Sure, but it means that ext4 w/ delalloc won't work on lots of machines > DC> FWIW, how does this mechanism deal with block size < page size? > DC> Don't you have to track delalloc on a block basis rather than > DC> a page basis? > > I'm still thinking how better to deal with that w/o much code duplication. Code duplication in ext4, or across all filesystems? > DC> Ah, that's why you can get away with a page flag - you've ignored > DC> the partial page delay state problem. Any plans to use the > DC> existing method in the future so we will be able to use ext4 delalloc > DC> on machines with a page size larger than 4k? > > what do you mean by "exsiting"? BH_delay? Yes. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] delayed allocation for ext4
On Sat, Dec 23, 2006 at 09:27:18AM +, Christoph Hellwig wrote: > On Sat, Dec 23, 2006 at 02:31:23PM +1100, David Chinner wrote: > > > - ext4-delayed-allocation.patch > > >delayed allocation itself, enabled by "delalloc" mount option. > > >extents support is also required. currently it works only > > >with blocksize=pagesize. > > > > Ah, that's why you can get away with a page flag - you've ignored > > the partial page delay state problem. Any plans to use the > > existing method in the future so we will be able to use ext4 delalloc > > on machines with a page size larger than 4k? > > I think fixing this up for blocksize < pagesize is an absolute requirement > to get things merged. We don't need more filesystems that are crippled > on half of our platforms. > > Note that recording delayed alloc state at a page granularity in addition > to just the buffer heads has a lot of advantages aswell and would help > xfs, too. But I think it makes a lot more sense to record it as a radix > tree tag to speed up the gang lookups for delalloc conversion. I'm not sure it will make that much difference, really. Looking up by delalloc tag is only going to save a few tail pages in pagevec we use for the look up and could be more expensive if delalloc pages are sparsely distributed through the file. We'd still have to keep the bufferheads around for partial page state, and that becomes an interesting exercise in keeping things coherent between the radix tree and the buffer heads. Of course, then there's the unwritten state that XFS also carries around per block (bufferhead) which has all the same issues as the delalloc state. I'd hate to have a generic method for handling delalloc state which is different from the handling of the unwritten state and needing two different sets of code to handle what is essentially the same thing Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] delayed allocation for ext4
On Fri, Dec 22, 2006 at 11:20:08PM +0300, Alex Tomas wrote: > > Good day, > > probably the previous set of patches (including mballoc/lg) > is too large. so, I reworked delayed allocation a bit so > that it can be used on top of regular balloc, though it > still can be used with extents-enabled files only. > > this time series contains just 3 patches: > > - booked-page-flag.patch >adds PG_booked bit to page->flags. it's used in delayed >allocation to mark space is already reserved for page >(including possible metadata) So that mean's we'll have 2 separate mechanisms for marking pages as delalloc. XFS uses the BH_delay flag to indicate that a buffer (block) attached to the page is using delalloc. FWIW, how does this mechanism deal with block size < page size? Don't you have to track delalloc on a block basis rather than a page basis? > - ext4-block-reservation.patch >this is scalable free space management. every time we >delay allocation of some page, a space (including metadata) >should be reserved > > - ext4-delayed-allocation.patch >delayed allocation itself, enabled by "delalloc" mount option. >extents support is also required. currently it works only >with blocksize=pagesize. Ah, that's why you can get away with a page flag - you've ignored the partial page delay state problem. Any plans to use the existing method in the future so we will be able to use ext4 delalloc on machines with a page size larger than 4k? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, Dec 06, 2006 at 09:35:30PM -0500, Josef Sipek wrote: > On Thu, Dec 07, 2006 at 12:44:27PM +1100, David Chinner wrote: > > Maybe we should be using EAs for this sort of thing instead of flags > > on the inode? If we keep adding inode flags for generic features > > then we are going to force more than just XFS into inode format > > changes eventually > > Aren't EAs slow? Maybe not on XFS but on other filesystems... Only when they don't fit in the inode itself and extra disk seeks are needed to retrieve them. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Wed, Dec 06, 2006 at 07:56:19PM -0500, Josef Sipek wrote: > On Wed, Dec 06, 2006 at 08:11:00PM +1100, David Chinner wrote: > > They are defined but unused in 2.6.19, right? I can't see anywhere > > in the 2.6.19 ext2/3/4/reiser trees that actually those flags, > > including setting and retrieving them from disk. JFS i can see > > sets, clears and retreives them, but not the fielsystems you > > mention. Though I might just be blind. ;) > > > > If all we need to add to XFS is support for those flags, then XFS > > support would be trivial to add. > > > > Oh, damn. I take that back. We're almost out of flag space in the on > > disk inode - these two flags would use the last 2 flag bits so this > > may require an on disk inode format change in XFS. This will be > > a little more complex than I first thought, but not impossible > > as we already support two on-disk inode format versions. > > Hrm. I was toying around with the idea of using a flag to mark inodes as > whiteouts (similar to what BSD does) for Unionfs. I remember that Jan Blunck > tried similar thing in his implementation of VFS unionfs mounts. > > I am not entirely convinced that whiteout inode flag is the right way to do > things, but I'm just raising this now as I wouldn't want to wait for new > ondisk format for XFS to say that Unionfs supports XFS. (Assuming that it is > the right approach.) ;-) Maybe we should be using EAs for this sort of thing instead of flags on the inode? If we keep adding inode flags for generic features then we are going to force more than just XFS into inode format changes eventually Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Tue, Dec 05, 2006 at 11:41:28AM -0500, Nikolai Joukov wrote: > > > As we promised on the linux-ext4 list on October 31, here is the patch > > > that adds secure deletion via a trash-bin functionality for ext4. It is a > > > compromise solution that combines secure deletion with the trash-bin > > > support > > > (the latter had been requested by even more people than the former :-). > > > > Given that almost all of the code for this uses vfs interfaces and > > only a couple of simple filesystem hooks, is there any reason for > > this being ext4 specific? i.e. why aren't you hooking the vfs layer > > so we get a single undelete/secure delete implementation for all > > filesystems? > > You are right. Actually, we mentioned it as a benefit number 4 of this > approach in the original post. Most of the code is not > file-system--specific and can be used by any other (all other?) file > system(s). The only complication is that only ext2/3/4 and reiser file > systems already support the per-file secure deletion and undelete > attributes. They are defined but unused in 2.6.19, right? I can't see anywhere in the 2.6.19 ext2/3/4/reiser trees that actually those flags, including setting and retrieving them from disk. JFS i can see sets, clears and retreives them, but not the fielsystems you mention. Though I might just be blind. ;) If all we need to add to XFS is support for those flags, then XFS support would be trivial to add. Oh, damn. I take that back. We're almost out of flag space in the on disk inode - these two flags would use the last 2 flag bits so this may require an on disk inode format change in XFS. This will be a little more complex than I first thought, but not impossible as we already support two on-disk inode format versions. > Since ext4 is in early development now, we believe it'd be easier to get > such code into ext4 than into the main-line VFS. If there's enough > interested among the kernel maintainers, we'd be happy to move this > functionality to the VFS and provide f/s hooks for > secure-deletion/trash-bin. I'd certainly like to see new functionality like this added at the VFS rather than in any particular filesystem. Don't know about anyone else, though > I guess, the right thing to do would be to move the common trash-bin > (tb.c and tb.h) code into the /fs and /include/linux directories. And call them trashbin.[ch] ;) > Also, VFS would require just a couple of trivial changes to support > something like '-o trashbin' mount-time option for all file systems. > In addition, individual file systems may support per-file attributes for > this (e.g., ext2/3/4). Sounds like a good idea to me. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Secure Deletion and Trash-Bin Support for Ext4
On Mon, Dec 04, 2006 at 01:33:55PM -0500, Nikolai Joukov wrote: > As we promised on the linux-ext4 list on October 31, here is the patch > that adds secure deletion via a trash-bin functionality for ext4. It is a > compromise solution that combines secure deletion with the trash-bin support > (the latter had been requested by even more people than the former :-). Given that almost all of the code for this uses vfs interfaces and only a couple of simple filesystem hooks, is there any reason for this being ext4 specific? i.e. why aren't you hooking the vfs layer so we get a single undelete/secure delete implementation for all filesystems? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Defragmentation interface
On Mon, Nov 06, 2006 at 06:44:58PM +0100, Jan Kara wrote: > > On Fri, Nov 03, 2006 at 03:30:30PM +0100, Jan Kara wrote: > > > > BTW, does use of sysfs mean ASCII encoding of all the data > > > > passing between kernel and userspace? > > > Not necessarify but mostly yes. At least I intend to have all the > > > files I have proposed in ASCII. > > > > Ok - that's how you're looking to avoid 32/64bit compatibility issues? > Yes. > > > It will make the interface quite verbose, though, and entail significant > > encoding and decoding costs > It would be verbose. On the other hand for most things it should not > matter (not too much data goes through the interface and it's not too > performance critical). Except when you have a filesystem with fragmented free space. That could be a lot of information (think searching through terabytes of fragmetned free space before finding a contiguous block big enough). > > Right. More complicated requests are something that we need to > > support in XFS in the short-medium term. We _need_ an interface to > > XFS that allows complex, compound allocation policies to be > > accessible from userspace - and this is not just for defrag > > programs. > > > > I think a set of well defined allocation primitives suits a syscall > > interface far better than a per-filesystem sysfs interface. > I'm only afraid of one thing: Once you define a syscall it's hard to > change anything and for this kind of thing I'm not sure we are able to > tell what we'll need in two years... That is basically my main > concern with implementing this interface as a syscall. True, but there's only so many ways you can ask for free space to be found or allocated. And there's a version number in the policy structure so that interface is extensible. Also, for the limited scope of the interfaces I don't see much change being needed over time but the change tht is needed can be handled by the strucutre version. For the sysfs interface, it needs to be very flexible and extensible because of the amount of filesystem specific information it can expose and would need to expose to be usable by multiple filesystems in a generic manner... Hence I think different criteria apply here - the syscalls implement single mechanisms, whereas the sysfs interface allows deeper, more intrusive delving into filesystem internals while at the same time providing mechanisms for modification of the filesystem. If we are going to provide simple mechanisms to do certain operations, then I'd prefer to see it done as specific, targeted syscalls rather than buried within a multi-purpose, cross-filesystem sysfs interface. > > > > - every time you fail an allocation, you need to reread this file. > > > Yes, that's the most serious disadvantage I see. Do you see any way > > > out of it in any interface? > > > > I haven't really thought about solutions for this interface - the > > syscall interface doesn't have this problem because of the way you > > can specify where you want free blocks from > But that does not solve the problem with having to repeat the search, > does it? Only with the syscall interface filesystem can possibly search > for free blocks more efficiently.. Right - the repeat search is not a problem because the overhead is far lower with the syscall interface. > > > > - stripe unit and stripe width need to be exposed so defrag too > > > > can make correct placement decisions. > > > fs-specific thing... > > > > As Andreas said, this isn't fs-specific. XFS takes sunit and swidth > > as mkfs parameters so it can align both metadata and data optimally > > for RAID devices. Other fileystems have different methods of > > specifying this (ext2/3/4 use -E stride-size for this), but it would > > need to be exposed in some way > I see. But then shouldn't we expose it regardless the interface > (sysfs/syscall) we choose so that userspace can take it into account > when picking where to allocate? Yes. In terms of the syscall interface, it is simple to do without having to tell the application about alignment #define POLICY_ALLOC_ALIGN_SUNIT #define POLICY_ALLOC_ALIGN_SWIDTH Now the kernel only returns blocks that are correctly aligned. If the fileystem can't find any aligned blocks, set the fallback policy to do the same search but without the alignment restriction. So, the interface doesn't need to expose the actual values, just a method to support aligned allocations. Once again, leverage the smarts the existing fiesystem allocator already has rather than requiring alignment calculations to be done by the applicaiton. Come to think of it, it would probably be better to do aligned allocation by default and to have flags to turn off alignment. Either way, the application doesn't need to know what the alignment restrictions are with the syscall interface - it's just another policy decision > > > > > meta/nodes/ > > > > > - this should be a directory containing things specific for
Re: [RFC] Defragmentation interface
On Fri, Nov 03, 2006 at 03:30:30PM +0100, Jan Kara wrote: > > > So in this email I try to propose some interface which should hopefully > > > address most of the concerns. The type of the interface is sysfs like > > > (idea taken from ext2meta) - that has a few advantages: > > > - no 32/64-bit compatibility issues > > > - easily extensible > > > - generally nice ;) > > > > - complex > > - over-engineered > > - little common code between filesystems > The first two may be but actually I don't think you'll have too much > common code among fs anyway whatever interface you choose. > > > BTW, does use of sysfs mean ASCII encoding of all the data > > passing between kernel and userspace? > Not necessarify but mostly yes. At least I intend to have all the > files I have proposed in ASCII. Ok - that's how you're looking to avoid 32/64bit compatibility issues? It will make the interface quite verbose, though, and entail significant encoding and decoding costs > > > Each filesystem willing to support this interface implements special > > > filesystem (e.g. ext3meta, XFSmeta, ...) and admin/defrag-tool mounts it > > > to some directory. > > > > - not useful for wider audiences like applications that would like > > to direct allocation > Why not? A simple tool could stat file, get ino, put some number in > alloc_goal... - Root permissions. - multiple files need to be opened, read, written, closed - high overhead of searching for free blocks in the area you want - difficult to control alloc_goal with multi-threaded programs - potential for each filesystem to have a different meta structures > > > There are parts of this interface which should be > > > common for all filesystems (so that tools don't have to care about > > > particular filesystem and still get some useful results), other parts > > > are fs-specific. Here is basic structure I propose: > > > > > > meta/features > > > - bitmap of features supported by the interface (ext2/3-like) so that > > > the tool can verify whether it understands the interface and don't > > > mess with it otherwise > > > > - grow very large, very quickly if it has to support all the > > different quirks of different filesystems. > Yes, that may be a problem... > > > > meta/allocation/free_blocks > > > - RO file - if you read from fpos F, you'll get a list of extents > > > describing areas with free blocks (as many as fits into supplied > > > buffer) starting from block F. Fpos of your file descriptor is > > > shifted to the first unreported free block. > > > > - linear search properties == Bad. (think fs sizes of hundreds of > > terabytes - XFS is already deployed with filesystems of this size) > OK, so what do you propose? You want syscall find_free_blocks() and > my idea of it was that it will do basically the same think as my > interface. Using the above interface I guess you'd have to seek and read until you found records with block numbers near to what you'd require. It is effectively: find_free_blocks(fd, policy, &list, nblocks) struct policy { __u64 version; __u64 blkno; __u64 len; __u64 group; __u64 policy; __u64 fallback_policy; } #define ALLOC_POLICY_EXACT_LEN (1<<0)ULL #define ALLOC_POLICY_EXACT_BLOCK(1<<1)ULL #define ALLOC_POLICY_EXACT_GROUP(1<<2)ULL #define ALLOC_POLICY_MIN_LEN(1<<3)ULL #define ALLOC_POLICY_NEAR_BLOCK (1<<4)ULL #define ALLOC_POLICY_NEAR_GROUP (1<<5)ULL #define ALLOC_POLICY_NEXT_BLOCK (1<<6)ULL #define ALLOC_POLICY_NEXT_GROUP (1<<7)ULL The sysfs interface you propose is effectively: memset(&policy, 0, sizeof(policy)); policy.policy = ALLOC_POLICY_NEXT_BLOCK; do { find_free_blocks(fd, &policy, &list, nblocks); /* process free block list */ . /* get next blocks */ policy.blkno = list[nblocks - 1].blkno } while (policy.blkno != EOF); However, this can be optimised for a given search where the location is known beforehand to: memset(&policy, 0, sizeof(policy)); policy.policy = ALLOC_POLICY_NEAR_BLOCK; policy.blkno = X; find_free_blocks(fd, &policy, &list, nblocks); If you then chose to allocate from this list and it fails, you simply redo the above. With the sysfs interface, if you want to find a single contiguous run of blocks, you'd probably just have to read the entire file and search it for the pattern of blocks you want. With XFS, we already have this information indexed in btrees, so we don't want to have to read the entire btree just to find something we could with a single btree lookup. i.e: memset(&policy, 0, sizeof(policy)); policy.policy = ALLOC_POLICY_EXACT_LEN; policy.len = X; find_free_blocks(fd, &policy, &list, nblocks); Or indeed, something close to the block we
Re: [RFC] Defragmentation interface
On Thu, Nov 02, 2006 at 03:39:29PM +0100, Jan Kara wrote: > Hi, > > from the thread after my patch implementing ext3 online > defragmentation I found out that probably the only (and definitely the > biggest) issue is the interface. Someone wants is common enough so that > we can profit from common tools for several filesystems, others object > that some applications, e.g. defragmenter, need to know something about > ext3 internals to work reasonably well. Moreover ioctl() is ugly and has > some compatibility issues, on the other hand ext2meta is too lowlevel, > fs-specific and it would be hard to do any reasonable application > crash-safe... > So in this email I try to propose some interface which should hopefully > address most of the concerns. The type of the interface is sysfs like > (idea taken from ext2meta) - that has a few advantages: > - no 32/64-bit compatibility issues > - easily extensible > - generally nice ;) - complex - over-engineered - little common code between filesystems BTW, does use of sysfs mean ASCII encoding of all the data passing between kernel and userspace? > Each filesystem willing to support this interface implements special > filesystem (e.g. ext3meta, XFSmeta, ...) and admin/defrag-tool mounts it > to some directory. - not useful for wider audiences like applications that would like to direct allocation > There are parts of this interface which should be > common for all filesystems (so that tools don't have to care about > particular filesystem and still get some useful results), other parts > are fs-specific. Here is basic structure I propose: > > meta/features > - bitmap of features supported by the interface (ext2/3-like) so that > the tool can verify whether it understands the interface and don't > mess with it otherwise - grow very large, very quickly if it has to support all the different quirks of different filesystems. > meta/allocation/free_blocks > - RO file - if you read from fpos F, you'll get a list of extents > describing areas with free blocks (as many as fits into supplied > buffer) starting from block F. Fpos of your file descriptor is > shifted to the first unreported free block. - linear search properties == Bad. (think fs sizes of hundreds of terabytes - XFS is already deployed with filesystems of this size) - cannot use smart requests like given me free blocks near X, in AG Y or Z, etc. - some filesystems have more than one data area - e.g. XFS has the realtime volume. - every time you fail an allocation, you need to reread this file. > meta/super/blocksize > - filesystem block size fcntl(FIGETBSZ). Also: - some filesystems can use different block sizes for different structures (e.g XFs directory blocks canbe larger than the fsb) - stripe unit and stripe width need to be exposed so defrag too can make correct placement decisions. - extent size hints, etc. Hence this will require the spuer/ directory to be extensible in a filesystem specific interface. > meta/super/id > - filesystem ID (for paranoid tools to verify that they are accessing > really the right meta-filesystem) - UUID, please. > meta/nodes/ > - this should be a directory containing things specific for a fs-object > with identification . In case of ext3 these would be inode > numbers, I guess this should be plausible also for XFS and others > but I'm open to suggestions... > - directory contains the following: > alloc_goal > - block number with current allocation goal The kernel has to store this across syscalls until you write into data/alloc? That sounds dangerous... > data/extents > - if you read from this file, you get a list of extents describing > data blocks (and holes) of the file. The listing starts at logical > block fpos. Fpos is shifted to the first unreported data block. fcntl(FIBMAP) > data/alloc > - you write there a number L and fs allocates L blocks to a file > (preferable from alloc_goal) starting from file-block fpos. Fpos > is shifted after the last block allocated in this call. You seek to the position you want (in blocks or bytes?), then write a number into the file (in blocks or bytes)? That's messy compared to a function call with an offset and length in it > data/reloc > - you write there and relocation of data happens as follows: > All blocks that are allocated both in original file and > are relocated to . Write returns number of relocated > blocks. You can only relocate to a new inode (which in XFS will change the inode number)? What happens if there are blocks in duplicate offsets in both inodes? What happens if all the blocks aren't relocated - how do you handle this? Let me get this straight - the interface you propose for moving data about is: read and process extents into an internal structure find range where you want to relocate find free space you want to relocate
Re: [RFC] Ext3 online defrag
On Thu, Oct 26, 2006 at 01:37:22PM +0200, Jan Kara wrote: > > On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote: > > We don't need to expose anything filesystem specific to userspace to > > implement this. Online data movement (i.e. the defrag mechanism) > > becomes something like: > > > > do { > > get_free_list(dst_fd, location, len, list) > > /* select extent to use */ > Upto this point I can imagine we can be perfectly generic. > > > alloc_from_list(dst_fd, list[X], off, len) > > } while (ENOALLOC) > > move_data(src_fd, dst_fd, off, len); > With these two it's not clear how well can we do with just a generic > interface. Every filesystem needs to have some additional metadata to > keep list of data blocks. In case of ext2/ext3/reiserfs this is not > a negligible amount of space and placement of these metadata is important > for performance. Yes, the same can be said for XFS. However, XFS's extent btree implementation uses readahead to hide a lot of the latency involved with reading extent map, and it only needs to read it once per inode lifecycle > So either we focus only on data blocks and let > implementation of alloc_from_list() allocate metadata wherever it wants > (but then we get suboptimal performace because there need not be space > for indirect blocks close before our provided extent) I think the first step would be to focus on data blocks using something like the above. There are many steps to full filesystem defragmentation, but data fragmetnation is typically the most common symptom of fragmentation that we see. > or we allocate > metadata from the provided list, but then we need some knowledge of fs > to know how much should we expect to spend on metadata and where these > metadata should be placed. That's the second step, I think. For example, we could count the metadata blocks used in metadata structure (say an block list), allocate a new chunk like above, and then execute a "move_metadata()" type of operation, which the filesystem does internally in a transactionally safe manner. Once again, generic interface, filesystem specific implementations. > For example if you know that indirect block > for your interval is at block B, then you'd like to allocate somewhere > close after this point or to relocate that indirect block (and all the > data it references to). But for that you need to know you have something > like indirect blocks => filesystem knowledge. *nod* This is far less of a problem with extent based filesystems - coalescing all the fragments into a single extent removes the need for indirect blocks and you get the extent list for free when you read the inode. When we do have a fragmented file, XFS uses readahead to speed btree searching and reading, so it hides a lot of the latency overhead that fragmented metadata can cause. Either way, these lists can still be optimised by allocating a set of contiguous blocks and copying the metadata into them and updating the pointers to the new blocks. It can be done separately to the data moving and really should be done after the data has been defragmented > So I think that to get this working, we also need some way to tell > the program that if it wants to allocate some data, it also needs to > count with this amount of metadata and some of it is already allocated > in given blocks... If you want to do it all in one step. However, it's not quite that simple for something like XFS. An allocation may require a btree split (or three, actually) and the number of blocks required is dependent on the height of the btrees. So we don't know how many blocks we'll need ahead of time, and we'd have to reach deep into the allocator and abuse it badly to do anything like this. It's not something I want to even contemplate doing. :/ Also, we don't want to be mingling global metadata with inode specific metadata so we don't want to put most of the new metadata blocks near the extent we are putting the data into. That means I'd prefer to be able to optimise metadata objects separately. e.g. rewrite a btree into a single contiguous extent with the btree blocks laid out so the readahead patterns result in sequential I/O. The kernel would need to do this in XFS because we'd have to lock the entire btree a block at a time, copy it and then issue a "swap btree" transaction. most other journalling filesystems will have similar requirements, I think, for doing this online That's a very similar concept to the move_data() interface... > > I see substantial benefit moving forward from having filesystem > > independent interfaces. Many features that filesystems implement > > are common, and as time goes on the common feature set of the > > different filesystems gets larger. So why shouldn't we be > > trying to make common operations generic so that every filesystem > > can benefit from the latest and greatest tool? > So you prefer to handle only "data blocks" part of the
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 11:33:16PM -0400, Theodore Tso wrote: > On Thu, Oct 26, 2006 at 11:40:20AM +1000, David Chinner wrote: > > We don't need to expose anything filesystem specific to userspace to > > implement this. Online data movement (i.e. the defrag mechanism) > > becomes something like: > > > > do { > > get_free_list(dst_fd, location, len, list) > > /* select extent to use */ > > alloc_from_list(dst_fd, list[X], off, len) > > } while (ENOALLOC) > > move_data(src_fd, dst_fd, off, len); > > > > And this would work on any filesystem type that implemented these > > interfaces. Hence tools like a startup file optimiser would > > only need to be written once, rather than needing a different > > tool for every different filesystem type. > > Yeah, but that's simply not enough. Not enough for what? > A good defragger needs to know Oh, we're back to defrag again. :/ > about a filesystem's allocation policies, and move files so they are > optimally located, given the filesystem layout. For example, in > ext2/3/4 we will want to move blocks so they in the same block group > as the inode. That's filesystem specific information; other > filesystems will require different policies. Of which a good chunk of policies will be common. the above policy has been around for many, many years and is implemented in many, many filesystems (even XFS). > > get_free_list(dst_fd, location, len, list) location == allocation policy. e.g: give me a list of free blocks: - anywhere (default filesystem policy applies) - near block number X - at block X - in block/allocation group Y - of the largest contiguous regions in (one of the above) - at least N blocks in length - near inode src_fd - in storage tier 3 then you select one of the regions that was returned at attempt to allocate that. You can put whatever filesystems specific stuff you need around this to arrive at the decision of where to put the file, but you've got to allocate the new blocks, move the data to them, and swap them over. Every defragger needs to do this, regardless of the filesystem type. So why not provide a framework for it, especially as the framework is useful for far more than just as the data movement part of a defrag application. > > Remember, I'm not just talking about defrag - I'm talking about > > an interface that is actually useful to apps that might care > > about how data is laid out on disk but the applications writers > > don't know anyhting about how filesystem X or Y or Z is > > implemented. Putting the burden of learning about fileystem > > internals on application developers is not the correct solution. > > Unfortunately, if you want to do a good job, a defragger *has* to know > about some very low-level filesystem specific information, if it wants > to do a good job. Back to defrag. Again. Bigger picture, guys, bigger picture. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote: > On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote: > > On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote: > > > On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote: > > > > On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote: > > > > So why are you arguing that an interface is no good because it > > > > is fundamentally racy? ;) > > > > > > My point was that it is silly to introduce obviously racy code into the > > > kernel, when -- inside the kernel -- it could be handled race-free. > > > > So how do you then get the generic interface to allocate blocks > > specified by userspace race free? > > As has been repeatedly stated, there is no "generic". There MUST be > filesystem-specific knowledge during these operations. What information? All we need to know is where the free disk space is, and have a method to attempt to allocate from it. That's _easy_ to abstract into a common interface via the VFS > > > Further, in the case being discussed in this thread, ext2meta has > > > already been proven a workable solution. > > > > Sure, but that's not a generic solution to a problem common to > > all filesystems > > You clearly don't know what I'm talking about. ext2meta is an example > of a filesystem-specific metadata access method, applicable to tasks > such as online optimization. I know exactly what ext2meta is. I said it's not a generic solution and you say its a filesystem specific solution. I think we're agreeing here. ;) We don't need to expose anything filesystem specific to userspace to implement this. Online data movement (i.e. the defrag mechanism) becomes something like: do { get_free_list(dst_fd, location, len, list) /* select extent to use */ alloc_from_list(dst_fd, list[X], off, len) } while (ENOALLOC) move_data(src_fd, dst_fd, off, len); And this would work on any filesystem type that implemented these interfaces. Hence tools like a startup file optimiser would only need to be written once, rather than needing a different tool for every different filesystem type. Remember, I'm not just talking about defrag - I'm talking about an interface that is actually useful to apps that might care about how data is laid out on disk but the applications writers don't know anyhting about how filesystem X or Y or Z is implemented. Putting the burden of learning about fileystem internals on application developers is not the correct solution. I see substantial benefit moving forward from having filesystem independent interfaces. Many features that filesystems implement are common, and as time goes on the common feature set of the different filesystems gets larger. So why shouldn't we be trying to make common operations generic so that every filesystem can benefit from the latest and greatest tool? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote: > On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote: > > On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote: > > So why are you arguing that an interface is no good because it > > is fundamentally racy? ;) > > My point was that it is silly to introduce obviously racy code into the > kernel, when -- inside the kernel -- it could be handled race-free. So how do you then get the generic interface to allocate blocks specified by userspace race free? > > > Every major filesystem has a libfoofs library that makes it trivial to > > > read the metadata, so all you need to do is use an existing lib. > > > > IOWs, you are advocating that any application that wants to use this > > special allocation technique needs to link against every different > > filesystem library and it then needs to implement filesystem > > specific searches through their metadata? Nobody in their right > > mind would ever want to use an interface like this. > > Online defrag is OBVIOUSLY highly filesystem specific. Parts of it are, but data movement and allocation hints need to be provided by every filesystem that wants to implement this efficiently. These features are also useful outside of defrag as well - I can think of several applications that would benefit from being able to direct where in the filesystem they want data to reside. If userspace directed allocation requires deep knowledge of the filesystem metadata (this is what you are saying they need to do, right?), then these applications will never, ever make use of this interface and we'll continue to have problems with them. I guess my point is that we are going to implement features like this in XFS and if other filesystems are going to be doing the same thing then we should try to come up with generic solutions rather than reinvent the wheel over an over again. > Further, in the case being discussed in this thread, ext2meta has > already been proven a workable solution. Sure, but that's not a generic solution to a problem common to all filesystems Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote: > On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote: > > But it a race that is _easily_ handled, and applications only need to > > implement one interface, not a different method for every > > filesystem that requires deeep filesystem knowledge. > > > > Besides, you still have to handle the case where the block you want > > has already been allocated because reading the metadata from > > userspace doesn't prevent the kernel from allocating the block you > > want before you ask for it... > > The race is easily handled either way, by having the block move fail > when you tell the kernel the destination blocks. So why are you arguing that an interface is no good because it is fundamentally racy? ;) > The difference is that you don't unnecessarily bloat the kernel. By that argument, we should rip out the bmap interface (FIBMAP) because you can get all that information by reading the metadata from userspace. > Every major filesystem has a libfoofs library that makes it trivial to > read the metadata, so all you need to do is use an existing lib. IOWs, you are advocating that any application that wants to use this special allocation technique needs to link against every different filesystem library and it then needs to implement filesystem specific searches through their metadata? Nobody in their right mind would ever want to use an interface like this. Also, this simply doesn't work for XFS because the cached metadata is in a different address space to the block device. Hence it can be tens of seconds between the kernel modifying a metadata buffer and userspace being able to see that modification. You need to freeze the filesystem for the XFS userspace tools to guarantee a consistent view of an online filesystem from the block device. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Tue, Oct 24, 2006 at 10:42:57PM -0400, Jeff Garzik wrote: > On Wed, Oct 25, 2006 at 12:30:02PM +1000, Barry Naujok wrote: > > Could we have a more abstract method for asking the filesystem where the > > free blocks are and then using the same block addressing to tell the > > fs where to allocate/move the file's data to? > > That's fundamentally racy, so you might as well just read the > filesystem metadata from userspace. No need to go through the kernel > for that. But it a race that is _easily_ handled, and applications only need to implement one interface, not a different method for every filesystem that requires deeep filesystem knowledge. Besides, you still have to handle the case where the block you want has already been allocated because reading the metadata from userspace doesn't prevent the kernel from allocating the block you want before you ask for it... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Tue, Oct 24, 2006 at 03:44:16PM -0400, Theodore Tso wrote: > On Tue, Oct 24, 2006 at 11:59:28PM +1000, David Chinner wrote: > > That's the wrong way to look at it. if you want the userspace > > process to specify a location, then you should preallocate it first > > before doing anything else. There is no need to clutter a simple > > data mover interface with all sorts of unnecessary error handling. > > This is doable, but it adds a huge amount of complexity before we > could implement on-line defragmentation. > > First of all, we would need a way of allowing userpsace to specify > which blocks should be used in the preallocation. Not initially. Create a file, and call posix_fallocate() on it. Later, the filesystem can provide something that the defrag tool can use for fine-grained control of where the preallocated blocks are on disk. > Secondly, we would need a way of marking blocks as "preallocated but > not pre-zeroed"; otherwise we would have to zero out all of the blocks > in order to assure security (don't want userspace programs seeing the > previous contents of the data blocks), only to do the copy and the > extents vector swap. The unlinked inode method avoids this problem because no user space process can see the inode to open it. Also, posix_fallocate() zeroes the disk blocks so even this protects against data exposure. So, now all that remains for an initial implementation is the swap extents transaction and the data mover syscall. For a smart, fast implementation, I agree that you need unwritten extents (which XFS already has), then a fast filesystem implementation of posix_fallocate() that utilises unwritten extents (which XFS already has), and finally another interface that allows you to allocate unwritten extents in an arbitrary location within the filesystem (which no filesystem currently has). > That's a huge amount of work, and while the above two features can be > useful for other things, it's not clear it's worth it to require this > as the only way to implement on-line defragging. You're right that > it's a way of making things be more generic, but it means that each > filesystem needs to have a huge amount of additional complexity and > potential filesystem format changes before they could take advantage > of this general framework. I disagree - it's not a huge amount of work to get some thing working and to solidify the generic interfaces and only format change is a new transaction. Any filesystem that supports the swap extent/blocks method would then work better than XFs's current online defrag tool which currently does not use preallocation, nor does it use splice. > (For example, you'd never be able to do this with the FAT filesystem, > or the ext2 or ext3 filesystems; it would work for ext4 only *after* > we implement the above mentioned new features and the associated > filesystem format changes.) Sure, but they can use the slow, unoptimised posix_fallocate() method for allocating disk space Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Tue, Oct 24, 2006 at 11:26:26AM -0500, Dave Kleikamp wrote: > On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote: > > On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote: > > > On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote: > > > > That's the wrong way to look at it. if you want the userspace > > > > process to specify a location, then you should preallocate it first > > > > before doing anything else. There is no need to clutter a simple > > > > data mover interface with all sorts of unnecessary error handling. > > > > > > You are implying the the 2-step interface, creating a new inode then > > > swapping the contents, is the only way to implement this. > > > > No, it's not the only way to implement it, but it seems the cleanest > > way to me when you have to consider crash recovery. With a temporary > > inode, you can create it, hold a reference and then unlink it so > > that any crash at that point will free the inode and any extents > > it has on it. > > > > The only way I can see anything different working is having the > > filesystem hold extents somewhere internally that provides us the > > same recovery guarantees while we copy the data and insert the new > > extents. This is obviously a filesystem specific solution and is > > more complex to implement than a swap extent transaction. it > > probably also needs on disk format changes to support properly > > This is definitely filesystem-dependent. I would think allocating an > extent would be like any other allocation done by the filesystem, and > there are already recovery mechanisms for that. Yes, the allocation would be the same, but that isn't the problem I was talking about. The problem is holding a reference to the extent once it has been allocated while it is having the data copied into it (i.e. before it is swapped with the original extents) and then holding the original extents until they are freed. These references need to be persistent so they can be freed correctly during crash recovery i.e. rollback the allocation if the extent swap has not been logged, or free the original blocks is the extent swap has been logged. The obvious way to do this is to use an unlinked (orphan) inode > > > > Once you've separated the destination allocation from the data > > > > mover, the mover is basically a splice copy from source to > > > > destination, an fsync and then an atomic swap blocks/extents operation. > > > > Most of this code is generic, and a per-fs swap-extents vector > > > > could be easily provided for the one bit that is not > > > > > > The benefit of having such a simple data mover is negated by moving the > > > complexity into the allocator. > > > > What complexity does it introduce that the allocator doesn't already > > have or needs to provide for the single call interface to work? > > I don't see it as any more or less complex than a single interface. Ok, I thought I was missing something there. > > The allocation interface needs to be be able to be extended > > independently of the data mover interface. XFS already exposes > > allocation ioctls to userspace for preallocation and we've got plans > > to extnd this further to allow userspace controlled allocation for > > smart defrag tools for XFS. Tying allocation to the data mover > > just makes the interface less flexible and harder to do anything > > smart with > > Okay. It would be nice to standardize the interface so we don't have > every filesystem introducing new ioctls. Well, that will be an interesting challenge. I'm sure that there is a common subset that all filesystems can implement e.g. per file preallocation (something like XFS's allocate/reserve/free space ioctls) to provide kernel support for posix_fallocate(), etc. However, we may end up exposing enough of XFS's current allocation semantics to do things like telling the filesystem to allocate in allocation group 6, near block number 0x32482 within the AG, falling back to searching for the nearest match to the size requirement, failing that look for something larger than the minimum size specified, and then fail if you can't find a match in that AG. That makes little sense to any filesystem but XFS, which is really why I think that the smarter allocation interfaces are going to remain filesystem specific Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote: > On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote: > > On Tue, Oct 24, 2006 at 12:14:33AM -0400, Jeff Garzik wrote: > > > On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote: > > > > isn't that a kernel responsbility to find/allocate target blocks? > > > > wouldn't it better to specify desirable target group and minimal > > > > acceptable chunk of free blocks? > > > > > > The kernel doesn't have enough knowledge to know whether or not the > > > defragger prefers one blkdev location over another. > > > > > > When you are trying to consolidate blocks, you must specify the > > > destination as well as source blocks. > > > > > > Certainly, to prevent corruption and other nastiness, you must fail if > > > the destination isn't available... > > > > That's the wrong way to look at it. if you want the userspace > > process to specify a location, then you should preallocate it first > > before doing anything else. There is no need to clutter a simple > > data mover interface with all sorts of unnecessary error handling. > > You are implying the the 2-step interface, creating a new inode then > swapping the contents, is the only way to implement this. No, it's not the only way to implement it, but it seems the cleanest way to me when you have to consider crash recovery. With a temporary inode, you can create it, hold a reference and then unlink it so that any crash at that point will free the inode and any extents it has on it. The only way I can see anything different working is having the filesystem hold extents somewhere internally that provides us the same recovery guarantees while we copy the data and insert the new extents. This is obviously a filesystem specific solution and is more complex to implement than a swap extent transaction. it probably also needs on disk format changes to support properly > > Once you've separated the destination allocation from the data > > mover, the mover is basically a splice copy from source to > > destination, an fsync and then an atomic swap blocks/extents operation. > > Most of this code is generic, and a per-fs swap-extents vector > > could be easily provided for the one bit that is not > > The benefit of having such a simple data mover is negated by moving the > complexity into the allocator. What complexity does it introduce that the allocator doesn't already have or needs to provide for the single call interface to work? > A single interface that would move a part of a file at a time has the > advantage that a large file which is only fragmented in a few areas does > not need to be completely moved. And the two-step process can do exactly this as well - splice can work on any offset within the file... > > The allocation interface, OTOH, is anything but simple and is really > > a filesystem specific interface. Seems logical to me to separate > > the two. > > So what then is the benefit of having a simple generic data mover if > every file system needs to implement it's own interface to allocate a > copy of the data? I assume you meant "allocate the space to store the copy of the data." The allocation interface needs to be be able to be extended independently of the data mover interface. XFS already exposes allocation ioctls to userspace for preallocation and we've got plans to extnd this further to allow userspace controlled allocation for smart defrag tools for XFS. Tying allocation to the data mover just makes the interface less flexible and harder to do anything smart with Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Tue, Oct 24, 2006 at 12:14:33AM -0400, Jeff Garzik wrote: > On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote: > > isn't that a kernel responsbility to find/allocate target blocks? > > wouldn't it better to specify desirable target group and minimal > > acceptable chunk of free blocks? > > The kernel doesn't have enough knowledge to know whether or not the > defragger prefers one blkdev location over another. > > When you are trying to consolidate blocks, you must specify the > destination as well as source blocks. > > Certainly, to prevent corruption and other nastiness, you must fail if > the destination isn't available... That's the wrong way to look at it. if you want the userspace process to specify a location, then you should preallocate it first before doing anything else. There is no need to clutter a simple data mover interface with all sorts of unnecessary error handling. Once you've separated the destination allocation from the data mover, the mover is basically a splice copy from source to destination, an fsync and then an atomic swap blocks/extents operation. Most of this code is generic, and a per-fs swap-extents vector could be easily provided for the one bit that is not The allocation interface, OTOH, is anything but simple and is really a filesystem specific interface. Seems logical to me to separate the two. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Directories > 2GB
On Wed, Oct 11, 2006 at 11:49:10AM -0500, Steve Lord wrote: > David Chinner wrote: > >On Tue, Oct 10, 2006 at 10:19:04AM +0100, Christoph Hellwig wrote: > >>On Mon, Oct 09, 2006 at 09:15:28PM -0500, Steve Lord wrote: > >>>Hi Dave, > >>> > >>>My recollection is that it used to default to on, it was disabled > >>>because it needs to map the buffer into a single contiguous chunk > >>>of kernel memory. This was placing a lot of pressure on the memory > >>>remapping code, so we made it not default to on as reworking the > >>>code to deal with non contig memory was looking like a major > >>>effort. > >>Exactly. The code works but tends to go OOM pretty fast at least > >>when the dir blocksize code is bigger than the page size. I should > >>give the code a spin on my ppc box with 64k pages if it works better > >>there. > > > >The pagebuf code doesn't use high-order allocations anymore; it uses > >scatter lists and remapping to allow physically discontiguous pages > >in a multi-page buffer. That is, the pages are sourced via > >find_or_create_page() from the address space of the backing device, > >and then mapped via vmap() to provide a virtually contigous mapping > >of the multi-page buffer. > > > >So I don't think this problem exists anymore... > > I was not referring to high order allocations here, but the overhead > of doing address space remapping every time a directory is accessed. Ah - ok. contig -> non-contig and OOM is usually discussed in the context of higher order allocations failing. FWIW, I've not noticed any extra overhead - the CPU usage seems to grow roughly linearly with the increase in directory operations done as a result of higher throughput for the same number of I/Os. I'll have a look at the Vm stats, though, next time I run a comparison to see how bad this is. Thanks for the clarification, Steve. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Directories > 2GB
On Tue, Oct 10, 2006 at 10:19:04AM +0100, Christoph Hellwig wrote: > On Mon, Oct 09, 2006 at 09:15:28PM -0500, Steve Lord wrote: > > Hi Dave, > > > > My recollection is that it used to default to on, it was disabled > > because it needs to map the buffer into a single contiguous chunk > > of kernel memory. This was placing a lot of pressure on the memory > > remapping code, so we made it not default to on as reworking the > > code to deal with non contig memory was looking like a major > > effort. > > Exactly. The code works but tends to go OOM pretty fast at least > when the dir blocksize code is bigger than the page size. I should > give the code a spin on my ppc box with 64k pages if it works better > there. The pagebuf code doesn't use high-order allocations anymore; it uses scatter lists and remapping to allow physically discontiguous pages in a multi-page buffer. That is, the pages are sourced via find_or_create_page() from the address space of the backing device, and then mapped via vmap() to provide a virtually contigous mapping of the multi-page buffer. So I don't think this problem exists anymore... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Directories > 2GB
On Mon, Oct 09, 2006 at 04:53:02PM -0500, Steve Lord wrote: > You might want to think about keeping the directory a little > more contiguous than individual disk blocks. XFS does have > code in it to allocate the directory in chunks larger than > a single file system block. It does not get used on linux > because the code was written under the assumption you can > see the whole chunk as a single piece of memory which does not > work to well in the linux kernel. This code is enabled and seems to work in Linux. I don't know if it passes xfsqa so I don't know how reliable this feature is. TO check it all I did was run a quick test on a x86_64 kernel (4k page size) using 16k directory blocks (4 pages): # mkfs.xfs -f -n size=16384 /dev/ubd/1 . # xfs_db -r -c "sb 0" -c "p dirblklog" /dev/ubd/1 dirblklog = 2 # mount /dev/ubd/1 /mnt/xfs # for i in `seq 0 1 10`; do touch fred.$i; done # umount /mnt/xfs # mount /mnt/xfs # ls /mnt/xfs |wc -l 10 # rm -rf /mnt/xfs/* # ls /mnt/xfs |wc -l 0 # umount /mnt/xfs # Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html