Re: [RFC] Ext3 online defrag
Hi Alex,

Thank you for your information. I have sent the patches for defragmentation
of an extent-based file on ext3, using your multi-block allocation patches.
I would be happy if you have time to review my patches.

  [RFC][PATCH 0/3] Extent base online defrag
  http://marc.theaimsgroup.com/?l=linux-ext4&m=116307062907075&w=2

And I'd like to start considering defragmentation for ext4. Do you have a
plan to update your patches for ext4?

> I've been reworking mballoc with a few new features:
>
> 1) in-core preallocation - like the existing reservation, but it can
>    preallocate a few pieces for a file
>
> 2) locality groups - to maintain groups of related files and flush them
>    together. Say two users are unpacking a kernel tree. With delayed
>    allocation we've got a bunch of files from both in the cache. Then we
>    flush the first set (a few MBs) of files from one user, then from the
>    other. This way the write I/Os will be large enough to achieve good
>    throughput, and the files are still localized well enough to be read
>    back later at a good rate.
>
> 3) scalable reservation - required for delayed allocation to avoid
>    -ENOSPC at flush time. The current version uses a per-sb spinlock.

Probably we could add something for defragmentation?

Cheers,
Takashi
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
Alex Tomas wrote:
> 3) scalable reservation - required for delayed allocation to avoid
>    -ENOSPC at flush time. The current version uses a per-sb spinlock.

Can you elaborate on this issue? Shouldn't delayed allocation decrement
free space immediately, with only the actual block location choice delayed?
Or is this due to potential extra metadata space required as blocks are
allocated?

Thanks,
-Eric
Re: [RFC] Ext3 online defrag
Eric Sandeen (ES) writes:
ES> Alex Tomas wrote:
ES> > 3) scalable reservation - required for delayed allocation to avoid
ES> >    -ENOSPC at flush time. The current version uses a per-sb spinlock.
ES>
ES> Can you elaborate on this issue? Shouldn't delayed allocation
ES> decrement free space immediately, and only the actual block location
ES> choice is delayed? Or is this due to potential extra metadata space
ES> required as blocks are allocated?

Exactly. In this case, reservation has nothing to do with allocation or
preallocation of real blocks. It is just a *per-sb counter* of blocks
reserved for allocation at flush time. It includes all not-yet-allocated
blocks and the metadata needed to allocate them (bitmaps, group
descriptors, extent tree blocks, etc.).

The previous version of mballoc has such a reservation, but it doesn't
scale very well, being a single global counter protected by a spinlock. In
many regular workloads I observed the reservation function in the top 30 of
the oprofile output.

thanks, Alex
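[Editorial sketch] The scaling problem Alex describes - one global counter behind a spinlock - is classically fixed by splitting the counter per CPU and only folding local deltas into the shared count in batches (the approach Linux's percpu_counter takes). The toy C below is illustrative only: all names, the batch size, and the single-threaded model are invented, and real code would use a lock and per-CPU data.

```c
#include <assert.h>
#include <string.h>

#define NR_CPUS 4
#define BATCH   32

struct resv_counter {
	long global;          /* in the kernel: protected by a spinlock */
	long local[NR_CPUS];  /* per-CPU deltas, touched lock-free */
};

void resv_init(struct resv_counter *rc, long free_blocks)
{
	memset(rc, 0, sizeof(*rc));
	rc->global = free_blocks;
}

/* Reserve 'nr' blocks on 'cpu'.  The shared counter (and its lock) is
 * only touched when the local delta exceeds BATCH, so most reservations
 * are cheap.  Returns 0 on success, -1 for -ENOSPC. */
int resv_reserve(struct resv_counter *rc, int cpu, long nr)
{
	rc->local[cpu] -= nr;
	if (rc->local[cpu] < -BATCH) {
		rc->global += rc->local[cpu];  /* lock would be taken here */
		rc->local[cpu] = 0;
		if (rc->global < 0) {          /* overcommitted: back out */
			rc->global += nr;
			return -1;
		}
	}
	return 0;
}

/* Approximate number of still-reservable blocks (global + local deltas). */
long resv_sum(const struct resv_counter *rc)
{
	long sum = rc->global;
	for (int i = 0; i < NR_CPUS; i++)
		sum += rc->local[i];
	return sum;
}
```

The trade-off is the usual one for split counters: reads of the total are approximate unless all locals are folded in, which is acceptable for a reservation heuristic but not for exact accounting.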
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 11:33:16PM -0400, Theodore Tso wrote:
> On Thu, Oct 26, 2006 at 11:40:20AM +1000, David Chinner wrote:
> > We don't need to expose anything filesystem specific to userspace to
> > implement this. Online data movement (i.e. the defrag mechanism)
> > becomes something like:
> >
> > 	do {
> > 		get_free_list(dst_fd, location, len, list)
> > 		/* select extent to use */
> > 		alloc_from_list(dst_fd, list[X], off, len)
> > 	} while (ENOALLOC)
> > 	move_data(src_fd, dst_fd, off, len);
> >
> > And this would work on any filesystem type that implemented these
> > interfaces. Hence tools like a startup file optimiser would only need
> > to be written once, rather than needing a different tool for every
> > different filesystem type.
>
> Yeah, but that's simply not enough.

Not enough for what?

> A good defragger needs to know

Oh, we're back to defrag again. :/

> about a filesystem's allocation policies, and move files so they are
> optimally located, given the filesystem layout. For example, in
> ext2/3/4 we will want to move blocks so they are in the same block
> group as the inode. That's filesystem specific information; other
> filesystems will require different policies.

Of which a good chunk of the policies will be common. The above policy has
been around for many, many years and is implemented in many, many
filesystems (even XFS).

	get_free_list(dst_fd, location, len, list)

location == allocation policy, e.g. "give me a list of free blocks":

	- anywhere (default filesystem policy applies)
	- near block number X
	- at block X
	- in block/allocation group Y
	- of the largest contiguous regions in (one of the above)
	- at least N blocks in length
	- near inode src_fd
	- in storage tier 3

Then you select one of the regions that was returned and attempt to
allocate from it. You can put whatever filesystem-specific logic you need
around this to arrive at the decision of where to put the file, but you've
still got to allocate the new blocks, move the data to them, and swap them
over. Every defragger needs to do this, regardless of the filesystem type.

So why not provide a framework for it, especially as the framework is
useful for far more than just the data movement part of a defrag
application?

Remember, I'm not just talking about defrag - I'm talking about an
interface that is actually useful to applications that might care about how
data is laid out on disk, whose writers don't know anything about how
filesystem X or Y or Z is implemented. Putting the burden of learning about
filesystem internals on application developers is not the correct solution.

> Unfortunately, if you want to do a good job, a defragger *has* to know
> about some very low-level filesystem specific information, if it wants
> to do a good job.

Back to defrag. Again. Bigger picture, guys, bigger picture.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
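[Editorial sketch] Dave's get_free_list()/alloc_from_list() loop can be made concrete with a toy model. Everything below is hypothetical - a 64-block in-memory bitmap stands in for real allocator state, and only a "minimum length" policy is implemented - but it shows the shape of the interface: enumerate candidate free extents under a policy, try to claim one, and retry on failure (which is how the ENOALLOC race is absorbed).

```c
#include <assert.h>
#include <string.h>

#define FS_BLOCKS 64

unsigned char used[FS_BLOCKS];          /* toy block bitmap, 1 = in use */

struct free_extent { int start, len; };

/* get_free_list(): report free runs at least 'min_len' blocks long.
 * A real implementation would also honour a locality policy ('goal')
 * and sort candidates by distance to it. */
int get_free_list(int goal, int min_len, struct free_extent *list, int max)
{
	int n = 0;
	(void)goal;
	for (int i = 0; i < FS_BLOCKS && n < max; ) {
		if (used[i]) { i++; continue; }
		int j = i;
		while (j < FS_BLOCKS && !used[j])
			j++;                 /* scan the free run */
		if (j - i >= min_len) {
			list[n].start = i;
			list[n].len = j - i;
			n++;
		}
		i = j;
	}
	return n;
}

/* alloc_from_list(): try to claim one candidate extent.  Fails with -1
 * if any block was allocated since get_free_list() ran -- the caller
 * simply loops and picks another candidate (the ENOALLOC retry). */
int alloc_from_list(const struct free_extent *ext)
{
	for (int i = 0; i < ext->len; i++)
		if (used[ext->start + i])
			return -1;           /* raced with another allocation */
	for (int i = 0; i < ext->len; i++)
		used[ext->start + i] = 1;
	return 0;
}
```

The key design point is that the race is resolved inside the allocate step, not by locking the free-space map across the enumerate/select/allocate sequence.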
Re: [RFC] Ext3 online defrag
On Oct 25, 2006 16:54 +0200, Jan Kara wrote:
> I've just not yet decided how to handle indirect blocks in case of
> relocation in the middle of the file. Should they be relocated or
> shouldn't they? Probably they should be relocated at least when they
> are fully contained in the relocated interval - or maybe better said,
> when all the blocks they reference are also in the interval (this also
> handles the case of EOF). But still, if you would like to relocate the
> file by parts, this is not quite what you want (you won't be able to
> relocate indirect blocks on the boundary of intervals) :(.

I suspect that the natural choice for metadata blocks is to keep the block
which has the most metadata unchanged. For example, if you are doing a
full-file relocation then you would naturally keep all of the new
{dt}indirect blocks. If you are relocating a small chunk of the file you
would keep the old {dt}indirect blocks and just copy a few block pointers
over.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
> > On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
> > So how do you then get the generic interface to allocate blocks
> > specified by userspace race free?
>
> As has been repeatedly stated, there is no "generic". There MUST be
> filesystem-specific knowledge during these operations.
>
> > What information? All we need to know is where the free disk space
> > is, and have a method to attempt to allocate from it. That's _easy_
> > to abstract into a common interface via the VFS.
>
> > > Further, in the case being discussed in this thread, ext2meta has
> > > already been proven a workable solution.
> >
> > Sure, but that's not a generic solution to a problem common to all
> > filesystems.
>
> > You clearly don't know what I'm talking about. ext2meta is an example
> > of a filesystem-specific metadata access method, applicable to tasks
> > such as online optimization.
>
> I know exactly what ext2meta is. I said it's not a generic solution and
> you say it's a filesystem specific solution. I think we're agreeing
> here. ;)
>
> We don't need to expose anything filesystem specific to userspace to
> implement this. Online data movement (i.e. the defrag mechanism)
> becomes something like:
>
> 	do {
> 		get_free_list(dst_fd, location, len, list)
> 		/* select extent to use */

Up to this point I can imagine we can be perfectly generic.

> 		alloc_from_list(dst_fd, list[X], off, len)
> 	} while (ENOALLOC)
> 	move_data(src_fd, dst_fd, off, len);

With these two it's not clear how well we can do with just a generic
interface. Every filesystem needs some additional metadata to keep the
list of data blocks. In case of ext2/ext3/reiserfs this is not a
negligible amount of space, and the placement of this metadata is
important for performance.

So either we focus only on data blocks and let the implementation of
alloc_from_list() allocate metadata wherever it wants (but then we get
suboptimal performance because there need not be space for indirect blocks
close to our provided extent), or we allocate metadata from the provided
list, but then we need some knowledge of the fs to know how much we should
expect to spend on metadata and where this metadata should be placed.

For example, if you know that the indirect block for your interval is at
block B, then you'd like to allocate somewhere close after this point, or
to relocate that indirect block (and all the data it references). But for
that you need to know you have something like indirect blocks = filesystem
knowledge.

So I think that to get this working, we also need some way to tell the
program that if it wants to allocate some data, it also needs to count on
this amount of metadata, and that some of it is already allocated in given
blocks...

> I see substantial benefit moving forward from having filesystem
> independent interfaces. Many features that filesystems implement are
> common, and as time goes on the common feature set of the different
> filesystems gets larger. So why shouldn't we be trying to make common
> operations generic so that every filesystem can benefit from the latest
> and greatest tool?

So you prefer to handle only the data blocks part of the problem and let
the filesystem sort out metadata?

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Re: [RFC] Ext3 online defrag
On Thu, Oct 26, 2006 at 04:36:48PM +1000, David Chinner wrote:
> Remember, I'm not just talking about defrag - I'm talking about an
> interface that is actually useful to apps that might care about how
> data is laid out on disk but the application writers don't know
> anything about how filesystem X or Y or Z is implemented. Putting the
> burden of learning about filesystem internals on application developers
> is not the correct solution.

If all you want is something for application developers, about all you can
do is to tell the filesystem, "create the file so that it will be quickly
accessed after accessing this file or this directory". I really don't see
the point of having the application specify block numbers if you're also
claiming the application isn't going to know anything about the filesystem
layout --- or even the RAID layout of the filesystem. I don't think it's
at **all** useful to be half-pregnant on this score.

						- Ted
Re: [RFC] Ext3 online defrag
On Thu, 2006-10-26 at 09:37 -0400, Theodore Tso wrote:
> On Thu, Oct 26, 2006 at 04:36:48PM +1000, David Chinner wrote:
> > Remember, I'm not just talking about defrag - I'm talking about an
> > interface that is actually useful to apps that might care about how
> > data is laid out on disk but the application writers don't know
> > anything about how filesystem X or Y or Z is implemented. Putting
> > the burden of learning about filesystem internals on application
> > developers is not the correct solution.
>
> If all you want is something for application developers, about all you
> can do is to tell the filesystem, "create the file so that it will be
> quickly accessed after accessing this file or this directory". I really
> don't see the point of having the application specify block numbers if
> you're also claiming the application isn't going to know anything about
> the filesystem layout --- or even the RAID layout of the filesystem. I
> don't think it's at **all** useful to be half-pregnant on this score.

I think a utility such as a defragmenter should know about the filesystem
layout. I also think that it would be a good thing to have a consistent
interface so that every filesystem isn't implementing a completely
different one.
-- 
David Kleikamp
IBM Linux Technology Center
Re: [RFC] Ext3 online defrag
On Wed, 25 October 2006 14:41:18 -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 08:36:56PM +0200, Jan Kara wrote:
> > Yes, but there's a question of the interface to this operation. How
> > do I specify which indirect block I mean? Obviously we could
> > introduce a separate call for remapping indirect blocks, but I find
> > this solution kind of clumsy...
>
> Agreed... that gets nasty real quick.

Logfs has a similar problem, and I introduced a level for it. Without going
into all the gory details, data blocks reside on level 0, indirect blocks
on level 1, doubly indirect blocks on level 2, etc. With this, the tuple
of (ino, pos, level) can specify any block on the filesystem, provided it
is used by some inode. Logfs needs this for garbage collection, which is a
fairly similar problem.

Jörn
-- 
Joern's library part 3:
http://inst.eecs.berkeley.edu/~cs152/fa05/handouts/clark-test.pdf
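[Editorial sketch] Jörn's (ino, pos, level) addressing works because a block at level N covers a fixed, aligned span of data blocks, so position plus level is unique. The geometry below is an assumption (4 KB blocks with 4-byte pointers give 1024 pointers per block), and the helper names are invented:

```c
#include <assert.h>

/* Assumed geometry: 4k blocks, 4-byte block pointers. */
#define PTRS_PER_BLOCK 1024

struct block_id {
	unsigned long ino;        /* owning inode */
	unsigned long long pos;   /* first data-block offset the block covers */
	int level;                /* 0 = data, 1 = indirect, 2 = double... */
};

/* How many data blocks one block at 'level' covers. */
unsigned long long blocks_covered(int level)
{
	unsigned long long n = 1;
	while (level-- > 0)
		n *= PTRS_PER_BLOCK;
	return n;
}

/* The canonical 'pos' of the level-'level' block covering data block
 * 'blkno': round down to the span boundary.  This is what makes the
 * (ino, pos, level) tuple a unique name for any block in the tree. */
unsigned long long covering_pos(unsigned long long blkno, int level)
{
	unsigned long long span = blocks_covered(level);
	return blkno - blkno % span;
}
```

For example, data block 5000 is covered by the indirect block whose canonical position is 4096, and by the doubly indirect block at position 0.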
Re: [RFC] Ext3 online defrag
On Thu, Oct 26, 2006 at 01:37:22PM +0200, Jan Kara wrote:
> On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
> > We don't need to expose anything filesystem specific to userspace to
> > implement this. Online data movement (i.e. the defrag mechanism)
> > becomes something like:
> >
> > 	do {
> > 		get_free_list(dst_fd, location, len, list)
> > 		/* select extent to use */
> > 		alloc_from_list(dst_fd, list[X], off, len)
> > 	} while (ENOALLOC)
> > 	move_data(src_fd, dst_fd, off, len);
>
> Up to this point I can imagine we can be perfectly generic.
>
> With these two it's not clear how well we can do with just a generic
> interface. Every filesystem needs some additional metadata to keep the
> list of data blocks. In case of ext2/ext3/reiserfs this is not a
> negligible amount of space, and the placement of this metadata is
> important for performance.

Yes, the same can be said for XFS. However, XFS's extent btree
implementation uses readahead to hide a lot of the latency involved with
reading the extent map, and it only needs to read it once per inode
lifecycle.

> So either we focus only on data blocks and let the implementation of
> alloc_from_list() allocate metadata wherever it wants (but then we get
> suboptimal performance because there need not be space for indirect
> blocks close to our provided extent)

I think the first step would be to focus on data blocks using something
like the above. There are many steps to full filesystem defragmentation,
but data fragmentation is typically the most common symptom of
fragmentation that we see.

> or we allocate metadata from the provided list, but then we need some
> knowledge of the fs to know how much we should expect to spend on
> metadata and where this metadata should be placed.

That's the second step, I think. For example, we could count the metadata
blocks used in a metadata structure (say a block list), allocate a new
chunk like above, and then execute a move_metadata() type of operation,
which the filesystem does internally in a transactionally safe manner.
Once again: generic interface, filesystem specific implementations.

> For example if you know that the indirect block for your interval is at
> block B, then you'd like to allocate somewhere close after this point,
> or to relocate that indirect block (and all the data it references).
> But for that you need to know you have something like indirect blocks
> = filesystem knowledge.

*nod*

This is far less of a problem with extent based filesystems - coalescing
all the fragments into a single extent removes the need for indirect
blocks, and you get the extent list for free when you read the inode. When
we do have a fragmented file, XFS uses readahead to speed btree searching
and reading, so it hides a lot of the latency overhead that fragmented
metadata can cause.

Either way, these lists can still be optimised by allocating a set of
contiguous blocks, copying the metadata into them and updating the
pointers to the new blocks. It can be done separately from the data
moving, and really should be done after the data has been defragmented.

> So I think that to get this working, we also need some way to tell the
> program that if it wants to allocate some data, it also needs to count
> on this amount of metadata, and that some of it is already allocated in
> given blocks...

If you want to do it all in one step. However, it's not quite that simple
for something like XFS. An allocation may require a btree split (or three,
actually), and the number of blocks required is dependent on the height of
the btrees. So we don't know how many blocks we'll need ahead of time, and
we'd have to reach deep into the allocator and abuse it badly to do
anything like this. It's not something I want to even contemplate
doing. :/

Also, we don't want to be mingling global metadata with inode-specific
metadata, so we don't want to put most of the new metadata blocks near the
extent we are putting the data into. That means I'd prefer to be able to
optimise metadata objects separately - e.g. rewrite a btree into a single
contiguous extent with the btree blocks laid out so the readahead patterns
result in sequential I/O. The kernel would need to do this in XFS, because
we'd have to lock the entire btree a block at a time, copy it, and then
issue a swap-btree transaction. Most other journalling filesystems will
have similar requirements, I think, for doing this online. That's a very
similar concept to the move_data() interface...

> > I see substantial benefit moving forward from having filesystem
> > independent interfaces. Many features that filesystems implement are
> > common, and as time goes on the common feature set of the different
> > filesystems gets larger. So why shouldn't we be trying to make common
> > operations generic so that every filesystem can benefit from the
> > latest and greatest tool?
>
> So you prefer to handle only the data blocks part of the problem and
> let the filesystem sort out metadata?

The filesystem
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
> On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> > On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:
> But it is a race that is _easily_ handled, and applications only need
> to implement one interface, not a different method for every filesystem
> that requires deep filesystem knowledge. Besides, you still have to
> handle the case where the block you want has already been allocated,
> because reading the metadata from userspace doesn't prevent the kernel
> from allocating the block you want before you ask for it... The race is
> easily handled either way, by having the block move fail when you tell
> the kernel the destination blocks. So why are you arguing that an
> interface is no good because it is fundamentally racy? ;)

My point was that it is silly to introduce obviously racy code into the
kernel, when -- inside the kernel -- it could be handled race-free. If you
accept a racy solution, you might as well do it outside the kernel, where
you get the same results, but without adding silliness and bloat to the
kernel.

Every major filesystem has a libfoofs library that makes it trivial to
read the metadata, so all you need to do is use an existing lib.

> IOWs, you are advocating that any application that wants to use this
> special allocation technique needs to link against every different
> filesystem library, and it then needs to implement filesystem specific
> searches through their metadata? Nobody in their right mind would ever
> want to use an interface like this.

Online defrag is OBVIOUSLY highly filesystem specific. You have to link
against filesystem specific code somewhere, whether it's inside the kernel
or outside the kernel.

Further, in the case being discussed in this thread, ext2meta has already
been proven a workable solution.

	Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
> > On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> > So why are you arguing that an interface is no good because it is
> > fundamentally racy? ;)
>
> My point was that it is silly to introduce obviously racy code into the
> kernel, when -- inside the kernel -- it could be handled race-free.

So how do you then get the generic interface to allocate blocks specified
by userspace race free?

> Every major filesystem has a libfoofs library that makes it trivial to
> read the metadata, so all you need to do is use an existing lib.

IOWs, you are advocating that any application that wants to use this
special allocation technique needs to link against every different
filesystem library, and it then needs to implement filesystem specific
searches through their metadata? Nobody in their right mind would ever
want to use an interface like this.

> Online defrag is OBVIOUSLY highly filesystem specific.

Parts of it are, but data movement and allocation hints need to be
provided by every filesystem that wants to implement this efficiently.
These features are also useful outside of defrag as well - I can think of
several applications that would benefit from being able to direct where in
the filesystem they want data to reside.

If userspace directed allocation requires deep knowledge of the filesystem
metadata (this is what you are saying they need to do, right?), then these
applications will never, ever make use of this interface and we'll
continue to have problems with them.

I guess my point is that we are going to implement features like this in
XFS, and if other filesystems are going to be doing the same thing then we
should try to come up with generic solutions rather than reinvent the
wheel over and over again.

> Further, in the case being discussed in this thread, ext2meta has
> already been proven a workable solution.

Sure, but that's not a generic solution to a problem common to all
filesystems.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC] Ext3 online defrag
On Oct 24, 2006 15:44 -0400, Theodore Tso wrote:
> > First of all, we would need a way of allowing userspace to specify
> > which blocks should be used in the preallocation. Presumably it could
> > do this in the same way it will be specifying which blocks to
> > relocate in the defragger - by passing an extent.
>
> You would be required to pass the file offset for which to preallocate,
> and optionally an extent for the on-disk allocation itself (if none is
> supplied the kernel will allocate the best extent it can).
>
> > Secondly, we would need a way of marking blocks as preallocated but
> > not pre-zeroed; otherwise we would have to zero out all of the blocks
> > in order to assure security (we don't want userspace programs seeing
> > the previous contents of the data blocks), only to do the copy and
> > the extents vector swap.
>
> This could be mitigated by having the preallocation be done (in the
> defragment case) against a temporary inode on the orphan list (as the
> initial patch did), so that if there is a crash it will be released.
> The temporary inode will not be linked into the namespace, so it cannot
> be read - only used to hold preallocation. If this was a write-only
> file handle then we should be OK?
>
> For defragger purposes this would need:
> - allocate a new temporary inode (VFS + fs; returns a write-only fh if
>   the fs can't properly handle uninitialized extents, or doesn't
>   request full-extent zeroing)
> for each extent to defragment {
>   - preallocate extents on the temp inode (fs specific internals)
>   - copy data from orig to temp at offset X (VFS; splice or e.g.
>     sys_copyfile(src, dst, offset, count), which Linus agreed to at
>     KS '05 for network filesystems)
>   - migrate the copied extent to the original inode (fs specific
>     internals)
> }
> - free the temporary inode (just a close of the temp fh; frees
>   unmigrated extents)

Yes, this sounds feasible. We could split the defrag ioctl into two pieces
(addition of a given extent to a file, and swapping of extents), which can
have a generic interface...

> I don't think this is much more work than implementing all of this
> functionality as part of a monolithic online defrag function, assuming
> we don't require full-file copies in order to do defrag.

Yes, it's not more work than supporting swapping of extents in the middle
of the file. I've just not yet decided how to handle indirect blocks in
case of relocation in the middle of the file. Should they be relocated or
shouldn't they? Probably they should be relocated at least when they are
fully contained in the relocated interval - or maybe better said, when all
the blocks they reference are also in the interval (this also handles the
case of EOF). But still, if you would like to relocate the file by parts,
this is not quite what you want (you won't be able to relocate indirect
blocks on the boundary of intervals) :(.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
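[Editorial sketch] The preallocate/copy/migrate flow discussed above hinges on the final step: swapping the extents of the temporary inode into the original. A toy model of that step, with invented types and no journalling (in a real filesystem the swap must be one atomic, journalled operation):

```c
#include <assert.h>

#define MAX_EXT 16

/* Invented types: one mapping entry per on-disk extent. */
struct extent { int lblk; int pblk; int len; };   /* logical, physical */
struct inode_map { struct extent ext[MAX_EXT]; int nr; };

/* The "migrate copied extent" step: swap the physical blocks backing the
 * same logical range of the original and temporary inodes.  The data was
 * already copied to the temp inode's (contiguous) blocks, so after the
 * swap the original file is defragmented and the temp inode holds the
 * old fragments, which are freed when it is closed. */
void swap_extents(struct inode_map *orig, struct inode_map *tmp)
{
	for (int i = 0; i < orig->nr && i < tmp->nr; i++) {
		int t = orig->ext[i].pblk;
		orig->ext[i].pblk = tmp->ext[i].pblk;
		tmp->ext[i].pblk = t;
	}
}

/* Fragment count: physically contiguous neighbours count as one run. */
int fragments(const struct inode_map *m)
{
	int frags = m->nr ? 1 : 0;
	for (int i = 1; i < m->nr; i++)
		if (m->ext[i].pblk != m->ext[i - 1].pblk + m->ext[i - 1].len)
			frags++;
	return frags;
}
```

The point of routing everything through a separate temp inode is visible even in the toy: the swap touches only block pointers, so a crash before it leaves the original file intact and the orphaned temp inode is simply reaped.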
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
> On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
> > On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
> > > On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> > > So why are you arguing that an interface is no good because it is
> > > fundamentally racy? ;)
> >
> > My point was that it is silly to introduce obviously racy code into
> > the kernel, when -- inside the kernel -- it could be handled
> > race-free.
>
> So how do you then get the generic interface to allocate blocks
> specified by userspace race free?

As has been repeatedly stated, there is no "generic". There MUST be
filesystem-specific knowledge during these operations.

> If userspace directed allocation requires deep knowledge of the
> filesystem metadata (this is what you are saying they need to do,
> right?), then these applications will never, ever make use of this
> interface and we'll continue to have problems with them.

Completely false assumptions. There is no difference in handling of
knowledge, be it kernel space or userspace.

> > Further, in the case being discussed in this thread, ext2meta has
> > already been proven a workable solution.
>
> Sure, but that's not a generic solution to a problem common to all
> filesystems.

You clearly don't know what I'm talking about. ext2meta is an example of a
filesystem-specific metadata access method, applicable to tasks such as
online optimization. Implement that tiny kernel module for each
filesystem, and you have everything you need, without races.

This was discussed years ago; review the mailing lists. Google for
'Alexander Viro' and 'ext2meta'.

	Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:
> Yes, this sounds feasible. We could split the defrag ioctl into two
> pieces (addition of a given extent to a file, and swapping of extents),
> which can have a generic interface...

An ioctl is UGLY. This was discussed years ago. Google for 'Alexander
Viro' and 'ext2meta'. That's a clean, flexible, extensible way to access
metadata online. No need for ioctl binary translation across 32-bit/64-bit
boundaries, or any other ioctl issue.

	Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:
> > Yes, this sounds feasible. We could split the defrag ioctl into two
> > pieces (addition of a given extent to a file, and swapping of
> > extents), which can have a generic interface...
>
> An ioctl is UGLY.

Agreed.

> This was discussed years ago. Google for 'Alexander Viro' and
> 'ext2meta'. That's a clean, flexible, extensible way to access metadata
> online. No need for ioctl binary translation across 32-bit/64-bit
> boundaries, or any other ioctl issue.

I've briefly looked at this, and this kind of interface has some appeal.
On the other hand it's not obvious to me how to implement, in this
interface, the *atomic* operation "copy data from file F to a given set of
blocks and rewrite pointers to the original blocks with pointers to the
new blocks". Something like this is needed for what we want to do...

Also, if we'd like to implement an operation like "add this block to file
F at position P", we have to make sure that all the necessary updates
(bitmap updates, inode updates, indirect block updates) go into one
transaction. Which basically means that either ext3meta has to have a way
to do this in a single operation, or we have to give userspace a way to
start/stop a transaction - and that starts to be a real mess because of
various deadlocks and so on.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 07:58:51PM +0200, Jan Kara wrote:
> > I've briefly looked at this, and this kind of interface has some
> > appeal. On the other hand it's not obvious to me how to implement, in
> > this interface, the *atomic* operation "copy data from file F to a
> > given set of blocks and rewrite pointers to the original blocks with
> > pointers to the new blocks". Something like this is needed for what
> > we want to do... Also, if we'd like to implement an operation like
> > "add this block to file F at position P", we have to make sure that
> > all the necessary updates (bitmap updates, inode updates, indirect
> > block updates) go into one transaction. Which basically means that
> > either ext3meta has to have a way to do this in a single operation,
> > or we have to give userspace a way to start/stop a transaction - and
> > that starts to be a real mess because of various deadlocks and so on.
>
> Agreed, these issues exist. But they exist independent of whether an
> ioctl or ext3meta is used. It's all the responsibility of the
> implementor to define the interface. My contention is that the ext3meta
> interface method would be much more robust than an ioctl. It's a
> namespace inside which you can define any inodes/dirents you wish, for
> the operations you desire.

I see. So you mean that in our ext3meta filesystem we'd have a file named
"add_this_extent_to_inode" and a file "reloc_inode_interval", and they'd
be fed essentially the same info as the current ioctl interface and do the
same thing as we currently do. Hmm, I don't find it that nice any more,
but yes, this would work.

> Heck, according to my sf.net/projects/gkernel CVS log, you offered some
> helpful review comments to me when I was implementing ext2meta ;-)

Looking at those mails, it was already quite some time ago, so I forgot
about it ;)

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 08:25:30PM +0200, Jan Kara wrote: I see. So you mean that in our ext3meta filesystem we'd have a file named add_this_extent_to_inode and a file reloc_inode_interval and they'd be fed essentially the same info as the current ioctl interface and do the same thing as we currently do. Hmm, I don't find it that nice any more but yes, this would work. It depends on the operation. ext2meta[1] works fine for online defrag, just exporting metadata objects and providing read(2) and write(2) operations on them. Adding 'trigger' files (like your add_this_extent_to_inode) may make sense for some operations, indeed, but we need to see the whole picture before really understanding whether that interface is optimal. Jeff [1] http://linux.yyz.us/misc/ext2meta.c
Re: [RFC] Ext3 online defrag
On Oct 23, 2006 18:03 +0200, Jan Kara wrote: Andreas Dilger wrote: I would in fact go so far as to allow only a single extent to be specified per call. This is to avoid the passing of any pointers as part of the interface (hello ioctl police :-), and it also makes the kernel code simpler. I don't think the syscall/ioctl overhead is significant compared to the journal and I/O overhead. ...it makes it kind of harder to tell where indirect blocks would go - and it would be impossible for the defragmenter to force some unusual placement of indirect blocks... It would be possible to specify indirect block relocation in the same manner as regular block relocation, I think: allocate a new block, copy the contents, flush the block from cache, fix up the reference (inode, dindirect), commit. Yes, but there's a question of the interface to this operation. How do I specify which indirect block I mean? Obviously we could introduce a separate call for remapping indirect blocks, but I find that solution kind of clumsy... Bye Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs
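The relocation sequence sketched above (allocate a new block, copy contents, fix up the reference, free the old block, commit) can be illustrated with a toy in-memory model. This is a sketch only, not ext3 code: the `disk` array, `alloc_block()` and `remap_indirect()` names are invented for the example, and the single journal transaction that would wrap the steps in a real filesystem is not modelled.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 16
#define PTRS_PER_BLOCK 4

/* Toy "disk": each block holds PTRS_PER_BLOCK block numbers. */
static int disk[NBLOCKS][PTRS_PER_BLOCK];
static int block_used[NBLOCKS];

/* Stand-in for the filesystem block allocator: first-fit. */
static int alloc_block(void)
{
    for (int b = 0; b < NBLOCKS; b++)
        if (!block_used[b]) {
            block_used[b] = 1;
            return b;
        }
    return -1;
}

/* Relocate an indirect block following the steps from the mail:
 * allocate a new block, copy the contents, fix up the parent's
 * reference, free the old block.  In ext3 all of this would be
 * wrapped in one journal transaction; here it is just sequential. */
static int remap_indirect(int *parent_slot)
{
    int oldb = *parent_slot;
    int newb = alloc_block();
    if (newb < 0)
        return -1;
    memcpy(disk[newb], disk[oldb], sizeof disk[oldb]); /* copy contents    */
    *parent_slot = newb;                               /* fix up reference */
    block_used[oldb] = 0;                              /* free old block   */
    return newb;
}
```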
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 08:36:56PM +0200, Jan Kara wrote: Yes, but there's a question of the interface to this operation. How to specify which indirect block I mean? Obviously we could introduce separate call for remapping indirect blocks but I find this solution kind of clumsy... Agreed... that gets nasty real quick. Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote: On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote: On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote: On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote: On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote: So why are you arguing that an interface is no good because it is fundamentally racy? ;) My point was that it is silly to introduce obviously racy code into the kernel, when -- inside the kernel -- it could be handled race-free. So how do you then get the generic interface to allocate blocks specified by userspace race-free? As has been repeatedly stated, there is no generic. There MUST be filesystem-specific knowledge during these operations. What information? All we need to know is where the free disk space is, and have a method to attempt to allocate from it. That's _easy_ to abstract into a common interface via the VFS. Further, in the case being discussed in this thread, ext2meta has already been proven a workable solution. Sure, but that's not a generic solution to a problem common to all filesystems. You clearly don't know what I'm talking about. ext2meta is an example of a filesystem-specific metadata access method, applicable to tasks such as online optimization. I know exactly what ext2meta is. I said it's not a generic solution and you say it's a filesystem-specific solution. I think we're agreeing here. ;) We don't need to expose anything filesystem-specific to userspace to implement this. Online data movement (i.e. the defrag mechanism) becomes something like:

	do {
		get_free_list(dst_fd, location, len, list)
		/* select extent to use */
		alloc_from_list(dst_fd, list[X], off, len)
	} while (ENOALLOC)
	move_data(src_fd, dst_fd, off, len);

And this would work on any filesystem type that implemented these interfaces.
Hence tools like a startup file optimiser would only need to be written once, rather than needing a different tool for every different filesystem type. Remember, I'm not just talking about defrag - I'm talking about an interface that is actually useful to apps that might care about how data is laid out on disk, where the application writers don't know anything about how filesystem X or Y or Z is implemented. Putting the burden of learning about filesystem internals on application developers is not the correct solution. I see substantial benefit moving forward from having filesystem-independent interfaces. Many features that filesystems implement are common, and as time goes on the common feature set of the different filesystems gets larger. So why shouldn't we be trying to make common operations generic so that every filesystem can benefit from the latest and greatest tool? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
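The get_free_list()/alloc_from_list() calls in David's loop are hypothetical VFS interfaces that do not exist in any kernel; a userspace simulation of the proposed allocate-and-retry pattern, with invented names and a toy block bitmap standing in for the filesystem, might look like:

```c
#include <assert.h>
#include <stddef.h>

#define FS_BLOCKS 64

static unsigned char used[FS_BLOCKS];   /* toy block bitmap */

/* get_free_list() stand-in: report the start of the first free run of
 * at least len blocks, or -1 if none exists. */
static int get_free_extent(int len)
{
    int run = 0;
    for (int b = 0; b < FS_BLOCKS; b++) {
        run = used[b] ? 0 : run + 1;
        if (run == len)
            return b - len + 1;
    }
    return -1;
}

/* alloc_from_list() stand-in: try to claim [start, start+len); fails
 * if any block was taken in the meantime -- the race discussed in the
 * thread, handled by simply retrying. */
static int alloc_extent(int start, int len)
{
    for (int b = start; b < start + len; b++)
        if (used[b])
            return -1;          /* lost the race: caller retries */
    for (int b = start; b < start + len; b++)
        used[b] = 1;
    return 0;
}

/* The do/while retry loop from the mail. */
static int defrag_alloc(int len)
{
    int start;
    do {
        start = get_free_extent(len);
        if (start < 0)
            return -1;          /* genuinely out of space */
    } while (alloc_extent(start, len) != 0);
    return start;               /* move_data(src, dst, ...) would follow */
}
```

In this single-threaded sketch the retry never actually fires; in the kernel proposal, a concurrent allocation between the two calls is what makes the loop necessary.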
Re: [RFC] Ext3 online defrag
On Thu, Oct 26, 2006 at 11:40:20AM +1000, David Chinner wrote: We don't need to expose anything filesystem-specific to userspace to implement this. Online data movement (i.e. the defrag mechanism) becomes something like: do { get_free_list(dst_fd, location, len, list) /* select extent to use */ alloc_from_list(dst_fd, list[X], off, len) } while (ENOALLOC) move_data(src_fd, dst_fd, off, len); And this would work on any filesystem type that implemented these interfaces. Hence tools like a startup file optimiser would only need to be written once, rather than needing a different tool for every different filesystem type. Yeah, but that's simply not enough. A good defragger needs to know about a filesystem's allocation policies, and move files so they are optimally located, given the filesystem layout. For example, in ext2/3/4 we will want to move blocks so they are in the same block group as the inode. That's filesystem-specific information; other filesystems will require different policies. Remember, I'm not just talking about defrag - I'm talking about an interface that is actually useful to apps that might care about how data is laid out on disk, where the application writers don't know anything about how filesystem X or Y or Z is implemented. Putting the burden of learning about filesystem internals on application developers is not the correct solution. Unfortunately, a defragger *has* to know about some very low-level filesystem-specific information if it wants to do a good job. - Ted
Re: [RFC] Ext3 online defrag
On Tue, Oct 24, 2006 at 12:14:33AM -0400, Jeff Garzik wrote: On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote: isn't that a kernel responsibility to find/allocate target blocks? wouldn't it be better to specify a desirable target group and a minimal acceptable chunk of free blocks? The kernel doesn't have enough knowledge to know whether or not the defragger prefers one blkdev location over another. When you are trying to consolidate blocks, you must specify the destination as well as the source blocks. Certainly, to prevent corruption and other nastiness, you must fail if the destination isn't available... That's the wrong way to look at it. If you want the userspace process to specify a location, then you should preallocate it first before doing anything else. There is no need to clutter a simple data mover interface with all sorts of unnecessary error handling. Once you've separated the destination allocation from the data mover, the mover is basically a splice copy from source to destination, an fsync, and then an atomic swap blocks/extents operation. Most of this code is generic, and a per-fs swap-extents vector could easily be provided for the one bit that is not. The allocation interface, OTOH, is anything but simple and is really a filesystem-specific interface. Seems logical to me to separate the two. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
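The generic half of the mover David describes (preallocate the destination, copy, fsync) can be sketched in userspace. This is an approximation only: plain pread()/pwrite() stand in for splice(), and the final filesystem-specific atomic swap-extents step -- the one per-fs vector -- is deliberately not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Copy len bytes from src to a freshly preallocated dst and fsync it.
 * A real implementation would splice-copy and then invoke the
 * filesystem's atomic swap blocks/extents operation (not shown). */
static int move_data(int src, int dst, off_t len)
{
    if (posix_fallocate(dst, 0, len) != 0)   /* destination allocation */
        return -1;
    char buf[4096];
    off_t done = 0;
    while (done < len) {
        ssize_t n = pread(src, buf, sizeof buf, done);
        if (n <= 0)
            return -1;
        if (pwrite(dst, buf, (size_t)n, done) != n)
            return -1;
        done += n;
    }
    return fsync(dst);   /* data must be stable before the extent swap */
}
```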
Re: [RFC] Ext3 online defrag
David Chinner wrote: The allocation interface, OTOH, is anything but simple and is really a filesystem specific interface. Seems logical to me to separate the two. And ext[234] preallocation would be a very nice feature in its own right. -Eric
Re: [RFC] Ext3 online defrag
On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote: On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote: On Tue, Oct 24, 2006 at 12:14:33AM -0400, Jeff Garzik wrote: On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote: isn't that a kernel responsibility to find/allocate target blocks? wouldn't it be better to specify a desirable target group and a minimal acceptable chunk of free blocks? The kernel doesn't have enough knowledge to know whether or not the defragger prefers one blkdev location over another. When you are trying to consolidate blocks, you must specify the destination as well as source blocks. Certainly, to prevent corruption and other nastiness, you must fail if the destination isn't available... That's the wrong way to look at it. If you want the userspace process to specify a location, then you should preallocate it first before doing anything else. There is no need to clutter a simple data mover interface with all sorts of unnecessary error handling. You are implying that the 2-step interface, creating a new inode then swapping the contents, is the only way to implement this. No, it's not the only way to implement it, but it seems the cleanest way to me when you have to consider crash recovery. With a temporary inode, you can create it, hold a reference and then unlink it, so that any crash at that point will free the inode and any extents it has on it. The only way I can see anything different working is having the filesystem hold extents somewhere internally that provides us the same recovery guarantees while we copy the data and insert the new extents. This is obviously a filesystem-specific solution and is more complex to implement than a swap extent transaction. It probably also needs on-disk format changes to support properly. Once you've separated the destination allocation from the data mover, the mover is basically a splice copy from source to destination, an fsync and then an atomic swap blocks/extents operation.
Most of this code is generic, and a per-fs swap-extents vector could easily be provided for the one bit that is not. The benefit of having such a simple data mover is negated by moving the complexity into the allocator. What complexity does it introduce that the allocator doesn't already have or need to provide for the single-call interface to work? A single interface that would move a part of a file at a time has the advantage that a large file which is only fragmented in a few areas does not need to be completely moved. And the two-step process can do exactly this as well - splice can work on any offset within the file... The allocation interface, OTOH, is anything but simple and is really a filesystem-specific interface. Seems logical to me to separate the two. So what then is the benefit of having a simple generic data mover if every filesystem needs to implement its own interface to allocate a copy of the data? I assume you meant allocate the space to store the copy of the data. The allocation interface needs to be able to be extended independently of the data mover interface. XFS already exposes allocation ioctls to userspace for preallocation, and we've got plans to extend this further to allow userspace-controlled allocation for smart defrag tools for XFS. Tying allocation to the data mover just makes the interface less flexible and harder to do anything smart with. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [RFC] Ext3 online defrag
On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote: On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote: On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote: That's the wrong way to look at it. If you want the userspace process to specify a location, then you should preallocate it first before doing anything else. There is no need to clutter a simple data mover interface with all sorts of unnecessary error handling. You are implying that the 2-step interface, creating a new inode then swapping the contents, is the only way to implement this. No, it's not the only way to implement it, but it seems the cleanest way to me when you have to consider crash recovery. With a temporary inode, you can create it, hold a reference and then unlink it, so that any crash at that point will free the inode and any extents it has on it. The only way I can see anything different working is having the filesystem hold extents somewhere internally that provides us the same recovery guarantees while we copy the data and insert the new extents. This is obviously a filesystem-specific solution and is more complex to implement than a swap extent transaction. It probably also needs on-disk format changes to support properly. This is definitely filesystem-dependent. I would think allocating an extent would be like any other allocation done by the filesystem, and there are already recovery mechanisms for that. Once you've separated the destination allocation from the data mover, the mover is basically a splice copy from source to destination, an fsync and then an atomic swap blocks/extents operation. Most of this code is generic, and a per-fs swap-extents vector could easily be provided for the one bit that is not. The benefit of having such a simple data mover is negated by moving the complexity into the allocator. What complexity does it introduce that the allocator doesn't already have or need to provide for the single-call interface to work?
I don't see it as any more or less complex than a single interface. A single interface that would move a part of a file at a time has the advantage that a large file which is only fragmented in a few areas does not need to be completely moved. And the two-step process can do exactly this as well - splice can work on any offset within the file... I wasn't aware of that. That makes your proposal sound a lot better. The allocation interface, OTOH, is anything but simple and is really a filesystem-specific interface. Seems logical to me to separate the two. So what then is the benefit of having a simple generic data mover if every filesystem needs to implement its own interface to allocate a copy of the data? I assume you meant allocate the space to store the copy of the data. Yeah. The allocation interface needs to be able to be extended independently of the data mover interface. XFS already exposes allocation ioctls to userspace for preallocation, and we've got plans to extend this further to allow userspace-controlled allocation for smart defrag tools for XFS. Tying allocation to the data mover just makes the interface less flexible and harder to do anything smart with. Okay. It would be nice to standardize the interface so we don't have every filesystem introducing new ioctls. Cheers, Dave. -- David Kleikamp IBM Linux Technology Center
Re: [RFC] Ext3 online defrag
On Tue, Oct 24, 2006 at 11:59:28PM +1000, David Chinner wrote: That's the wrong way to look at it. If you want the userspace process to specify a location, then you should preallocate it first before doing anything else. There is no need to clutter a simple data mover interface with all sorts of unnecessary error handling. This is doable, but it adds a huge amount of complexity before we could implement on-line defragmentation. First of all, we would need a way of allowing userspace to specify which blocks should be used in the preallocation. Secondly, we would need a way of marking blocks as preallocated but not pre-zeroed; otherwise we would have to zero out all of the blocks in order to assure security (we don't want userspace programs seeing the previous contents of the data blocks), only to then do the copy and the extents vector swap. That's a huge amount of work, and while the above two features can be useful for other things, it's not clear it's worth it to require this as the only way to implement on-line defragging. You're right that it's a way of making things more generic, but it means that each filesystem needs a huge amount of additional complexity and potential filesystem format changes before it could take advantage of this general framework. (For example, you'd never be able to do this with the FAT filesystem, or the ext2 or ext3 filesystems; it would work for ext4 only *after* we implement the above-mentioned new features and the associated filesystem format changes.) Regards, - Ted
Re: [RFC] Ext3 online defrag
On Tue, 2006-10-24 at 15:44 -0400, Theodore Tso wrote: On Tue, Oct 24, 2006 at 11:59:28PM +1000, David Chinner wrote: That's the wrong way to look at it. If you want the userspace process to specify a location, then you should preallocate it first before doing anything else. There is no need to clutter a simple data mover interface with all sorts of unnecessary error handling. This is doable, but it adds a huge amount of complexity before we could implement on-line defragmentation. First of all, we would need a way of allowing userspace to specify which blocks should be used in the preallocation. Secondly, we would need a way of marking blocks as preallocated but not pre-zeroed; otherwise we would have to zero out all of the blocks in order to assure security (don't want userspace programs seeing the previous contents of the data blocks), only to do the copy and the extents vector swap. Chris Mason's page placeholder work for direct I/O should be applicable to any pre-allocations? -- Russell Cattelan [EMAIL PROTECTED]
Re: [RFC] Ext3 online defrag
On Tue, Oct 24, 2006 at 11:26:26AM -0500, Dave Kleikamp wrote: On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote: On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote: On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote: That's the wrong way to look at it. If you want the userspace process to specify a location, then you should preallocate it first before doing anything else. There is no need to clutter a simple data mover interface with all sorts of unnecessary error handling. You are implying that the 2-step interface, creating a new inode then swapping the contents, is the only way to implement this. No, it's not the only way to implement it, but it seems the cleanest way to me when you have to consider crash recovery. With a temporary inode, you can create it, hold a reference and then unlink it, so that any crash at that point will free the inode and any extents it has on it. The only way I can see anything different working is having the filesystem hold extents somewhere internally that provides us the same recovery guarantees while we copy the data and insert the new extents. This is obviously a filesystem-specific solution and is more complex to implement than a swap extent transaction. It probably also needs on-disk format changes to support properly. This is definitely filesystem-dependent. I would think allocating an extent would be like any other allocation done by the filesystem, and there are already recovery mechanisms for that. Yes, the allocation would be the same, but that isn't the problem I was talking about. The problem is holding a reference to the extent once it has been allocated, while it is having the data copied into it (i.e. before it is swapped with the original extents), and then holding the original extents until they are freed. These references need to be persistent so they can be freed correctly during crash recovery, i.e.
roll back the allocation if the extent swap has not been logged, or free the original blocks if the extent swap has been logged. The obvious way to do this is to use an unlinked (orphan) inode. Once you've separated the destination allocation from the data mover, the mover is basically a splice copy from source to destination, an fsync and then an atomic swap blocks/extents operation. Most of this code is generic, and a per-fs swap-extents vector could easily be provided for the one bit that is not. The benefit of having such a simple data mover is negated by moving the complexity into the allocator. What complexity does it introduce that the allocator doesn't already have or need to provide for the single-call interface to work? I don't see it as any more or less complex than a single interface. Ok, I thought I was missing something there. The allocation interface needs to be able to be extended independently of the data mover interface. XFS already exposes allocation ioctls to userspace for preallocation, and we've got plans to extend this further to allow userspace-controlled allocation for smart defrag tools for XFS. Tying allocation to the data mover just makes the interface less flexible and harder to do anything smart with. Okay. It would be nice to standardize the interface so we don't have every filesystem introducing new ioctls. Well, that will be an interesting challenge. I'm sure that there is a common subset that all filesystems can implement, e.g. per-file preallocation (something like XFS's allocate/reserve/free space ioctls) to provide kernel support for posix_fallocate(), etc.
However, we may end up exposing enough of XFS's current allocation semantics to do things like telling the filesystem to allocate in allocation group 6, near block number 0x32482 within the AG, falling back to searching for the nearest match to the size requirement, failing that looking for something larger than the minimum size specified, and then failing if you can't find a match in that AG. That makes little sense to any filesystem but XFS, which is really why I think that the smarter allocation interfaces are going to remain filesystem-specific. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
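The unlinked (orphan) inode trick discussed above has a direct userspace analogue: create a file, then unlink it immediately while keeping the fd open. A minimal sketch, with the helper name invented for the example:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a temporary file and immediately unlink it.  The open fd is
 * then the only reference to the inode, so process death frees the
 * inode and every block allocated to it -- the crash-recovery
 * property wanted for a defrag destination inode.  ext3's on-disk
 * orphan inode list provides the same guarantee across a crash. */
static int open_orphan(const char *dir)
{
    char path[256];
    snprintf(path, sizeof path, "%s/orphan_XXXXXX", dir);
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);   /* no name left: only our fd keeps the inode alive */
    return fd;
}
```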
Re: [RFC] Ext3 online defrag
On Tue, Oct 24, 2006 at 03:44:16PM -0400, Theodore Tso wrote: On Tue, Oct 24, 2006 at 11:59:28PM +1000, David Chinner wrote: That's the wrong way to look at it. If you want the userspace process to specify a location, then you should preallocate it first before doing anything else. There is no need to clutter a simple data mover interface with all sorts of unnecessary error handling. This is doable, but it adds a huge amount of complexity before we could implement on-line defragmentation. First of all, we would need a way of allowing userspace to specify which blocks should be used in the preallocation. Not initially. Create a file, and call posix_fallocate() on it. Later, the filesystem can provide something that the defrag tool can use for fine-grained control of where the preallocated blocks are on disk. Secondly, we would need a way of marking blocks as preallocated but not pre-zeroed; otherwise we would have to zero out all of the blocks in order to assure security (don't want userspace programs seeing the previous contents of the data blocks), only to do the copy and the extents vector swap. The unlinked inode method avoids this problem because no userspace process can see the inode to open it. Also, posix_fallocate() zeroes the disk blocks, so even this protects against data exposure. So, now all that remains for an initial implementation is the swap extents transaction and the data mover syscall. For a smart, fast implementation, I agree that you need unwritten extents (which XFS already has), then a fast filesystem implementation of posix_fallocate() that utilises unwritten extents (which XFS already has), and finally another interface that allows you to allocate unwritten extents in an arbitrary location within the filesystem (which no filesystem currently has). That's a huge amount of work, and while the above two features can be useful for other things, it's not clear it's worth it to require this as the only way to implement on-line defragging.
You're right that it's a way of making things more generic, but it means that each filesystem needs to have a huge amount of additional complexity and potential filesystem format changes before they could take advantage of this general framework. I disagree - it's not a huge amount of work to get something working and to solidify the generic interfaces, and the only format change is a new transaction. Any filesystem that supports the swap extent/blocks method would then work better than XFS's current online defrag tool, which currently does not use preallocation, nor does it use splice. (For example, you'd never be able to do this with the FAT filesystem, or the ext2 or ext3 filesystems; it would work for ext4 only *after* we implement the above mentioned new features and the associated filesystem format changes.) Sure, but they can use the slow, unoptimised posix_fallocate() method for allocating disk space. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
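The slow-but-portable posix_fallocate() fallback mentioned above is easy to sketch, and the sketch also demonstrates the zero-fill guarantee behind Ted's data-exposure concern: whether the filesystem implements the call with unwritten extents (fast, as in XFS) or by writing zeroes (slow), the reserved range must read back as zeroes. The `prealloc` helper name is invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Reserve len bytes on fd with posix_fallocate() and report the
 * resulting file size.  Fast or slow depends on the filesystem, but
 * either way the region must read back as zeroes. */
static off_t prealloc(int fd, off_t len)
{
    if (posix_fallocate(fd, 0, len) != 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return st.st_size;   /* file size now covers the reserved range */
}
```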
RE: [RFC] Ext3 online defrag
On Wed, 25 Oct 2006 11:19 AM, David Chinner wrote: On Tue, Oct 24, 2006 at 11:26:26AM -0500, Dave Kleikamp wrote: On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote: On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote: The allocation interface needs to be able to be extended independently of the data mover interface. XFS already exposes allocation ioctls to userspace for preallocation, and we've got plans to extend this further to allow userspace-controlled allocation for smart defrag tools for XFS. Tying allocation to the data mover just makes the interface less flexible and harder to do anything smart with. Okay. It would be nice to standardize the interface so we don't have every filesystem introducing new ioctls. Well, that will be an interesting challenge. I'm sure that there is a common subset that all filesystems can implement, e.g. per-file preallocation (something like XFS's allocate/reserve/free space ioctls) to provide kernel support for posix_fallocate(), etc. However, we may end up exposing enough of XFS's current allocation semantics to do things like telling the filesystem to allocate in allocation group 6, near block number 0x32482 within the AG, falling back to searching for the nearest match to the size requirement, failing that looking for something larger than the minimum size specified, and then failing if you can't find a match in that AG. That makes little sense to any filesystem but XFS, which is really why I think that the smarter allocation interfaces are going to remain filesystem-specific. Could we have a more abstract method for asking the filesystem where the free blocks are, and then using the same block addressing to tell the fs where to allocate/move the file's data to?
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 12:30:02PM +1000, Barry Naujok wrote: Could we have a more abstract method for asking the filesystem where the free blocks are and then using the same block addressing to tell the fs where to allocate/move the file's data to? That's fundamentally racy, so you might as well just read the filesystem metadata from userspace. No need to go through the kernel for that. Jeff
Re: [RFC] Ext3 online defrag
On Tue, Oct 24, 2006 at 10:42:57PM -0400, Jeff Garzik wrote: On Wed, Oct 25, 2006 at 12:30:02PM +1000, Barry Naujok wrote: Could we have a more abstract method for asking the filesystem where the free blocks are and then using the same block addressing to tell the fs where to allocate/move the file's data to? That's fundamentally racy, so you might as well just read the filesystem metadata from userspace. No need to go through the kernel for that. But it's a race that is _easily_ handled, and applications only need to implement one interface, not a different method for every filesystem that requires deep filesystem knowledge. Besides, you still have to handle the case where the block you want has already been allocated, because reading the metadata from userspace doesn't prevent the kernel from allocating the block you want before you ask for it... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote: But it's a race that is _easily_ handled, and applications only need to implement one interface, not a different method for every filesystem that requires deep filesystem knowledge. Besides, you still have to handle the case where the block you want has already been allocated, because reading the metadata from userspace doesn't prevent the kernel from allocating the block you want before you ask for it... The race is easily handled either way, by having the block move fail when you tell the kernel the destination blocks. The difference is that you don't unnecessarily bloat the kernel. Every major filesystem has a libfoofs library that makes it trivial to read the metadata, so all you need to do is use an existing lib. Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:
> > But it's a race that is _easily_ handled, and applications only
> > need to implement one interface, not a different method for every
> > filesystem that requires deep filesystem knowledge.
>
> The race is easily handled either way, by having the block move fail
> when you tell the kernel the destination blocks.

So why are you arguing that an interface is no good because it is
fundamentally racy? ;)

> The difference is that you don't unnecessarily bloat the kernel.

By that argument, we should rip out the bmap interface (FIBMAP),
because you can get all that information by reading the metadata from
userspace.

> Every major filesystem has a libfoofs library that makes it trivial
> to read the metadata, so all you need to do is use an existing lib.

IOWs, you are advocating that any application that wants to use this
special allocation technique needs to link against every different
filesystem library, and then needs to implement filesystem-specific
searches through their metadata? Nobody in their right mind would ever
want to use an interface like this.

Also, this simply doesn't work for XFS, because the cached metadata is
in a different address space from the block device. Hence it can be
tens of seconds between the kernel modifying a metadata buffer and
userspace being able to see that modification. You need to freeze the
filesystem for the XFS userspace tools to guarantee a consistent view
of an online filesystem from the block device.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC] Ext3 online defrag
On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
> Hello,
>
> I've written a simple patch implementing an ext3 ioctl for file
> relocation. Basically you call the ioctl on a file, give it a list of
> blocks, and it relocates the file into the given blocks (provided they
> are still free). The idea is to use it as the kernel part of an ext3
> online defragmenter (or, more generally, a disk access optimizer). I
> don't yet have the userspace part that finds larger runs of free
> blocks and so on, so it can't really be used as a defragmenter yet. I
> just send this as a kind of proof-of-concept to hear some comments.
> Attached is also a simple program that demonstrates the use of the
> ioctl.

As a suggestion, I would pass the inode number and inode generation
number into the ext3_file_move_data array:

	struct ext3_file_move_data {
		int extents;
		struct ext3_reloc_extent __user *ext_array;
	};

This will be much more efficient for the userspace relocator, since it
won't need to translate from an inode number to a pathname, and then
try to open the file before relocating it. I'd also use an explicit
64-bit block number type so that we don't have to worry about the ABI
changing when we support 64-bit block numbers.

The other problem I see with this patch is that there will be cache
coherency problems between the buffer cache and the page cache. I
think you will want to pull the data blocks of the file into the page
cache, then write them out from the page cache, and only *then* update
the indirect blocks and commit the transaction. So what needs to
happen is the following:

1) Validate the inode and generation number. Make sure the new
   (destination) blocks passed in are valid and not in use. Allocate
   them to prevent anyone else from using those blocks.

2) Pull the blocks into the page cache (if they are not already
   there), and then write them out to the new location on disk. If any
   of the I/Os fail, abort.

3) Update the indirect blocks or extent tree to point at the newly
   allocated and copied data blocks.

In the current patch, it looks like you add the inode being relocated
to the orphan list, and then update the direct/indirect blocks first
--- and if you fail, the inode gets truncated. That's bad, since we
don't want to lose any data if we crash in the middle of the defrag
operation.

Great to see that you're working on this problem! I'd love to see this
functionality in ext4.

Regards,

						- Ted

P.S. There is also the question of whether we'll be able to get this
interface past the ioctl() police, but the atomicity requirements of
such an interface are a poster child for why we really, REALLY, can't
do this via a sysfs interface. We might be forced to create a new
filesystem, or create a pseudo inode which we open via a magic
pathname, though. That in my opinion is uglier than an ioctl, but the
ioctl police really don't like the problem of needing to maintain
32/64-bit translation functions, and this interface would surely cause
problems for the x86_64 and PPC platforms, since they have to support
32-bit and 64-bit system ABIs.
Re: [RFC] Ext3 online defrag
> > Hello,
> >
> > I've written a simple patch implementing an ext3 ioctl for file
> > relocation. Basically you call the ioctl on a file, give it a list
> > of blocks, and it relocates the file into the given blocks (provided
> > they are still free). The idea is to use it as the kernel part of an
> > ext3 online defragmenter (or, more generally, a disk access
> > optimizer). I don't yet have the userspace part that finds larger
> > runs of free blocks and so on, so it can't really be used as a
> > defragmenter yet. I just send this as a kind of proof-of-concept to
> > hear some comments. Attached is also a simple program that
> > demonstrates the use of the ioctl.
>
> As a suggestion, I would pass the inode number and inode generation
> number into the ext3_file_move_data array:
>
> 	struct ext3_file_move_data {
> 		int extents;
> 		struct ext3_reloc_extent __user *ext_array;
> 	};
>
> This will be much more efficient for the userspace relocator, since it
> won't need to translate from an inode number to a pathname, and then
> try to open the file before relocating it.

Hmm, I was also thinking about that. Probably you're right. It just
seemed elegant to call the ioctl on a file and *plop*, it's relocated ;).

> I'd also use an explicit 64-bit block number type so that we don't
> have to worry about the ABI changing when we support 64-bit block
> numbers.

Right, will fix.

> The other problem I see with this patch is that there will be cache
> coherency problems between the buffer cache and the page cache. I
> think you will want to pull the data blocks of the file into the page
> cache, then write them out from the page cache, and only *then* update
> the indirect blocks and commit the transaction.

Hmm, I thought I got this right. We build a new tree and copy all the
data to it (no writes happen, so the trees remain consistent), then we
switch the block pointers in the inode. So from then on, any
get_block() will correctly return the new block number and the block
will be read from disk (hmm, probably I'm missing a sync after writing
out all the data). Now we call invalidate_inode_pages2() so all buffers
mapped to the old blocks are freed from memory. So there should not be
problems with this... OTOH, doing the data copy via the page cache (of
the temporarily set-up inode) should not be a big problem either, and
we can avoid one sync, which should be a win.

> So what needs to happen is the following:
>
> 1) Validate the inode and generation number. Make sure the new
>    (destination) blocks passed in are valid and not in use. Allocate
>    them to prevent anyone else from using those blocks.
>
> 2) Pull the blocks into the page cache (if they are not already
>    there), and then write them out to the new location on disk. If
>    any of the I/Os fail, abort.
>
> 3) Update the indirect blocks or extent tree to point at the newly
>    allocated and copied data blocks.
>
> In the current patch, it looks like you add the inode being relocated
> to the orphan list, and then update the direct/indirect blocks first

No, I create a temporary inode that holds the allocated blocks, and
that is what is added to the orphan list. Hence if we crash in the
middle of relocation, all blocks are correctly freed.

> --- and if you fail, the inode gets truncated. That's bad, since we
> don't want to lose any data if we crash in the middle of the defrag
> operation.
>
> Great to see that you're working on this problem! I'd love to see
> this functionality in ext4.

Thanks for the comments.

> P.S. There is also the question of whether we'll be able to get this
> interface past the ioctl() police, but the atomicity requirements of
> such an interface are a poster child for why we really, REALLY, can't
> do this via a sysfs interface. We might be forced to create a new
> filesystem, or create a pseudo inode which we open via a magic
> pathname, though. That in my opinion is uglier than an ioctl, but the
> ioctl police really don't like the problem of needing to maintain
> 32/64-bit translation functions, and this interface would surely
> cause problems for the x86_64 and PPC platforms, since they have to
> support 32-bit and 64-bit system ABIs.

Umm, yes. I'm open to suggestions with respect to which interface to
choose. ioctl() was just the easiest to code ;).

								Bye
								Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [RFC] Ext3 online defrag
On Oct 23, 2006 18:31 +0400, Alex Tomas wrote:
> isn't that a kernel responsibility to find/allocate target blocks?
> wouldn't it be better to specify a desirable target group and a
> minimal acceptable chunk of free blocks?

In some cases this is useful (e.g. if a file has small fragments after
being written in small pieces or into fragmented free space). In other
cases the user tool HAS to be able to specify the new mapping in order
to make progress. Consider: there are two very large fragmented files,
and the userspace defrag tool wants to make contiguous free space. If
the kernel is left to do the allocation, it will always consume the
largest chunk of free space first, even if that is not optimal (e.g. a
large 1MB-aligned extent).

I would make this interface optionally allow the target extent to be
specified, but if the target block == 0, then the kernel is free to do
its own allocation.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [RFC] Ext3 online defrag
> Theodore Tso (TT) writes:
> TT> On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
> >> Hello, I've written a simple patch implementing an ext3 ioctl for
> >> file relocation. Basically you call the ioctl on a file, give it a
> >> list of blocks, and it relocates the file into the given blocks
> >> (provided they are still free). The idea is to use it as the
> >> kernel part of an ext3 online defragmenter (or, more generally, a
> >> disk access optimizer).
>
> isn't that a kernel responsibility to find/allocate target blocks?
> wouldn't it be better to specify a desirable target group and a
> minimal acceptable chunk of free blocks?

The kernel definitely allocates those blocks (because that's the only
reasonably race-free way). The problem of finding those blocks is a bit
harder - it may be quite a complicated decision where to put the file
(also given that sometimes you may need to shift some file away to make
space for another one). Also, what I'm aiming for is that the userspace
defragmenter could be fed some access patterns and optimize the layout
of several files to speed up startup (i.e. blocks of those several
files would be interleaved so that their sequence is close to the one
seen during start-up).

								Honza
Re: [RFC] Ext3 online defrag
On Oct 23, 2006 18:31 +0400, Alex Tomas wrote:
> I would make this interface optionally allow the target extent to be
> specified, but if the target block == 0, then the kernel is free to do
> its own allocation.

That's a good idea! I'll change the handling so that if block == 0 we
just allocate the blocks of the given extent as we wish...

								Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [RFC] Ext3 online defrag
Alex Tomas wrote:
> Theodore Tso (TT) writes:
> TT> On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
> >> Hello, I've written a simple patch implementing an ext3 ioctl for
> >> file relocation. Basically you call the ioctl on a file, give it a
> >> list of blocks, and it relocates the file into the given blocks
> >> (provided they are still free). The idea is to use it as the
> >> kernel part of an ext3 online defragmenter (or, more generally, a
> >> disk access optimizer).
>
> isn't that a kernel responsibility to find/allocate target blocks?
> wouldn't it be better to specify a desirable target group and a
> minimal acceptable chunk of free blocks?

XFS does this by allocating new blocks for a temporary file (initiated
from userspace, implemented in kernelspace of course), then just
checking to see if the result is better than what we had before; if so,
it swaps the storage and throws away the temporary file (which now
holds the original, more-fragmented file blocks). See xfs_swapext() in
xfs_dfrag.c for the extent-swapping part of this.

You probably want to avoid the page cache in all of this too, doing
O_DIRECT IO if possible; I don't think there's any reason to churn the
page cache while the defragmenter runs over a filesystem?

-Eric
Re: [RFC] Ext3 online defrag
On Oct 23, 2006 10:16 -0400, Theodore Tso wrote:
> As a suggestion, I would pass the inode number and inode generation
> number into the ext3_file_move_data array:
>
> 	struct ext3_file_move_data {
> 		int extents;
> 		struct ext3_reloc_extent __user *ext_array;
> 	};
>
> This will be much more efficient for the userspace relocator, since it
> won't need to translate from an inode number to a pathname, and then
> try to open the file before relocating it. I'd also use an explicit
> 64-bit block number type so that we don't have to worry about the ABI
> changing when we support 64-bit block numbers.

I would in fact go so far as to allow only a single extent to be
specified per call. This avoids passing any pointers as part of the
interface (hello, ioctl police :-), and also makes the kernel code
simpler. I don't think the syscall/ioctl overhead is significant
compared to the journal and I/O overhead.

Also, I would specify both the source extent and the target extent in
the inode. First, this allows defragmenting only part of the file
instead of (it appears) requiring the whole file to be relocated. That
would be a killer if the file being defragmented is larger than the
free space. Second, it provides a level of insurance that what the
kernel is relocating matches what userspace thinks it is doing. It
would protect against problems if the kernel ever does block relocation
itself (e.g. merging fragments into a single extent on (re)write, or
for snapshot/COW).

> The other problem I see with this patch is that there will be cache
> coherency problems between the buffer cache and the page cache. I
> think you will want to pull the data blocks of the file into the page
> cache, then write them out from the page cache, and only *then* update
> the indirect blocks and commit the transaction.

Alternately (maybe even better), treat it as O_DIRECT and ensure the
page cache is flushed. This also avoids polluting the whole page cache
while running a defragmenter on the filesystem.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [RFC] Ext3 online defrag
On Oct 23, 2006 10:16 -0400, Theodore Tso wrote:
> > As a suggestion, I would pass the inode number and inode generation
> > number into the ext3_file_move_data array:
> >
> > 	struct ext3_file_move_data {
> > 		int extents;
> > 		struct ext3_reloc_extent __user *ext_array;
> > 	};
> >
> > This will be much more efficient for the userspace relocator, since
> > it won't need to translate from an inode number to a pathname, and
> > then try to open the file before relocating it. I'd also use an
> > explicit 64-bit block number type so that we don't have to worry
> > about the ABI changing when we support 64-bit block numbers.
>
> I would in fact go so far as to allow only a single extent to be
> specified per call. This avoids passing any pointers as part of the
> interface (hello, ioctl police :-), and also makes the kernel code
> simpler. I don't think the syscall/ioctl overhead is significant
> compared to the journal and I/O overhead.

I'm not sure it makes the kernel code simpler - if we have to replace
just a part of the file, we have to rewrite references to blocks at
several places inside the indirect tree. If we relocate the whole file,
we just replace the block pointers in the inode. Furthermore, it makes
it kind of harder to tell where the indirect blocks would go - and it
would be impossible for the defragmenter to force some unusual
placement of indirect blocks... Currently, blocks (including indirect
ones) are just allocated in DFS order from the given list.

> Also, I would specify both the source extent and the target extent in
> the inode. First, this allows defragmenting only part of the file
> instead of (it appears) requiring the whole file to be relocated. That
> would be a killer if the file being defragmented is larger than the
> free space. Second, it provides a level of insurance that what the
> kernel is relocating matches what userspace thinks it is doing. It
> would protect against problems if the kernel ever does block
> relocation itself (e.g. merging fragments into a single extent on
> (re)write, or for snapshot/COW).

I agree that this is the positive side of your approach :).

> The other problem I see with this patch is that there will be cache
> coherency problems between the buffer cache and the page cache. I
> think you will want to pull the data blocks of the file into the page
> cache, then write them out from the page cache, and only *then* update
> the indirect blocks and commit the transaction.
>
> Alternately (maybe even better), treat it as O_DIRECT and ensure the
> page cache is flushed. This also avoids polluting the whole page cache
> while running a defragmenter on the filesystem.

That's what I'm trying to do (but maybe my code is buggy).

								Honza
--
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
Re: [RFC] Ext3 online defrag
On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote:
> isn't that a kernel responsibility to find/allocate target blocks?
> wouldn't it be better to specify a desirable target group and a
> minimal acceptable chunk of free blocks?

The kernel doesn't have enough knowledge to know whether or not the
defragger prefers one blkdev location over another. When you are trying
to consolidate blocks, you must specify the destination as well as the
source blocks. Certainly, to prevent corruption and other nastiness,
you must fail if the destination isn't available... (ext2meta did all
this...)

	Jeff