Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
> On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> > On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:
> > > But it's a race that is _easily_ handled, and applications only
> > > need to implement one interface, not a different method for every
> > > filesystem that requires deep filesystem knowledge.
> > >
> > > Besides, you still have to handle the case where the block you
> > > want has already been allocated because reading the metadata from
> > > userspace doesn't prevent the kernel from allocating the block you
> > > want before you ask for it...
> >
> > The race is easily handled either way, by having the block move fail
> > when you tell the kernel the destination blocks.
>
> So why are you arguing that an interface is no good because it is
> fundamentally racy? ;)

My point was that it is silly to introduce obviously racy code into the
kernel, when -- inside the kernel -- it could be handled race-free.

If you accept a racy solution, you might as well do it outside the
kernel, where you get the same results, but without adding silliness and
bloat to the kernel.

> > Every major filesystem has a libfoofs library that makes it trivial
> > to read the metadata, so all you need to do is use an existing lib.
>
> IOWs, you are advocating that any application that wants to use this
> special allocation technique needs to link against every different
> filesystem library and it then needs to implement filesystem specific
> searches through their metadata? Nobody in their right mind would ever
> want to use an interface like this.

Online defrag is OBVIOUSLY highly filesystem specific. You have to link
against filesystem specific code somewhere, whether it's inside the
kernel or outside the kernel.

Further, in the case being discussed in this thread, ext2meta has
already been proven a workable solution.

Jeff - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
> > On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> > So why are you arguing that an interface is no good because it is
> > fundamentally racy? ;)
>
> My point was that it is silly to introduce obviously racy code into the
> kernel, when -- inside the kernel -- it could be handled race-free.

So how do you then get the generic interface to allocate blocks
specified by userspace race free?

> > > Every major filesystem has a libfoofs library that makes it trivial
> > > to read the metadata, so all you need to do is use an existing lib.
> >
> > IOWs, you are advocating that any application that wants to use this
> > special allocation technique needs to link against every different
> > filesystem library and it then needs to implement filesystem specific
> > searches through their metadata? Nobody in their right mind would
> > ever want to use an interface like this.
>
> Online defrag is OBVIOUSLY highly filesystem specific.

Parts of it are, but data movement and allocation hints need to be
provided by every filesystem that wants to implement this efficiently.
These features are also useful outside of defrag - I can think of
several applications that would benefit from being able to direct where
in the filesystem they want data to reside.

If userspace directed allocation requires deep knowledge of the
filesystem metadata (this is what you are saying they need to do,
right?), then these applications will never, ever make use of this
interface and we'll continue to have problems with them.

I guess my point is that we are going to implement features like this in
XFS and if other filesystems are going to be doing the same thing then
we should try to come up with generic solutions rather than reinvent the
wheel over and over again.

> Further, in the case being discussed in this thread, ext2meta has
> already been proven a workable solution.

Sure, but that's not a generic solution to a problem common to all
filesystems.

Cheers,
Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: ext3: bogus i_mode errors with 2.6.18.1
On 14:27, Andreas Dilger wrote:
> > +	j = find_next_usable_block(-1, gdp, EXT3_BLOCKS_PER_GROUP(sb));
>
> I'm not sure why the find_next_usable_block() part is in here? At this
> point we KNOW that ret_block is not a block we should be allocating,
> yet it is marked free in the bitmap. So we should just mark the
> block(s) in-use in the bitmap and look for a different block(s).

Are you saying that ext3_set_bit() should simply be called with
ret_block as its first argument? If yes, that is what the revised patch
below does.

Thanks
Andre

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index 063d994..3cca317 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -359,17 +359,6 @@ do_more:
 	if (!desc)
 		goto error_return;
 
-	if (in_range (le32_to_cpu(desc->bg_block_bitmap), block, count) ||
-	    in_range (le32_to_cpu(desc->bg_inode_bitmap), block, count) ||
-	    in_range (block, le32_to_cpu(desc->bg_inode_table),
-		      sbi->s_itb_per_group) ||
-	    in_range (block + count - 1, le32_to_cpu(desc->bg_inode_table),
-		      sbi->s_itb_per_group))
-		ext3_error (sb, "ext3_free_blocks",
-			    "Freeing blocks in system zones - "
-			    "Block = "E3FSBLK", count = %lu",
-			    block, count);
-
 	/*
 	 * We are about to start releasing blocks in the bitmap,
 	 * so we need undo access.
@@ -392,7 +381,17 @@ do_more:
 
 	jbd_lock_bh_state(bitmap_bh);
 
-	for (i = 0, group_freed = 0; i < count; i++) {
+	for (i = 0, group_freed = 0; i < count; i++, block++) {
+		struct ext3_group_desc *gdp = ext3_get_group_desc(sb, i, NULL);
+		if (block == le32_to_cpu(gdp->bg_block_bitmap) ||
+			block == le32_to_cpu(gdp->bg_inode_bitmap) ||
+			in_range(block, le32_to_cpu(gdp->bg_inode_table),
+				EXT3_SB(sb)->s_itb_per_group)) {
+			ext3_error(sb, __FUNCTION__,
+				"Freeing block in system zone - block = %lu",
+				block);
+			continue;
+		}
 		/*
 		 * An HJ special.  This is expensive...
 		 */
@@ -400,7 +399,7 @@ #ifdef CONFIG_JBD_DEBUG
 		jbd_unlock_bh_state(bitmap_bh);
 		{
 			struct buffer_head *debug_bh;
-			debug_bh = sb_find_get_block(sb, block + i);
+			debug_bh = sb_find_get_block(sb, block);
 			if (debug_bh) {
 				BUFFER_TRACE(debug_bh, "Deleted!");
 				if (!bh2jh(bitmap_bh)->b_committed_data)
@@ -452,7 +451,7 @@ #endif
 			jbd_unlock_bh_state(bitmap_bh);
 			ext3_error(sb, __FUNCTION__,
 				"bit already cleared for block "E3FSBLK,
-				 block + i);
+				 block);
 			jbd_lock_bh_state(bitmap_bh);
 			BUFFER_TRACE(bitmap_bh, "bit already cleared");
 		} else {
@@ -479,7 +478,6 @@ #endif
 	*pdquot_freed_blocks += group_freed;
 
 	if (overflow && !err) {
-		block += count;
 		count = overflow;
 		goto do_more;
 	}
@@ -1260,7 +1258,7 @@ #endif
 		*errp = -ENOSPC;
 		goto out;
 	}
-
+repeat:
 	/*
 	 * First, test whether the goal block is free.
 	 */
@@ -1372,12 +1370,21 @@ allocated:
 	    in_range(ret_block, le32_to_cpu(gdp->bg_inode_table),
 		      EXT3_SB(sb)->s_itb_per_group) ||
 	    in_range(ret_block + num - 1, le32_to_cpu(gdp->bg_inode_table),
-		      EXT3_SB(sb)->s_itb_per_group))
-		ext3_error(sb, "ext3_new_block",
+		      EXT3_SB(sb)->s_itb_per_group)) {
+		ext3_error(sb, __FUNCTION__,
 			    "Allocating block in system zone - "
 			    "blocks from "E3FSBLK", length %lu",
 			     ret_block, num);
-
+		/* Note: This will potentially use up one of the handle's
+		 * buffer credits.  Normally we have way too many credits,
+		 * so that is OK.  In _very_ rare cases it might not be OK.
+		 * We will trigger an assertion if we run out of credits,
+		 * and we will have to do a full fsck of the filesystem -
+		 * better than randomly corrupting filesystem metadata.
+		 */
+		ext3_set_bit(ret_block, gdp_bh->b_data);
+		goto repeat;
+	}
 	performed_allocation = 1;
 
 #ifdef CONFIG_JBD_DEBUG
-- 
The only person who always got his work done by Friday was Robinson Crusoe
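The allocation-side idea in the patch above (when the bitmap offers a block that lies in a system zone, mark it in use and retry) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct toy_group`, `toy_alloc_block()` and the other `toy_` names are hypothetical and are not ext3 code; only the shape of the `in_range()` test follows the kernel helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the kernel helper: is b inside [first, first + len - 1]? */
#define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1)

#define TOY_BLOCKS 64

/* Hypothetical, simplified stand-in for one block group. */
struct toy_group {
    uint8_t  bitmap[TOY_BLOCKS / 8]; /* one bit per block, set = in use */
    unsigned block_bitmap;           /* block holding the block bitmap */
    unsigned inode_bitmap;           /* block holding the inode bitmap */
    unsigned inode_table;            /* first block of the inode table */
    unsigned itb_len;                /* inode table length in blocks */
};

static bool toy_test_bit(const struct toy_group *g, unsigned b)
{
    assert(b < TOY_BLOCKS);
    return (g->bitmap[b / 8] & (1u << (b % 8))) != 0;
}

static void toy_set_bit(struct toy_group *g, unsigned b)
{
    assert(b < TOY_BLOCKS);
    g->bitmap[b / 8] |= (uint8_t)(1u << (b % 8));
}

static bool toy_is_system_block(const struct toy_group *g, unsigned b)
{
    return b == g->block_bitmap || b == g->inode_bitmap ||
           in_range(b, g->inode_table, g->itb_len);
}

/*
 * Return the first usable free block, or -1 if the group is full.
 * A block that the (corrupt) bitmap reports as free but that lies in a
 * system zone is marked in use and skipped, so it is never offered again.
 * *repaired counts how many such bits were fixed up.
 */
static int toy_alloc_block(struct toy_group *g, unsigned *repaired)
{
    for (unsigned b = 0; b < TOY_BLOCKS; b++) {
        if (toy_test_bit(g, b))
            continue;
        toy_set_bit(g, b);
        if (toy_is_system_block(g, b)) {
            (*repaired)++;          /* bitmap was wrong: fix it and retry */
            continue;
        }
        return (int)b;
    }
    return -1;
}
```

Calling `toy_alloc_block()` on a group whose bitmap is all zeroes but whose metadata occupies blocks 0 to 5 skips those six blocks and hands out block 6; later calls no longer see the bad bits. The real patch additionally has to worry about journal credits, which this model ignores.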
Re: [RFC] Ext3 online defrag
> On Oct 24, 2006 15:44 -0400, Theodore Tso wrote:
> > First of all, we would need a way of allowing userspace to specify
> > which blocks should be used in the preallocation.
>
> Presumably it could do this in the same way it will be specifying which
> blocks to relocate in the defragger - by passing an extent. You would
> be required to pass the file offset for which to preallocate, and
> optionally an extent for the on-disk allocation itself (if none is
> supplied the kernel will allocate the best extent it can).
>
> > Secondly, we would need a way of marking blocks as preallocated but
> > not pre-zeroed; otherwise we would have to zero out all of the blocks
> > in order to assure security (don't want userspace programs seeing the
> > previous contents of the data blocks), only to do the copy and the
> > extents vector swap.
>
> This could be mitigated by having the preallocation be done (in the
> defragment case) against a temporary inode in the orphan list (as the
> initial patch did) so if there is a crash it will be released. The
> temporary inode will not be linked into the namespace so it cannot be
> read - only used to hold preallocation. If this was a write-only file
> handle then we should be OK?
>
> For defragger purposes this would need:
> - allocate new temporary inode (VFS + fs, returns write-only fh if fs
>   can't properly handle uninitialized extents, or doesn't request
>   full-extent zeroing)
> for each extent to defragment {
>   - preallocate extents on temp inode (fs specific internals)
>   - copy data from orig to temp at offset X (VFS, splice or e.g.
>     sys_copyfile(src, dst, offset, count) which Linus agreed to at
>     KS '05 for network filesystems)
>   - migrate copied extent to original inode (fs specific internals)
> }
> - free temporary inode (just close of temp fh, frees unmigrated
>   extents).

Yes, this sounds feasible. We could split the defrag ioctl into two
pieces (addition of given extent to a file and swapping of extents),
which can have generic interface...

> I don't think this is much more work than implementing all of this
> functionality as part of a monolithic online defrag function, assuming
> we don't require full-file copies in order to do defrag.

Yes, it's not more work than supporting swapping of extents in the
middle of the file. I've just not yet decided how to handle indirect
blocks in case of relocation in the middle of the file. Should they be
relocated or shouldn't they? Probably they should be relocated at least
in case they are fully contained in the relocated interval, or maybe
better said, when all the blocks they reference are also in the interval
(this also handles the case of EOF). But still, if you would like to
relocate the file by parts, this is not quite what you want (you won't
be able to relocate indirect blocks at the boundary of intervals) :(.

								Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
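The per-extent sequence listed in the message above (preallocate on a temporary inode, copy, migrate the extent back) can be modelled in userspace to make the data flow concrete. This is a toy sketch, not filesystem code: `struct toy_disk`, `toy_prealloc_extent()` and `toy_defrag_range()` are hypothetical names, the "inode" is just an array mapping logical to physical blocks, and the old blocks are freed directly instead of being handed to a temporary inode that is later closed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DISK_BLOCKS 32
#define BLOCK_SIZE  4

/* Hypothetical in-memory "disk": block contents plus an allocation map. */
struct toy_disk {
    char data[DISK_BLOCKS][BLOCK_SIZE];
    bool used[DISK_BLOCKS];
};

/*
 * Step 1: preallocate `len` contiguous free blocks (what the temporary
 * inode would hold). Returns the first block, or -1 if no run is free.
 */
static int toy_prealloc_extent(struct toy_disk *d, int len)
{
    for (int start = 0; start + len <= DISK_BLOCKS; start++) {
        int n = 0;
        while (n < len && !d->used[start + n])
            n++;
        if (n == len) {
            for (int i = 0; i < len; i++)
                d->used[start + i] = true;
            return start;
        }
    }
    return -1;
}

/*
 * Defragment logical blocks [off, off + len) of a file whose block map
 * is `map` (logical index -> physical block). Returns the new first
 * physical block, or -1 if no contiguous extent was available.
 */
static int toy_defrag_range(struct toy_disk *d, int *map, int off, int len)
{
    int tmp = toy_prealloc_extent(d, len);
    if (tmp < 0)
        return -1;

    /* Step 2: copy data from the original blocks into the new extent. */
    for (int i = 0; i < len; i++)
        memcpy(d->data[tmp + i], d->data[map[off + i]], BLOCK_SIZE);

    /* Step 3: migrate the extent by repointing the block map, then
     * release the old blocks (simplification of "free temporary inode"). */
    for (int i = 0; i < len; i++) {
        d->used[map[off + i]] = false;
        map[off + i] = tmp + i;
    }
    return tmp;
}
```

In a real filesystem steps 2 and 3 must be atomic with respect to concurrent writers and must go through the journal; the model only shows which pieces are generic (the copy) and which would need filesystem-specific code (allocation and the pointer swap).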
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
> On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
> > On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
> > > On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> > > So why are you arguing that an interface is no good because it is
> > > fundamentally racy? ;)
> >
> > My point was that it is silly to introduce obviously racy code into
> > the kernel, when -- inside the kernel -- it could be handled
> > race-free.
>
> So how do you then get the generic interface to allocate blocks
> specified by userspace race free?

As has been repeatedly stated, there is no generic. There MUST be
filesystem-specific knowledge during these operations.

> If userspace directed allocation requires deep knowledge of the
> filesystem metadata (this is what you are saying they need to do,
> right?), then these applications will never, ever make use of this
> interface and we'll continue to have problems with them.

Completely false assumptions. There is no difference in handling of
knowledge, be it kernel space or userspace.

> > Further, in the case being discussed in this thread, ext2meta has
> > already been proven a workable solution.
>
> Sure, but that's not a generic solution to a problem common to all
> filesystems

You clearly don't know what I'm talking about. ext2meta is an example of
a filesystem-specific metadata access method, applicable to tasks such
as online optimization. Implement that tiny kernel module for each
filesystem, and you have everything you need, without races.

This was discussed years ago; review the mailing lists. Google for
'Alexander Viro' and 'ext2meta'.

	Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:
> Yes, this sounds feasible. We could split the defrag ioctl into two
> pieces (addition of given extent to a file and swapping of extents),
> which can have generic interface...

An ioctl is UGLY.

This was discussed years ago. Google for 'Alexander Viro' and
'ext2meta'. That's a clean, flexible, extensible way to access metadata
online. No need for ioctl binary translation across 32bit-64bit, or any
other ioctl issue.

	Jeff
Re: [RFC] Ext3 online defrag
> On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:
> > Yes, this sounds feasible. We could split the defrag ioctl into two
> > pieces (addition of given extent to a file and swapping of extents),
> > which can have generic interface...
>
> An ioctl is UGLY.

Agreed.

> This was discussed years ago. Google for 'Alexander Viro' and
> 'ext2meta'. That's a clean, flexible, extensible way to access metadata
> online. No need for ioctl binary translation across 32bit-64bit, or any
> other ioctl issue.

I've briefly looked at this and this kind of interface has some appeal.
On the other hand, it's not obvious to me how to implement in this
interface the *atomic* operation "copy data from file F to given set of
blocks and rewrite pointers to original blocks with pointers to new
blocks". Something like this is needed for what we want to do...

Also, if we'd like to implement an operation like "add this block to
file F at position P", we have to make sure that all the necessary
updates (bitmap updates, inode updates, indirect block updates) go into
one transaction. Which basically means that either ext3meta has to have
a way to do this in a single operation, or we have to give userspace a
way to start/stop a transaction, and that starts to be really a mess
because of various deadlocks and so on.

								Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Re: [RFC] Ext3 online defrag
> On Wed, Oct 25, 2006 at 07:58:51PM +0200, Jan Kara wrote:
> > I've briefly looked at this and this kind of interface has some
> > appeal. On the other hand it's not obvious to me, how to implement in
> > this interface *atomic* operation copy data from file F to given set
> > of blocks and rewrite pointers to original blocks with pointers to
> > new blocks. Something like this is needed for what we want to do...
> > Also if we'd like to implement operation like add this block to file
> > F at position P we have to make sure that all the necessary updates
> > (bitmap updates, inode updates, indirect block updates) go into one
> > transaction. Which basically mean that either ext3meta has to have a
> > way how to do this in a single operation, or we have to give
> > userspace a way to start/stop transaction and that starts to be
> > really a mess because of various deadlocks and so on.
>
> Agreed, these issues exist. But these issues exist independent of
> whether an ioctl or ext3meta is used. It's all the responsibility of
> the implementor to define the interface. My contention is that the
> ext3meta interface method would be much more robust than ioctl. It's a
> namespace inside which you can define any inodes/dirents you wish, for
> the operations you desire.

I see. So you mean that in our ext3meta filesystem we'd have a file
named add_this_extent_to_inode and a file reloc_inode_interval and
they'd be fed essentially the same info as the current ioctl interface
and do the same thing as we currently do. Hmm, I don't find it that nice
any more but yes, this would work.

> Heck, according to my sf.net/projects/gkernel CVS log, you offered some
> helpful review comments to me when I was implementing ext2meta ;-)

Looking at those mails it was already quite some time ago so I forgot
about it ;)

								Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 08:25:30PM +0200, Jan Kara wrote:
> I see. So you mean that in our ext3meta filesystem we'd have a file
> named add_this_extent_to_inode and a file reloc_inode_interval and
> they'd be fed essentially the same info as the current ioctl interface
> and do the same thing as we currently do. Hmm, I don't find it that
> nice any more but yes, this would work.

It depends on the operation. ext2meta[1] works fine for online defrag,
just exporting metadata objects and providing read(2) and write(2)
operations on them. Adding 'trigger' files (like your
add_this_extent_to_inode) may make sense for some operations, indeed,
but we need to see the whole picture before really understanding whether
that interface is optimal.

	Jeff

[1] http://linux.yyz.us/misc/ext2meta.c
Re: [RFC] Ext3 online defrag
> On Oct 23, 2006 18:03 +0200, Jan Kara wrote:
> > Andreas Dilger wrote:
> > > I would in fact go so far as to allow only a single extent to be
> > > specified per call. This is to avoid the passing of any pointers as
> > > part of the interface (hello ioctl police :-), and also makes the
> > > kernel code simpler. I don't think the syscall/ioctl overhead is
> > > significant compared to the journal and IO overhead.
> >
> > ...it makes it kind of harder to tell where indirect blocks would go
> > - and it would be impossible for the defragmenter to force some
> > unusual placement of indirect blocks...
>
> It would be possible to specify indirect block relocation in the same
> manner as regular block relocation I think. Allocate a new block, copy
> contents, flush block from cache, fix up reference (inode, dindirect),
> commit.

Yes, but there's a question of the interface to this operation. How do I
specify which indirect block I mean? Obviously we could introduce a
separate call for remapping indirect blocks, but I find this solution
kind of clumsy...

								Bye
									Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 08:36:56PM +0200, Jan Kara wrote:
> Yes, but there's a question of the interface to this operation. How to
> specify which indirect block I mean? Obviously we could introduce
> separate call for remapping indirect blocks but I find this solution
> kind of clumsy...

Agreed... that gets nasty real quick.

	Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
> > On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
> > > On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
> > > > On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> > > > So why are you arguing that an interface is no good because it
> > > > is fundamentally racy? ;)
> > >
> > > My point was that it is silly to introduce obviously racy code
> > > into the kernel, when -- inside the kernel -- it could be handled
> > > race-free.
> >
> > So how do you then get the generic interface to allocate blocks
> > specified by userspace race free?
>
> As has been repeatedly stated, there is no generic. There MUST be
> filesystem-specific knowledge during these operations.

What information? All we need to know is where the free disk space is,
and have a method to attempt to allocate from it. That's _easy_ to
abstract into a common interface via the VFS.

> > > Further, in the case being discussed in this thread, ext2meta has
> > > already been proven a workable solution.
> >
> > Sure, but that's not a generic solution to a problem common to all
> > filesystems
>
> You clearly don't know what I'm talking about. ext2meta is an example
> of a filesystem-specific metadata access method, applicable to tasks
> such as online optimization.

I know exactly what ext2meta is. I said it's not a generic solution and
you say it's a filesystem specific solution. I think we're agreeing
here. ;)

We don't need to expose anything filesystem specific to userspace to
implement this. Online data movement (i.e. the defrag mechanism) becomes
something like:

	do {
		get_free_list(dst_fd, location, len, list)
		/* select extent to use */
		alloc_from_list(dst_fd, list[X], off, len)
	} while (ENOALLOC)
	move_data(src_fd, dst_fd, off, len);

And this would work on any filesystem type that implemented these
interfaces. Hence tools like a startup file optimiser would only need to
be written once, rather than needing a different tool for every
different filesystem type.

Remember, I'm not just talking about defrag - I'm talking about an
interface that is actually useful to apps that might care about how data
is laid out on disk but the application writers don't know anything
about how filesystem X or Y or Z is implemented. Putting the burden of
learning about filesystem internals on application developers is not the
correct solution.

I see substantial benefit moving forward from having filesystem
independent interfaces. Many features that filesystems implement are
common, and as time goes on the common feature set of the different
filesystems gets larger. So why shouldn't we be trying to make common
operations generic so that every filesystem can benefit from the latest
and greatest tool?

Cheers,
Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
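The get_free_list()/alloc_from_list() retry loop in the message above is pseudocode for an interface that was only being proposed. A userspace model shows how the race is absorbed by retrying with a refreshed free list. Everything here is an assumption made for illustration: the `toy_` names, `struct toy_fs`, the `races_left` counter that simulates a competing allocator, and `TOY_ENOALLOC`, which stands in for the ENOALLOC of the pseudocode (it is not a real errno value).

```c
#include <assert.h>
#include <stdbool.h>

#define FS_BLOCKS    16
#define TOY_ENOALLOC (-1)  /* stand-in for the pseudocode's ENOALLOC */

struct toy_extent { int start, len; };

/* Hypothetical filesystem state: an allocation map plus a counter of
 * simulated concurrent allocations that will beat the caller. */
struct toy_fs {
    bool used[FS_BLOCKS];
    int  races_left;
};

/* Model of get_free_list(): report free runs of at least `len` blocks
 * at or after `location`. Returns the number of entries written. */
static int toy_get_free_list(const struct toy_fs *fs, int location, int len,
                             struct toy_extent *list, int max)
{
    int n = 0;
    for (int b = location; b < FS_BLOCKS && n < max; ) {
        int run = 0;
        while (b + run < FS_BLOCKS && !fs->used[b + run])
            run++;
        if (run >= len) {
            list[n].start = b;
            list[n].len = run;
            n++;
        }
        b += run + 1;  /* skip the run and the used block that ended it */
    }
    return n;
}

/* Model of alloc_from_list(): claim `len` blocks at the start of `e`,
 * failing if any of them was taken since the list was read. */
static int toy_alloc_from_list(struct toy_fs *fs, struct toy_extent e, int len)
{
    if (fs->races_left > 0) {      /* someone else allocates first */
        fs->races_left--;
        fs->used[e.start] = true;
    }
    for (int i = 0; i < len; i++)
        if (fs->used[e.start + i])
            return TOY_ENOALLOC;
    for (int i = 0; i < len; i++)
        fs->used[e.start + i] = true;
    return e.start;
}

/* The retry loop: re-read the free list whenever the allocation loses. */
static int toy_place(struct toy_fs *fs, int location, int len)
{
    struct toy_extent list[4];
    for (;;) {
        int n = toy_get_free_list(fs, location, len, list, 4);
        if (n == 0)
            return TOY_ENOALLOC;   /* no suitable free space at all */
        int got = toy_alloc_from_list(fs, list[0], len);
        if (got != TOY_ENOALLOC)
            return got;
    }
}
```

The model always picks the first candidate extent; choosing a good one is where filesystem-specific placement policy would come in. The move_data() step is left out because it does not affect the race handling.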
Re: [RFC] Ext3 online defrag
On Thu, Oct 26, 2006 at 11:40:20AM +1000, David Chinner wrote:
> We don't need to expose anything filesystem specific to userspace to
> implement this. Online data movement (i.e. the defrag mechanism)
> becomes something like:
>
>	do {
>		get_free_list(dst_fd, location, len, list)
>		/* select extent to use */
>		alloc_from_list(dst_fd, list[X], off, len)
>	} while (ENOALLOC)
>	move_data(src_fd, dst_fd, off, len);
>
> And this would work on any filesystem type that implemented these
> interfaces. Hence tools like a startup file optimiser would only need
> to be written once, rather than needing a different tool for every
> different filesystem type.

Yeah, but that's simply not enough. A good defragger needs to know about
a filesystem's allocation policies, and move files so they are optimally
located, given the filesystem layout. For example, in ext2/3/4 we will
want to move blocks so they are in the same block group as the inode.
That's filesystem specific information; other filesystems will require
different policies.

> Remember, I'm not just talking about defrag - I'm talking about an
> interface that is actually useful to apps that might care about how
> data is laid out on disk but the application writers don't know
> anything about how filesystem X or Y or Z is implemented. Putting the
> burden of learning about filesystem internals on application developers
> is not the correct solution.

Unfortunately, a defragger *has* to know about some very low-level
filesystem specific information if it wants to do a good job.

						- Ted
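The ext2/3/4 placement policy mentioned in the message above (keep data blocks in the same block group as their inode) depends on two small calculations. The sketch below assumes the usual ext2-style layout, where a block's group is `(block - first_data_block) / blocks_per_group` and inode numbers start at 1; `struct toy_sb` and the function names are hypothetical and only carry the three superblock fields the arithmetic needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of superblock geometry. */
struct toy_sb {
    uint32_t first_data_block;  /* 0 or 1 depending on block size */
    uint32_t blocks_per_group;
    uint32_t inodes_per_group;
};

/* Block group that contains a given block number. */
static uint32_t group_of_block(const struct toy_sb *sb, uint32_t block)
{
    assert(block >= sb->first_data_block);
    return (block - sb->first_data_block) / sb->blocks_per_group;
}

/* Block group that contains a given inode (inode numbers start at 1). */
static uint32_t group_of_inode(const struct toy_sb *sb, uint32_t ino)
{
    assert(ino >= 1);
    return (ino - 1) / sb->inodes_per_group;
}

/* True if the block already lives in the inode's own group. */
static bool block_is_local(const struct toy_sb *sb, uint32_t ino,
                           uint32_t block)
{
    return group_of_block(sb, block) == group_of_inode(sb, ino);
}

/* Count the blocks a defragger might want to move closer to the inode. */
static size_t count_remote_blocks(const struct toy_sb *sb, uint32_t ino,
                                  const uint32_t *blocks, size_t n)
{
    size_t remote = 0;
    for (size_t i = 0; i < n; i++)
        if (!block_is_local(sb, ino, blocks[i]))
            remote++;
    return remote;
}
```

A tool that works only through a generic free-list interface has no way to compute this without the group geometry, which is the kind of filesystem-specific knowledge the message argues a good defragger cannot avoid.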