Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
 On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
  On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:
   But it is a race that is _easily_ handled, and applications only need to
   implement one interface, not a different method for every
   filesystem that requires deep filesystem knowledge.
   
   Besides, you still have to handle the case where the block you want
   has already been allocated because reading the metadata from
   userspace doesn't prevent the kernel from allocating the block you
   want before you ask for it...
  
  The race is easily handled either way, by having the block move fail
  when you tell the kernel the destination blocks.
 
 So why are you arguing that an interface is no good because it
 is fundamentally racy? ;)

My point was that it is silly to introduce obviously racy code into the
kernel, when -- inside the kernel -- it could be handled race-free.

If you accept a racy solution, you might as well do it outside the
kernel, where you get the same results, but without adding silliness and
bloat to the kernel.


  Every major filesystem has a libfoofs library that makes it trivial to
  read the metadata, so all you need to do is use an existing lib.
 
 IOWs, you are advocating that any application that wants to use this
 special allocation technique needs to link against every different
 filesystem library and it then needs to implement filesystem
 specific searches through their metadata?  Nobody in their right
 mind would ever want to use an interface like this.

Online defrag is OBVIOUSLY highly filesystem specific.  You have to link
against filesystem specific code somewhere, whether it's inside the
kernel or outside the kernel.

Further, in the case being discussed in this thread, ext2meta has
already been proven a workable solution.

Jeff



-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-25 Thread David Chinner
On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
 On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
  On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
  So why are you arguing that an interface is no good because it
  is fundamentally racy? ;)
 
 My point was that it is silly to introduce obviously racy code into the
 kernel, when -- inside the kernel -- it could be handled race-free.

So how do you then get the generic interface to allocate blocks
specified by userspace race free?

   Every major filesystem has a libfoofs library that makes it trivial to
   read the metadata, so all you need to do is use an existing lib.
  
  IOWs, you are advocating that any application that wants to use this
  special allocation technique needs to link against every different
  filesystem library and it then needs to implement filesystem
  specific searches through their metadata?  Nobody in their right
  mind would ever want to use an interface like this.
 
 Online defrag is OBVIOUSLY highly filesystem specific. 

Parts of it are, but data movement and allocation hints need to be
provided by every filesystem that wants to implement this
efficiently. These features are also useful outside of defrag as
well - I can think of several applications that would benefit from
being able to direct where in the filesystem they want data to
reside. 

If userspace directed allocation requires deep knowledge of the
filesystem metadata (this is what you are saying they need to do,
right?), then these applications will never, ever make use of this
interface and we'll continue to have problems with them.

I guess my point is that we are going to implement features like
this in XFS and if other filesystems are going to be doing the same
thing then we should try to come up with generic solutions rather
than reinvent the wheel over and over again.

 Further, in the case being discussed in this thread, ext2meta has
 already been proven a workable solution.

Sure, but that's not a generic solution to a problem common to
all filesystems.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: ext3: bogus i_mode errors with 2.6.18.1

2006-10-25 Thread Andre Noll
On 14:27, Andreas Dilger wrote:

  +   j = find_next_usable_block(-1, gdp, EXT3_BLOCKS_PER_GROUP(sb));
 
 I'm not sure why the find_next_usable_block() part is in here?  At this
 point we KNOW that ret_block is not a block we should be allocating, yet
 it is marked free in the bitmap.  So we should just mark the block(s) in-use
 in the bitmap and look for a different block(s).

Are you saying that ext3_set_bit() should simply be called with
ret_block as its first argument? If yes, that is what the revised
patch below does.

Thanks
Andre

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index 063d994..3cca317 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -359,17 +359,6 @@ do_more:
 	if (!desc)
 		goto error_return;
 
-	if (in_range (le32_to_cpu(desc->bg_block_bitmap), block, count) ||
-	    in_range (le32_to_cpu(desc->bg_inode_bitmap), block, count) ||
-	    in_range (block, le32_to_cpu(desc->bg_inode_table),
-		      sbi->s_itb_per_group) ||
-	    in_range (block + count - 1, le32_to_cpu(desc->bg_inode_table),
-		      sbi->s_itb_per_group))
-		ext3_error (sb, "ext3_free_blocks",
-			    "Freeing blocks in system zones - "
-			    "Block = "E3FSBLK", count = %lu",
-			    block, count);
-
 	/*
 	 * We are about to start releasing blocks in the bitmap,
 	 * so we need undo access.
@@ -392,7 +381,17 @@ do_more:
 
 	jbd_lock_bh_state(bitmap_bh);
 
-	for (i = 0, group_freed = 0; i < count; i++) {
+	for (i = 0, group_freed = 0; i < count; i++, block++) {
+		struct ext3_group_desc *gdp = ext3_get_group_desc(sb, i, NULL);
+		if (block == le32_to_cpu(gdp->bg_block_bitmap) ||
+			block == le32_to_cpu(gdp->bg_inode_bitmap) ||
+			in_range(block, le32_to_cpu(gdp->bg_inode_table),
+				EXT3_SB(sb)->s_itb_per_group)) {
+			ext3_error(sb, __FUNCTION__,
+				"Freeing block in system zone - block = %lu",
+				block);
+			continue;
+		}
 		/*
 		 * An HJ special.  This is expensive...
 		 */
@@ -400,7 +399,7 @@ #ifdef CONFIG_JBD_DEBUG
 		jbd_unlock_bh_state(bitmap_bh);
 		{
 			struct buffer_head *debug_bh;
-			debug_bh = sb_find_get_block(sb, block + i);
+			debug_bh = sb_find_get_block(sb, block);
 			if (debug_bh) {
 				BUFFER_TRACE(debug_bh, "Deleted!");
 				if (!bh2jh(bitmap_bh)->b_committed_data)
@@ -452,7 +451,7 @@ #endif
 			jbd_unlock_bh_state(bitmap_bh);
 			ext3_error(sb, __FUNCTION__,
 				"bit already cleared for block "E3FSBLK,
-				 block + i);
+				block);
 			jbd_lock_bh_state(bitmap_bh);
 			BUFFER_TRACE(bitmap_bh, "bit already cleared");
 		} else {
@@ -479,7 +478,6 @@ #endif
 	*pdquot_freed_blocks += group_freed;
 
 	if (overflow && !err) {
-		block += count;
 		count = overflow;
 		goto do_more;
 	}
@@ -1260,7 +1258,7 @@ #endif
 		*errp = -ENOSPC;
 		goto out;
 	}
-
+repeat:
 	/*
 	 * First, test whether the goal block is free.
 	 */
@@ -1372,12 +1370,21 @@ allocated:
 	    in_range(ret_block, le32_to_cpu(gdp->bg_inode_table),
 		      EXT3_SB(sb)->s_itb_per_group) ||
 	    in_range(ret_block + num - 1, le32_to_cpu(gdp->bg_inode_table),
-		      EXT3_SB(sb)->s_itb_per_group))
-		ext3_error(sb, "ext3_new_block",
+		      EXT3_SB(sb)->s_itb_per_group)) {
+		ext3_error(sb, __FUNCTION__,
 			    "Allocating block in system zone - "
 			    "blocks from "E3FSBLK", length %lu",
 			     ret_block, num);
-
+		/* Note: This will potentially use up one of the handle's
+		 * buffer credits.  Normally we have way too many credits,
+		 * so that is OK.  In _very_ rare cases it might not be OK.
+		 * We will trigger an assertion if we run out of credits,
+		 * and we will have to do a full fsck of the filesystem -
+		 * better than randomly corrupting filesystem metadata.
+		 */
+		ext3_set_bit(ret_block, gdp_bh->b_data);
+		goto repeat;
+	}
 	performed_allocation = 1;
 
 #ifdef CONFIG_JBD_DEBUG
-- 
The only person who always got his work done by Friday was Robinson Crusoe
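
The allocation-side idea in the patch above (if the bitmap claims that a
system-zone block is free, mark it in use and search again) can be shown
with a small userspace toy. This is only a sketch under stated
assumptions: struct toy_group, toy_alloc_block() and the toy_* helpers
are hypothetical names, block numbers are plain bit indices within one
group, and there is no journaling or locking. It is not the ext3 code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the group descriptor fields used here. */
struct toy_group {
	uint32_t block_bitmap;   /* block that holds the block bitmap */
	uint32_t inode_bitmap;   /* block that holds the inode bitmap */
	uint32_t inode_table;    /* first block of the inode table */
	uint32_t itb_per_group;  /* number of inode table blocks */
	uint32_t nblocks;        /* blocks covered by the bitmap */
	uint8_t *bitmap;         /* one bit per block, 1 = in use */
};

/* True if b lies in [first, first + len). */
static bool toy_in_range(uint32_t b, uint32_t first, uint32_t len)
{
	return b >= first && b - first < len;
}

static bool toy_is_system_block(const struct toy_group *g, uint32_t b)
{
	return b == g->block_bitmap || b == g->inode_bitmap ||
	       toy_in_range(b, g->inode_table, g->itb_per_group);
}

static bool toy_test_bit(const uint8_t *map, uint32_t n)
{
	return map[n / 8] & (1u << (n % 8));
}

static void toy_set_bit(uint8_t *map, uint32_t n)
{
	map[n / 8] |= (uint8_t)(1u << (n % 8));
}

/*
 * Return the first free block that is not a system block and mark it
 * in use, or -1 if the group is full.  A "free" bit that points into a
 * system zone is treated as bitmap corruption: the bit is set so the
 * block can never be handed out, and the search is repeated.
 */
static long toy_alloc_block(struct toy_group *g)
{
	for (;;) {
		uint32_t b;

		for (b = 0; b < g->nblocks; b++)
			if (!toy_test_bit(g->bitmap, b))
				break;
		if (b == g->nblocks)
			return -1;
		toy_set_bit(g->bitmap, b);
		if (!toy_is_system_block(g, b))
			return (long)b;
		/* metadata block was marked free: now in use, try again */
	}
}
```

With a bitmap that wrongly shows every block as free and metadata in
blocks 0 to 4, the first call skips those five blocks, leaves them
marked in use, and returns block 5.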




Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Oct 24, 2006  15:44 -0400, Theodore Tso wrote:
  First of all, we would need a way of allowing userspace to specify
  which blocks should be used in the preallocation.
 
 Presumably it could do this in the same way it will be specifying
 which blocks to relocate in the defragger - by passing an extent.
 You would be required to pass the file offset for which to preallocate,
 and optionally an extent for the on-disk allocation itself (if none is
 supplied the kernel will allocate the best extent it can).
 
  Secondly, we would need a way of marking blocks as preallocated but
  not pre-zeroed; otherwise we would have to zero out all of the blocks
  in order to assure security (don't want userspace programs seeing the
  previous contents of the data blocks), only to do the copy and the
  extents vector swap.
 
 This could be mitigated by having the preallocation be done (in the
 defragment case) against a temporary inode in the orphan list (as
 the initial patch did) so if there is a crash it will be released.
 The temporary inode will not be linked into the namespace so it cannot
 be read - only used to hold preallocation.  If this was a write-only
 file handle then we should be OK?
 
 For defragger purposes this would need:
 
 - allocate new temporary inode (VFS + fs, returns write-only fh if
   fs can't properly handle uninitialized extents, or doesn't request
   full-extent zeroing)
 
 for each extent to defragment {
   - preallocate extents on temp inode (fs specific internals)
   - copy data from orig to temp at offset X (VFS, splice or
     e.g. sys_copyfile(src, dst, offset, count) which Linus agreed
     to at KS '05 for network filesystems)
   - migrate copied extent to original inode (fs specific internals)
 }
 
 - free temporary inode (just close of temp fh, frees unmigrated extents).
  Yes, this sounds feasible. We could split the defrag ioctl into two
pieces (addition of a given extent to a file and swapping of extents), which
can have a generic interface...
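
The per-extent sequence quoted above (preallocate on a temporary inode,
copy, migrate the extent, release the temporary inode) can be modelled
in a few lines of userspace C. This is a minimal sketch, not a proposed
interface: struct toy_disk, struct toy_inode and both functions are
hypothetical, the "disk" is an array in memory, and journaling, crash
recovery via the orphan list, and indirect blocks are all ignored.

```c
#include <stdint.h>
#include <string.h>

#define TOY_BLOCKS 16   /* blocks on the toy disk */
#define TOY_BSIZE  4    /* bytes per block */

struct toy_disk {
	uint8_t data[TOY_BLOCKS][TOY_BSIZE];
	uint8_t used[TOY_BLOCKS];       /* 1 = allocated */
};

struct toy_inode {
	uint32_t map[TOY_BLOCKS];       /* logical block -> physical block */
	uint32_t nblocks;               /* logical blocks in the file */
};

/* Preallocate [phys, phys + len) on the temp inode; fail if any is taken. */
static int toy_prealloc(struct toy_disk *d, struct toy_inode *tmp,
			uint32_t off, uint32_t phys, uint32_t len)
{
	uint32_t i;

	if (phys > TOY_BLOCKS || len > TOY_BLOCKS - phys ||
	    off > TOY_BLOCKS || len > TOY_BLOCKS - off)
		return -1;
	for (i = 0; i < len; i++)
		if (d->used[phys + i])
			return -1;      /* nothing has been modified yet */
	for (i = 0; i < len; i++) {
		d->used[phys + i] = 1;
		tmp->map[off + i] = phys + i;
	}
	return 0;
}

/* Relocate logical blocks [off, off + len) of orig to physical block phys. */
static int toy_defrag_extent(struct toy_disk *d, struct toy_inode *orig,
			     uint32_t off, uint32_t phys, uint32_t len)
{
	struct toy_inode tmp = { {0}, 0 };
	uint32_t i;

	if (off > orig->nblocks || len > orig->nblocks - off)
		return -1;
	if (toy_prealloc(d, &tmp, off, phys, len) < 0)
		return -1;
	for (i = 0; i < len; i++) {
		uint32_t old_blk = orig->map[off + i];
		uint32_t new_blk = tmp.map[off + i];

		/* copy data from orig to temp */
		memcpy(d->data[new_blk], d->data[old_blk], TOY_BSIZE);
		/* migrate: orig now owns the new block, temp the old one */
		orig->map[off + i] = new_blk;
		tmp.map[off + i] = old_blk;
	}
	/* "close" the temp inode: release whatever it still holds */
	for (i = 0; i < len; i++)
		d->used[tmp.map[off + i]] = 0;
	return 0;
}
```

In the toy, a failed preallocation leaves the file untouched, which
mirrors the point made in the thread that the kernel can simply refuse
the request when the destination blocks are no longer free.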

 I don't think this is much more work than implementing all of this
 functionality as part of a monolithic online defrag function, assuming
 we don't require full-file copies in order to do defrag.
  Yes, it's not more work than supporting swapping of extents in the
middle of the file. I just haven't decided yet how to handle indirect
blocks when relocating in the middle of the file. Should they be
relocated or shouldn't they? Probably they should be relocated at least
when they are fully contained in the relocated interval, or, better
said, when all the blocks they reference are also in the interval
(this also handles the case of EOF). But if you would like to
relocate the file in parts, this is still not quite what you want (you won't be
able to relocate indirect blocks on the boundary of intervals) :(.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
 On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
  On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
   On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
   So why are you arguing that an interface is no good because it
   is fundamentally racy? ;)
  
  My point was that it is silly to introduce obviously racy code into the
  kernel, when -- inside the kernel -- it could be handled race-free.
 
 So how do you then get the generic interface to allocate blocks
 specified by userspace race free?

As has been repeatedly stated, there is no generic.  There MUST be
filesystem-specific knowledge during these operations.


 If userspace directed allocation requires deep knowledge of the
 filesystem metadata (this is what you are saying they need to do,
 right?), then these applications will never, ever make use of this
 interface and we'll continue to have problems with them.

Completely false assumptions.  There is no difference in handling of
knowledge, be it kernel space or userspace.


  Further, in the case being discussed in this thread, ext2meta has
  already been proven a workable solution.
 
 Sure, but that's not a generic solution to a problem common to
 all filesystems

You clearly don't know what I'm talking about.  ext2meta is an example
of a filesystem-specific metadata access method, applicable to tasks
such as online optimization.

Implement that tiny kernel module for each filesystem, and you have
everything you need, without races.  This was discussed years ago;
review the mailing lists.  Google for 'Alexander Viro' and 'ext2meta'.

Jeff





Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:
   Yes, this sounds feasible. We could split the defrag ioctl into two
 pieces (addition of a given extent to a file and swapping of extents), which
 can have a generic interface...

An ioctl is UGLY.

This was discussed years ago.  Google for 'Alexander Viro' and
'ext2meta'.  That's a clean, flexible, extensible way to access metadata
online.  No need for ioctl binary translation across 32bit-64bit, or
any other ioctl issue.

Jeff





Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:
    Yes, this sounds feasible. We could split the defrag ioctl into two
  pieces (addition of a given extent to a file and swapping of extents), which
  can have a generic interface...
 
 An ioctl is UGLY.
  Agreed.

 This was discussed years ago.  Google for 'Alexander Viro' and
 'ext2meta'.  That's a clean, flexible, extensible way to access metadata
 online.  No need for ioctl binary translation across 32bit-64bit, or
 any other ioctl issue.
  I've briefly looked at this and this kind of interface has some
appeal. On the other hand, it's not obvious to me how to implement in
this interface an *atomic* operation such as "copy data from file F to a
given set of blocks and rewrite the pointers to the original blocks with
pointers to the new blocks". Something like this is needed for what we
want to do...
Also, if we'd like to implement an operation like "add this block to file F
at position P", we have to make sure that all the necessary updates
(bitmap updates, inode updates, indirect block updates) go into one
transaction. That basically means that either ext3meta has to have a way
to do this in a single operation, or we have to give userspace a way
to start/stop a transaction, and that starts to be really a mess because of
various deadlocks and so on.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Wed, Oct 25, 2006 at 07:58:51PM +0200, Jan Kara wrote:
    I've briefly looked at this and this kind of interface has some
  appeal. On the other hand, it's not obvious to me how to implement in
  this interface an *atomic* operation such as "copy data from file F to a
  given set of blocks and rewrite the pointers to the original blocks with
  pointers to the new blocks". Something like this is needed for what we
  want to do...
  Also, if we'd like to implement an operation like "add this block to file F
  at position P", we have to make sure that all the necessary updates
  (bitmap updates, inode updates, indirect block updates) go into one
  transaction. That basically means that either ext3meta has to have a way
  to do this in a single operation, or we have to give userspace a way
  to start/stop a transaction, and that starts to be really a mess because of
  various deadlocks and so on.
 
 Agreed, these issues exist.  But these issues exist independent of
 whether an ioctl or ext3meta is used.  It's all the responsibility
 of the implementor to define the interface.
 
 My contention is that the ext3meta interface method would be much more
 robust than ioctl.  It's a namespace inside which you can define any
 inodes/dirents you wish, for the operations you desire.
  I see. So you mean that in our ext3meta filesystem we'd have a file
named add_this_extent_to_inode and a file reloc_inode_interval and
they'd be fed essentially the same info as the current ioctl interface and
do the same thing as we currently do. Hmm, I don't find it that nice any
more but yes, this would work.

 Heck, according to my sf.net/projects/gkernel CVS log, you offered
 some helpful review comments to me when I was implementing ext2meta ;-)
  Looking at those mails it was already quite some time ago so I
forgot about it  ;)
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 08:25:30PM +0200, Jan Kara wrote:
   I see. So you mean that in our ext3meta filesystem we'd have a file
 named add_this_extent_to_inode and a file reloc_inode_interval and
 they'd be fed essentially the same info as the current ioctl interface and
 do the same thing as we currently do. Hmm, I don't find it that nice any
 more but yes, this would work.

It depends on the operation.  ext2meta[1] works fine for online
defrag, just exporting metadata objects and providing read(2)
and write(2) operations on them.  Adding 'trigger' files (like your
add_this_extent_to_inode) may make sense for some operations, indeed,
but we need to see the whole picture before really understanding
whether that interface is optimal.

Jeff


[1] http://linux.yyz.us/misc/ext2meta.c


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jan Kara
 On Oct 23, 2006  18:03 +0200, Jan Kara wrote:
  Andreas Dilger wrote:
   I would in fact go so far as to allow only a single extent to be specified
   per call.  This is to avoid the passing of any pointers as part of the
   interface (hello ioctl police :-), and also makes the kernel code simpler.
   I don't think the syscall/ioctl overhead is significant compared to the
   journal and IO overhead.
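
One way to read the single-extent, pointer-free suggestion above is as a
request structure built only from fixed-width integers, so that its
layout is the same for 32-bit and 64-bit callers and needs no
translation layer. The following is a sketch under that assumption;
struct toy_reloc_extent, its field names and toy_reloc_extent_valid()
are hypothetical and do not come from the patch under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request describing the relocation of a single extent. */
struct toy_reloc_extent {
	uint64_t file_offset;  /* logical start within the file, in blocks */
	uint64_t goal_block;   /* first on-disk target block, 0 = fs chooses */
	uint32_t length;       /* extent length in blocks */
	uint32_t flags;        /* reserved, must be zero in this sketch */
};

/*
 * The 64-bit members come first and the two 32-bit members fill the
 * tail exactly, so no padding is inserted regardless of word size.
 */
_Static_assert(sizeof(struct toy_reloc_extent) == 24,
	       "layout must not depend on the word size");
_Static_assert(offsetof(struct toy_reloc_extent, length) == 16,
	       "no hidden padding before length");

/* Basic sanity checks a receiver might apply before acting on a request. */
static int toy_reloc_extent_valid(const struct toy_reloc_extent *r)
{
	return r->length != 0 && r->flags == 0 &&
	       r->file_offset <= UINT64_MAX - r->length &&
	       (r->goal_block == 0 || r->goal_block <= UINT64_MAX - r->length);
}
```

Because the structure contains no pointers and no `long` members, the
same bytes mean the same thing to a 32-bit process on a 64-bit kernel,
which is the ioctl translation problem raised elsewhere in this thread.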
 
  ...it makes it kind of
  harder to tell where indirect blocks would go - and it would be
  impossible for the defragmenter to force some unusual placement of
  indirect blocks...
 
 It would be possible to specify indirect block relocation in same manner
 as regular block relocation I think.  Allocate a new block, copy contents,
 flush block from cache, fix up reference (inode, dindirect), commit.
  Yes, but there's a question of the interface to this operation. How to
specify which indirect block I mean? Obviously we could introduce a
separate call for remapping indirect blocks but I find this solution
kind of clumsy...

Bye
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Jeff Garzik
On Wed, Oct 25, 2006 at 08:36:56PM +0200, Jan Kara wrote:
   Yes, but there's a question of the interface to this operation. How to
 specify which indirect block I mean? Obviously we could introduce a
 separate call for remapping indirect blocks but I find this solution
 kind of clumsy...

Agreed...  that gets nasty real quick.

Jeff





Re: [RFC] Ext3 online defrag

2006-10-25 Thread David Chinner
On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
 On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
  On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
   On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
    On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
    So why are you arguing that an interface is no good because it
    is fundamentally racy? ;)
   
   My point was that it is silly to introduce obviously racy code into the
   kernel, when -- inside the kernel -- it could be handled race-free.
  
  So how do you then get the generic interface to allocate blocks
  specified by userspace race free?
 
 As has been repeatedly stated, there is no generic.  There MUST be
 filesystem-specific knowledge during these operations.

What information? All we need to know is where the free disk space
is, and have a method to attempt to allocate from it. That's _easy_
to abstract into a common interface via the VFS.

   Further, in the case being discussed in this thread, ext2meta has
   already been proven a workable solution.
  
  Sure, but that's not a generic solution to a problem common to
  all filesystems
 
 You clearly don't know what I'm talking about.  ext2meta is an example
 of a filesystem-specific metadata access method, applicable to tasks
 such as online optimization.

I know exactly what ext2meta is. I said it's not a generic solution
and you say it's a filesystem specific solution.  I think we're
agreeing here. ;)

We don't need to expose anything filesystem specific to userspace to
implement this.  Online data movement (i.e. the defrag mechanism)
becomes something like:

	do {
		get_free_list(dst_fd, location, len, list)
		/* select extent to use */
		alloc_from_list(dst_fd, list[X], off, len)
	} while (ENOALLOC)
	move_data(src_fd, dst_fd, off, len);

And this would work on any filesystem type that implemented these
interfaces. Hence tools like a startup file optimiser would
only need to be written once, rather than needing a different
tool for every different filesystem type.
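
The retry loop in the pseudocode above can be exercised against an
in-memory free map. This is a sketch only: the toy_* functions are
hypothetical stand-ins for the proposed get_free_list() and
alloc_from_list() calls, -EBUSY is used where the pseudocode says
ENOALLOC (which is not an existing errno value), and race_hook is a
test device that plays the part of a concurrent allocator.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NBLOCKS 32

struct toy_extent {
	uint32_t start;
	uint32_t len;
};

struct toy_fs {
	uint8_t used[TOY_NBLOCKS];             /* 1 = allocated */
	void (*race_hook)(struct toy_fs *fs);  /* fires once, after a snapshot */
};

/* Stand-in for a concurrent allocator: grabs the first free block. */
static void toy_steal_first_free(struct toy_fs *fs)
{
	uint32_t b;

	for (b = 0; b < TOY_NBLOCKS; b++) {
		if (!fs->used[b]) {
			fs->used[b] = 1;
			return;
		}
	}
}

/* Snapshot of up to max free runs of at least len blocks; may go stale. */
static size_t toy_get_free_list(const struct toy_fs *fs, uint32_t len,
				struct toy_extent *list, size_t max)
{
	size_t n = 0;
	uint32_t run = 0, b;

	for (b = 0; b <= TOY_NBLOCKS; b++) {
		if (b < TOY_NBLOCKS && !fs->used[b]) {
			run++;
			continue;
		}
		if (len > 0 && run >= len && n < max) {
			list[n].start = b - run;
			list[n].len = run;
			n++;
		}
		run = 0;
	}
	return n;
}

/* Claim len blocks at the start of ext; -EBUSY if the snapshot went stale. */
static int toy_alloc_from_list(struct toy_fs *fs, struct toy_extent ext,
			       uint32_t len)
{
	uint32_t i;

	if (len > ext.len || ext.start > TOY_NBLOCKS ||
	    len > TOY_NBLOCKS - ext.start)
		return -EINVAL;
	for (i = 0; i < len; i++)
		if (fs->used[ext.start + i])
			return -EBUSY;
	for (i = 0; i < len; i++)
		fs->used[ext.start + i] = 1;
	return 0;
}

/* The loop itself: returns the first allocated block or a negative errno. */
static long toy_alloc_extent(struct toy_fs *fs, uint32_t len)
{
	for (;;) {
		struct toy_extent list[4];
		int err;

		if (toy_get_free_list(fs, len, list, 4) == 0)
			return -ENOSPC;
		if (fs->race_hook) {
			void (*hook)(struct toy_fs *) = fs->race_hook;

			fs->race_hook = NULL;
			hook(fs);       /* somebody else allocates here */
		}
		err = toy_alloc_from_list(fs, list[0], len);
		if (err == 0)
			return (long)list[0].start;
		if (err != -EBUSY)
			return err;
		/* lost the race: take a fresh snapshot and try again */
	}
}
```

The snapshot is allowed to be stale; correctness rests entirely on the
allocation step failing cleanly, which is the property both sides of
this thread agree the kernel has to provide.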

Remember, I'm not just talking about defrag - I'm talking about
an interface that is actually useful to apps that might care
about how data is laid out on disk but the application writers
don't know anything about how filesystem X or Y or Z is
implemented. Putting the burden of learning about filesystem
internals on application developers is not the correct solution.

I see substantial benefit moving forward from having filesystem
independent interfaces. Many features that filesystems implement
are common, and as time goes on the common feature set of the
different filesystems gets larger. So why shouldn't we be
trying to make common operations generic so that every filesystem
can benefit from the latest and greatest tool?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] Ext3 online defrag

2006-10-25 Thread Theodore Tso
On Thu, Oct 26, 2006 at 11:40:20AM +1000, David Chinner wrote:
 We don't need to expose anything filesystem specific to userspace to
 implement this.  Online data movement (i.e. the defrag mechanism)
 becomes something like:
 
   do {
   get_free_list(dst_fd, location, len, list)
   /* select extent to use */
   alloc_from_list(dst_fd, list[X], off, len)
   } while (ENOALLOC)
   move_data(src_fd, dst_fd, off, len);
 
 And this would work on any filesystem type that implemented these
 interfaces. Hence tools like a startup file optimiser would
 only need to be written once, rather than needing a different
 tool for every different filesystem type.

Yeah, but that's simply not enough.  A good defragger needs to know
about a filesystem's allocation policies, and move files so they are
optimally located, given the filesystem layout.  For example, in
ext2/3/4 we will want to move blocks so they are in the same block group
as the inode.  That's filesystem specific information; other
filesystems will require different policies.
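
As an illustration of the block-group policy just mentioned: in
ext2/ext3, inode numbers start at 1 and both inodes and blocks are
divided evenly among groups, so a tool can tell whether a candidate
block sits in the same group as a file's inode with two divisions. The
sketch below assumes the caller has already read the three geometry
values from the superblock; struct toy_geometry and the function names
are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* The superblock values needed for group arithmetic. */
struct toy_geometry {
	uint32_t first_data_block;  /* 1 for 1 KiB blocks, otherwise 0 */
	uint32_t blocks_per_group;
	uint32_t inodes_per_group;
};

/* Inode numbers are 1-based, so inode 1 is the first inode of group 0. */
static uint32_t toy_group_of_inode(const struct toy_geometry *g, uint32_t ino)
{
	return (ino - 1) / g->inodes_per_group;
}

/* Blocks are counted from the first data block. */
static uint32_t toy_group_of_block(const struct toy_geometry *g, uint32_t block)
{
	return (block - g->first_data_block) / g->blocks_per_group;
}

/* True if placing data in `block` keeps it in the inode's own group. */
static bool toy_block_is_local(const struct toy_geometry *g,
			       uint32_t ino, uint32_t block)
{
	return toy_group_of_inode(g, ino) == toy_group_of_block(g, block);
}
```

For a made-up geometry of 32768 blocks and 8192 inodes per group, inode
8193 lives in group 1, so block 40000 counts as local to it and block
100 does not. A generic free-list interface would need some way to
express this kind of preference, or leave it to filesystem-specific
tools.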

 Remember, I'm not just talking about defrag - I'm talking about
 an interface that is actually useful to apps that might care
 about how data is laid out on disk but the application writers
 don't know anything about how filesystem X or Y or Z is
 implemented. Putting the burden of learning about filesystem
 internals on application developers is not the correct solution.

Unfortunately, a defragger *has* to know about some very low-level
filesystem specific information if it wants to do a good job.

- Ted