Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Thu, Feb 08, 2007 at 11:47:39AM +0100, Jan Kara wrote: > > > > Well. Do we really? Are we looking for a 100% solution here, or a 90% one? > Umm, I think that for ext3 having data on one end of the disk and > indirect blocks on the other end of the disk does not quite help (not > mentioning that it can create bad free space fragmentation over the time). > I have not measured it but I'd guess that it would erase the effect of > moving data closer together. At least for sequential reads.. I don't think anyone is saying we can ignore the metadata; but the fact is, the cleanest solution for 90% of the problem is to use the page cache, and as far as the other 10%, Linus has been pushing us to move at least the directories into the page cache, and it's not insane to consider moving the rest of the metadata into page cache. At least it's something we should consider carefully. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Thu 08-02-07 02:32:13, Andrew Morton wrote: > On Thu, 8 Feb 2007 11:21:02 +0100 Jan Kara <[EMAIL PROTECTED]> wrote: > > > On Thu 08-02-07 01:45:29, Andrew Morton wrote: > > > > > > I though Andreas meant "any write changes" - i.e. you check that noone > > > > has open file descriptor for writing and block any new open for writing. > > > > That can be done quite easily. > > > > Anyway, I agree with you that userspace solution to a possible page > > > > cache pollution is preferable after thinking about it for a while. > > > > As I've been thinking about it, we could actually do the copying > > > > from user space. We could do something like: > > > > block any writes to file (as I described above) > > > > craft new inode with blocks allocated as we want (using preallocation, > > > > we should mostly have the kernel infrastructure we need) > > > > copy data using splice syscall > > > > call the kernel to switch data > > > > > > > > > > I don't think we need to block any writes to any file or anything. > > > > > > To move a page within a file: > > > > > > fd = open(file); > > > p = mmap(fd); > > > the_page_was_in_core = mincore(p, offset); > > > munmap(p); > > > ioctl(fd, ..., new_block); > > > > > > > > > read_cache_page(inode, offset); > > > lock_page(page); > > > if (try_to_free_buffers(page)) { > > > > > > set_page_dirty(page); > > > } > > > unlock_page(page); > > > > > > if (the_page_was_in_core) { > > > sync_file_range(fd, offset SYNC_FILE_RANGE_WAIT_BEFORE| > > > SYNC_FILE_RANGE_WRITE| > > > SYNC_FILE_RANGE_WAIT_AFTER); > > > fadvise(fd, offset, FADV_DONTNEED); > > > } > > > > > > completely coherent with pagecache, quite safe in the presence of mmap, > > > mlock, O_DIRECT, everything else. Also fully journallable in-kernel. > > Yes, this is the simple way. But I see two disadvantages: > > 1) You'd like to relocate metadata (indirect blocks) too. > > Well. Do we really? Are we looking for a 100% solution here, or a 90% one? Umm, I think that for ext3 having data on one end of the disk and indirect blocks on the other end of the disk does not quite help (not mentioning that it can create bad free space fragmentation over the time). I have not measured it but I'd guess that it would erase the effect of moving data closer together. At least for sequential reads.. > Relocating data is the main thing. After that, yeah, relocating metadata, > inodes and directories is probably a second-order thing. > > > For that you need > >a different mechanism. > > I suspect a similar approach will work there: load and lock the > buffer_heads (or maybe just the top-level buffer_head) and then alter their > contents. It could be that verify_chain() will just magically do the right > thing there, but some changes might be needed. Yes, it could be done. I just wanted to point to the fact that things may not be as simple in your solution either... Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Thu, 8 Feb 2007 11:21:02 +0100 Jan Kara <[EMAIL PROTECTED]> wrote: > On Thu 08-02-07 01:45:29, Andrew Morton wrote: > > > > I though Andreas meant "any write changes" - i.e. you check that noone > > > has open file descriptor for writing and block any new open for writing. > > > That can be done quite easily. > > > Anyway, I agree with you that userspace solution to a possible page > > > cache pollution is preferable after thinking about it for a while. > > > As I've been thinking about it, we could actually do the copying > > > from user space. We could do something like: > > > block any writes to file (as I described above) > > > craft new inode with blocks allocated as we want (using preallocation, > > > we should mostly have the kernel infrastructure we need) > > > copy data using splice syscall > > > call the kernel to switch data > > > > > > > I don't think we need to block any writes to any file or anything. > > > > To move a page within a file: > > > > fd = open(file); > > p = mmap(fd); > > the_page_was_in_core = mincore(p, offset); > > munmap(p); > > ioctl(fd, ..., new_block); > > > > > > read_cache_page(inode, offset); > > lock_page(page); > > if (try_to_free_buffers(page)) { > > > > set_page_dirty(page); > > } > > unlock_page(page); > > > > if (the_page_was_in_core) { > > sync_file_range(fd, offset SYNC_FILE_RANGE_WAIT_BEFORE| > > SYNC_FILE_RANGE_WRITE| > > SYNC_FILE_RANGE_WAIT_AFTER); > > fadvise(fd, offset, FADV_DONTNEED); > > } > > > > completely coherent with pagecache, quite safe in the presence of mmap, > > mlock, O_DIRECT, everything else. Also fully journallable in-kernel. > Yes, this is the simple way. But I see two disadvantages: > 1) You'd like to relocate metadata (indirect blocks) too. Well. Do we really? Are we looking for a 100% solution here, or a 90% one? Relocating data is the main thing. After that, yeah, relocating metadata, inodes and directories is probably a second-order thing. > For that you need >a different mechanism. I suspect a similar approach will work there: load and lock the buffer_heads (or maybe just the top-level buffer_head) and then alter their contents. It could be that verify_chain() will just magically do the right thing there, but some changes might be needed. > In my approach, you can mostly assume you've got >sanely laid out metadata and so the existence of such mechanism is not >so important. > 2) You'd like to allocate new blocks in big chunks. So your kernel function >should rather take a range. Also when you fail in the middle of >relocating a file (for example the block you'd like to use is already >taken by someone else), I find it nice if you can return at least to the >original state. But that's probably not important. Well yes, that was a minimal sketch. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Thu 08-02-07 01:45:29, Andrew Morton wrote: > > I though Andreas meant "any write changes" - i.e. you check that noone > > has open file descriptor for writing and block any new open for writing. > > That can be done quite easily. > > Anyway, I agree with you that userspace solution to a possible page > > cache pollution is preferable after thinking about it for a while. > > As I've been thinking about it, we could actually do the copying > > from user space. We could do something like: > > block any writes to file (as I described above) > > craft new inode with blocks allocated as we want (using preallocation, > > we should mostly have the kernel infrastructure we need) > > copy data using splice syscall > > call the kernel to switch data > > > > I don't think we need to block any writes to any file or anything. > > To move a page within a file: > > fd = open(file); > p = mmap(fd); > the_page_was_in_core = mincore(p, offset); > munmap(p); > ioctl(fd, ..., new_block); > > > read_cache_page(inode, offset); > lock_page(page); > if (try_to_free_buffers(page)) { > > set_page_dirty(page); > } > unlock_page(page); > > if (the_page_was_in_core) { > sync_file_range(fd, offset SYNC_FILE_RANGE_WAIT_BEFORE| > SYNC_FILE_RANGE_WRITE| > SYNC_FILE_RANGE_WAIT_AFTER); > fadvise(fd, offset, FADV_DONTNEED); > } > > completely coherent with pagecache, quite safe in the presence of mmap, > mlock, O_DIRECT, everything else. Also fully journallable in-kernel. Yes, this is the simple way. But I see two disadvantages: 1) You'd like to relocate metadata (indirect blocks) too. For that you need a different mechanism. In my approach, you can mostly assume you've got sanely laid out metadata and so the existence of such mechanism is not so important. 2) You'd like to allocate new blocks in big chunks. So your kernel function should rather take a range. Also when you fail in the middle of relocating a file (for example the block you'd like to use is already taken by someone else), I find it nice if you can return at least to the original state. But that's probably not important. Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Wed 07-02-07 12:56:59, Andrew Morton wrote: > On Wed, 7 Feb 2007 13:46:57 -0700 > Andreas Dilger <[EMAIL PROTECTED]> wrote: > > > On Feb 06, 2007 17:35 -0800, Andrew Morton wrote: > > > On Mon, 5 Feb 2007 14:12:04 +0100 > > > Jan Kara <[EMAIL PROTECTED]> wrote: > > > > > Move the blocks on the temporary inode to the original inode > > > > > by a page. > > > > > 1. Read the file data from the old blocks to the page > > > > > 2. Move the block on the temporary inode to the original inode > > > > > 3. Write the file data on the page into the new blocks > > > > I have one thing - it's probably not good to use page cache for > > > > defragmentation. > > > > > > Then it is no longer online defragmentation. The issues with maintaining > > > correctness and coherency with ongoing VFS activity would be truly > > > ghastly. > > > > > > If we're worried about pagecache pollution then it would be better to > > > control > > > that from userspace via fadvise(). > > > > It should be possible to have the online defrag tool lock the inode against > > any changes, > > Sounds easy when you say it fast. But how do we "lock" against, say, a > read pagefault? Only by writing back then removing the pagecache page then > reinstantiating it as a locked, not-uptodate page and then removing it from > pagecache afterwards prior to unlocking it. Or something. > > I don't think we want to go there. I though Andreas meant "any write changes" - i.e. you check that noone has open file descriptor for writing and block any new open for writing. That can be done quite easily. Anyway, I agree with you that userspace solution to a possible page cache pollution is preferable after thinking about it for a while. As I've been thinking about it, we could actually do the copying from user space. We could do something like: block any writes to file (as I described above) craft new inode with blocks allocated as we want (using preallocation, we should mostly have the kernel infrastructure we need) copy data using splice syscall call the kernel to switch data But maybe I miss something and it's more complicated than I think. Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Thu, 8 Feb 2007 10:29:45 +0100 Jan Kara <[EMAIL PROTECTED]> wrote: > On Wed 07-02-07 12:56:59, Andrew Morton wrote: > > On Wed, 7 Feb 2007 13:46:57 -0700 > > Andreas Dilger <[EMAIL PROTECTED]> wrote: > > > > > On Feb 06, 2007 17:35 -0800, Andrew Morton wrote: > > > > On Mon, 5 Feb 2007 14:12:04 +0100 > > > > Jan Kara <[EMAIL PROTECTED]> wrote: > > > > > > Move the blocks on the temporary inode to the original inode > > > > > > by a page. > > > > > > 1. Read the file data from the old blocks to the page > > > > > > 2. Move the block on the temporary inode to the original inode > > > > > > 3. Write the file data on the page into the new blocks > > > > > I have one thing - it's probably not good to use page cache for > > > > > defragmentation. > > > > > > > > Then it is no longer online defragmentation. The issues with > > > > maintaining > > > > correctness and coherency with ongoing VFS activity would be truly > > > > ghastly. > > > > > > > > If we're worried about pagecache pollution then it would be better to > > > > control > > > > that from userspace via fadvise(). > > > > > > It should be possible to have the online defrag tool lock the inode > > > against > > > any changes, > > > > Sounds easy when you say it fast. But how do we "lock" against, say, a > > read pagefault? Only by writing back then removing the pagecache page then > > reinstantiating it as a locked, not-uptodate page and then removing it from > > pagecache afterwards prior to unlocking it. Or something. > > > > I don't think we want to go there. > I though Andreas meant "any write changes" - i.e. you check that noone > has open file descriptor for writing and block any new open for writing. > That can be done quite easily. > Anyway, I agree with you that userspace solution to a possible page > cache pollution is preferable after thinking about it for a while. > As I've been thinking about it, we could actually do the copying > from user space. We could do something like: > block any writes to file (as I described above) > craft new inode with blocks allocated as we want (using preallocation, > we should mostly have the kernel infrastructure we need) > copy data using splice syscall > call the kernel to switch data > I don't think we need to block any writes to any file or anything. To move a page within a file: fd = open(file); p = mmap(fd); the_page_was_in_core = mincore(p, offset); munmap(p); ioctl(fd, ..., new_block); read_cache_page(inode, offset); lock_page(page); if (try_to_free_buffers(page)) { set_page_dirty(page); } unlock_page(page); if (the_page_was_in_core) { sync_file_range(fd, offset SYNC_FILE_RANGE_WAIT_BEFORE| SYNC_FILE_RANGE_WRITE| SYNC_FILE_RANGE_WAIT_AFTER); fadvise(fd, offset, FADV_DONTNEED); } completely coherent with pagecache, quite safe in the presence of mmap, mlock, O_DIRECT, everything else. Also fully journallable in-kernel. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC][PATCH 2/3] Move the file data to the new blocks
Move the blocks on the temporary inode to the original inode by a page. 1. Read the file data from the old blocks to the page 2. Move the block on the temporary inode to the original inode 3. Write the file data on the page into the new blocks Signed-off-by: Takashi Sato <[EMAIL PROTECTED]> --- diff -Nrup -X linux-2.6.19-rc6.org/Documentation/dontdiff linux-2.6.19-rc6-1/fs/ext4/extents.c linux-2.6.19-rc6-full/fs/ext4/extents.c --- linux-2.6.19-rc6-1/fs/ext4/extents.c2007-02-08 14:13:49.0 +0900 +++ linux-2.6.19-rc6-full/fs/ext4/extents.c 2007-02-08 14:09:43.0 +0900 @@ -2533,6 +2533,653 @@ ext4_ext_next_extent(struct inode *inode } /** + * ext4_ext_merge_across - merge extents across leaf block + * + * @handle journal handle + * @inode target file's inode + * @o_startfirst original extent to be defraged + * @o_end last original extent to be defraged + * @start_ext first new extent to be merged + * @new_extmiddle of new extent to be merged + * @end_extlast new extent to be merged + * + * This function returns 0 if succeed, otherwise returns error value. + */ +static int +ext4_ext_merge_across_blocks(handle_t *handle, struct inode *inode, + struct ext4_extent *o_start, + struct ext4_extent *o_end, struct ext4_extent *start_ext, + struct ext4_extent *new_ext,struct ext4_extent *end_ext) +{ + struct ext4_ext_path *org_path = NULL; + unsigned long eblock = 0; + int err = 0; + int new_flag = 0; + int end_flag = 0; + + if (le16_to_cpu(start_ext->ee_len) && + le16_to_cpu(new_ext->ee_len) && + le16_to_cpu(end_ext->ee_len)) { + + if ((o_start) == (o_end)) { + + /* start_ext new_extend_ext +* dest |-|---|| +* org |--| +*/ + + ext4_free_blocks(handle, inode, ext_pblock(o_start) + +le16_to_cpu(start_ext->ee_len), +le16_to_cpu(new_ext->ee_len), 0); + + end_flag = 1; + + } else { + + /* start_ext new_ext end_ext +* dest |-|--|-| +* org |---|--| +*/ + + ext4_free_blocks(handle, inode, ext_pblock(o_start) + + le16_to_cpu(start_ext->ee_len), + le16_to_cpu(o_start->ee_len) + - le16_to_cpu(start_ext->ee_len), 0); + + ext4_free_blocks(handle, inode, ext_pblock(o_end), + le16_to_cpu(o_end->ee_len) + - le16_to_cpu(end_ext->ee_len), 0); + + o_end->ee_block = end_ext->ee_block; + o_end->ee_len = end_ext->ee_len; + ext4_ext_store_pblock(o_end, ext_pblock(end_ext)); + } + + o_start->ee_len = start_ext->ee_len; + new_flag = 1; + + } else if ((le16_to_cpu(start_ext->ee_len)) && + (le16_to_cpu(new_ext->ee_len)) && + (!le16_to_cpu(end_ext->ee_len)) && + ((o_start) == (o_end))) { + + /* start_extnew_ext +* dest |--|---| +* org |--| +*/ + + ext4_free_blocks(handle, inode, ext_pblock(o_start) + + le16_to_cpu(start_ext->ee_len), + le16_to_cpu(new_ext->ee_len), 0); + + o_start->ee_len = start_ext->ee_len; + new_flag = 1; + + } else if ((!le16_to_cpu(start_ext->ee_len)) && + (le16_to_cpu(new_ext->ee_len)) && + (le16_to_cpu(end_ext->ee_len)) && + ((o_start) == (o_end))) { + + /* new_extend_ext +* dest |--|---| +* org |--| +*/ + + ext4_free_blocks(handle, inode, ext_pblock(o_end), + le16_to_cpu(new_ext->ee_len), 0); + + o_end->ee_block = end_ext->ee_block; + o_end->ee_len = end_ext->ee_len; + ext4_ext_store_pblock(o_end, ext_pblock(end_ext)); + + /* If new_ext was first block */ + if (!new_ext->ee_block) + eblock = 0; + else + eblock = le32_to_cpu(new_ext->ee_block); + + new_flag = 1; + } else { + printk("Unexpected case \n"); + return -EIO; +
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Wed, 7 Feb 2007 13:46:57 -0700 Andreas Dilger <[EMAIL PROTECTED]> wrote: > On Feb 06, 2007 17:35 -0800, Andrew Morton wrote: > > On Mon, 5 Feb 2007 14:12:04 +0100 > > Jan Kara <[EMAIL PROTECTED]> wrote: > > > > Move the blocks on the temporary inode to the original inode > > > > by a page. > > > > 1. Read the file data from the old blocks to the page > > > > 2. Move the block on the temporary inode to the original inode > > > > 3. Write the file data on the page into the new blocks > > > I have one thing - it's probably not good to use page cache for > > > defragmentation. > > > > Then it is no longer online defragmentation. The issues with maintaining > > correctness and coherency with ongoing VFS activity would be truly ghastly. > > > > If we're worried about pagecache pollution then it would be better to > > control > > that from userspace via fadvise(). > > It should be possible to have the online defrag tool lock the inode against > any changes, Sounds easy when you say it fast. But how do we "lock" against, say, a read pagefault? Only by writing back then removing the pagecache page then reinstantiating it as a locked, not-uptodate page and then removing it from pagecache afterwards prior to unlocking it. Or something. I don't think we want to go there. > flush all pages out of the cache for that inode, and then do > the reallocated outside of the page cache. For inodes not already in cache > this is a no-op. For the (hopefully rare) case were the inode already has > cached pages and also needs to be reallocated it would be a performance hit. > > Alternately, we could skip files currently being modified (or mmaped), or > even recently modified (e.g. within the last 30 minutes) in the default case, > on the assumption that they might be deleted soon anyways. argh. It's simple to just use pagecache. The "we don't want to swamp the machine with pagecache" argument is bogus. If it becomes a problem (and it might not) then it is very simple to control the pagecache from userspace. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Feb 06, 2007 17:35 -0800, Andrew Morton wrote: > On Mon, 5 Feb 2007 14:12:04 +0100 > Jan Kara <[EMAIL PROTECTED]> wrote: > > > Move the blocks on the temporary inode to the original inode > > > by a page. > > > 1. Read the file data from the old blocks to the page > > > 2. Move the block on the temporary inode to the original inode > > > 3. Write the file data on the page into the new blocks > > I have one thing - it's probably not good to use page cache for > > defragmentation. > > Then it is no longer online defragmentation. The issues with maintaining > correctness and coherency with ongoing VFS activity would be truly ghastly. > > If we're worried about pagecache pollution then it would be better to control > that from userspace via fadvise(). It should be possible to have the online defrag tool lock the inode against any changes, flush all pages out of the cache for that inode, and then do the reallocated outside of the page cache. For inodes not already in cache this is a no-op. For the (hopefully rare) case were the inode already has cached pages and also needs to be reallocated it would be a performance hit. Alternately, we could skip files currently being modified (or mmaped), or even recently modified (e.g. within the last 30 minutes) in the default case, on the assumption that they might be deleted soon anyways. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
Hi, +ext4_ext_replace_branches(struct inode *org_inode, struct inode *dest_inode, + pgoff_t from_page, pgoff_t dest_from_page, + pgoff_t count_page, unsigned long *delete_start) +{ + struct ext4_ext_path *org_path = NULL; + struct ext4_ext_path *dest_path = NULL; + struct ext4_extent *oext, *dext; + struct ext4_extent tmp_ext; + int err = 0; + int depth; + unsigned long from, count, dest_off, diff, replaced_count = 0; These should be sector_t, shouldn't they? At some point should we start using blkcnt_t properly? (block-in[-large]-file, not block-in[-large]-device?) I think that's what it was introduced for, although it's not in wide use at this point. I guess the type really isn't used anywhere else; just in the inode's i_blocks. Hmm. On reflection, I think we should use ext4_fsblk_t in this case, because some ext4 codes such as ext4_ext_get_blocks() use it. int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, ext4_fsblk_t iblock, So I would like to change "unsigned long" into ext4_fsblk_t in my next patch. Cheers, Takashi - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
Andrew Morton wrote: On Tue, 16 Jan 2007 21:05:20 +0900 [EMAIL PROTECTED] wrote: ... +ext4_ext_replace_branches(struct inode *org_inode, struct inode *dest_inode, + pgoff_t from_page, pgoff_t dest_from_page, + pgoff_t count_page, unsigned long *delete_start) +{ + struct ext4_ext_path *org_path = NULL; + struct ext4_ext_path *dest_path = NULL; + struct ext4_extent *oext, *dext; + struct ext4_extent tmp_ext; + int err = 0; + int depth; + unsigned long from, count, dest_off, diff, replaced_count = 0; These should be sector_t, shouldn't they? At some point should we start using blkcnt_t properly? (block-in[-large]-file, not block-in[-large]-device?) I think that's what it was introduced for, although it's not in wide use at this point. I guess the type really isn't used anywhere else; just in the inode's i_blocks. Hmm. -Eric - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Mon, 5 Feb 2007 14:12:04 +0100 Jan Kara <[EMAIL PROTECTED]> wrote: > > Move the blocks on the temporary inode to the original inode > > by a page. > > 1. Read the file data from the old blocks to the page > > 2. Move the block on the temporary inode to the original inode > > 3. Write the file data on the page into the new blocks > I have one thing - it's probably not good to use page cache for > defragmentation. Then it is no longer online defragmentation. The issues with maintaining correctness and coherency with ongoing VFS activity would be truly ghastly. If we're worried about pagecache pollution then it would be better to control that from userspace via fadvise(). - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Tue, 16 Jan 2007 21:05:20 +0900 [EMAIL PROTECTED] wrote: > Move the blocks on the temporary inode to the original inode > by a page. > 1. Read the file data from the old blocks to the page > 2. Move the block on the temporary inode to the original inode > 3. Write the file data on the page into the new blocks > > ... > > + > +/** > + * ext4_ext_replace_branches - replace original extents with new extents. > + * @org_inodeOriginal inode > + * @dest_inode temporary inode > + * @from_pagePage offset > + * @count_page Page count to be replaced > + * @delete_start block offset for deletion > + * > + * This function returns 0 if succeed, otherwise returns error value. > + * Replace extents for blocks from "from" to "from+count-1". > + */ > +static int > +ext4_ext_replace_branches(struct inode *org_inode, struct inode *dest_inode, > + pgoff_t from_page, pgoff_t dest_from_page, > + pgoff_t count_page, unsigned long *delete_start) > +{ > + struct ext4_ext_path *org_path = NULL; > + struct ext4_ext_path *dest_path = NULL; > + struct ext4_extent *oext, *dext; > + struct ext4_extent tmp_ext; > + int err = 0; > + int depth; > + unsigned long from, count, dest_off, diff, replaced_count = 0; These should be sector_t, shouldn't they? > + handle_t *handle = NULL; > + unsigned jnum; > + > + from = from_page << (PAGE_CACHE_SHIFT - dest_inode->i_blkbits); In which case one needs to be very careful to avoid overflows in expressions such as this one. > + wait_on_page_locked(page); > + lock_page(page); The wait_on_page_locked() is unneeded here. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
Hi Jan, On Mon, 2007-02-05 at 14:12 +0100, Jan Kara wrote: > Hi, > > I'm replying rather late but I've been busy with my PhD thesis lately. > So sorry for that. > > > Move the blocks on the temporary inode to the original inode > > by a page. > > 1. Read the file data from the old blocks to the page > > 2. Move the block on the temporary inode to the original inode > > 3. Write the file data on the page into the new blocks > I have one thing - it's probably not good to use page cache for > defragmentation. As you will read/write tons of data and that will push > out all other data from the cache. I think it may be better to allocate > some separate pages and use them as buffers. The XFS defrag tool uses direct IO to avoid this issue. You may do an even better job with splice(2) in this situation, since its simply read-then-write-straight-back-out without looking at the file data. cheers. -- Nathan - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
Hi, I'm replying rather late but I've been busy with my PhD thesis lately. So sorry for that. > Move the blocks on the temporary inode to the original inode > by a page. > 1. Read the file data from the old blocks to the page > 2. Move the block on the temporary inode to the original inode > 3. Write the file data on the page into the new blocks I have one thing - it's probably not good to use page cache for defragmentation. As you will read/write tons of data and that will push out all other data from the cache. I think it may be better to allocate some separate pages and use them as buffers. Also I'm intending to work on a similar functionality for ext3 (without extents/mballoc). It would be probably good to find some common parts so that they don't have to be duplicated. Honza > Signed-off-by: Takashi Sato <[EMAIL PROTECTED]> > --- > diff -Nrup -X linux-2.6.19-rc6-org/Documentation/dontdiff > linux-2.6.19-rc6-1/fs/ext4/extents.c linux-2.6.19-rc6-FULL/fs/ext4/extents.c > --- linux-2.6.19-rc6-1/fs/ext4/extents.c 2007-01-16 09:46:13.0 > +0900 > +++ linux-2.6.19-rc6-FULL/fs/ext4/extents.c 2007-01-16 09:43:45.0 > +0900 > @@ -2539,6 +2539,470 @@ ext4_ext_next_extent(struct inode *inode > } > > /** > + * ext4_ext_merge_extents - merge new extent to the extent block > + * > + * @handle journal handle > + * @inode target file's inode > + * @org_pathpath indicates first extent to be defraged > + * @o_start first original extent to be defraged > + * @o_end last original extent to be defraged > + * @start_ext first new extent to be merged > + * @new_ext middle of new extent to be merged > + * @end_ext last new extent to be merged > + * @replacedthe number of blocks which will be replaced with new_ext > + * > + * This function returns 0 if succeed, otherwise returns -1. > + */ > +static int > +ext4_ext_merge_extents(handle_t *handle, struct inode *inode, > + struct ext4_ext_path *org_path, > + struct ext4_extent *o_start, struct ext4_extent *o_end, > + struct ext4_extent *start_ext, struct ext4_extent *new_ext, > + struct ext4_extent *end_ext, unsigned long replaced) > +{ > + int i = 0; > + unsigned need_slots; > + unsigned slots_range, len; > + int range_to_move; > + struct ext4_extent_header * eh; > + struct ext4_extent *free_start, *free_end; > + int depth; > + > + /* The extents need to be inserted > + * start_extent + new_extent + end_extent > + */ > + need_slots = (le16_to_cpu(start_ext->ee_len) ? 1 : 0) + > + (le16_to_cpu(end_ext->ee_len) ? 1 : 0) + > + (le16_to_cpu(new_ext->ee_len) ? 1 : 0); > + > + /* The number of slots between start and end */ > + slots_range = o_end - o_start + 1; > + > + /* Range to move the end of extent */ > + range_to_move = need_slots - slots_range; > + depth = org_path->p_depth; > + org_path += depth; > + eh = org_path->p_hdr; > + /* expansion */ > + if (range_to_move > 0) { > + if (range_to_move > le16_to_cpu(eh->eh_max) > + - le16_to_cpu(eh->eh_entries)) { > + printk("Cannot merge extents due to no space\n"); > + return -1; > + } > + } > + if (depth) { > + /* Register to journal */ > + if (ext4_journal_get_write_access(handle, org_path->p_bh)) { > + return -1; > + } > + } > + > + /* Free old blocks > + * dest|---| > + * org |---| > + */ > + free_start = o_start; > + free_end = o_end; > + if (le16_to_cpu(start_ext->ee_len)) { > + if (le16_to_cpu(o_start->ee_len) > + > le16_to_cpu(start_ext->ee_len)) { > + ext4_free_blocks(handle, inode, > + ext_pblock(o_start) > + + le16_to_cpu(start_ext->ee_len), > + le16_to_cpu(o_start->ee_len) > + - le16_to_cpu(start_ext->ee_len), 0); > + } > + free_start++; > + } > + > + /* dest || > + * org |---| > + */ > + if (le16_to_cpu(end_ext->ee_len)) { > + ext4_free_blocks(handle, inode, ext_pblock(o_end), > + le16_to_cpu(o_end->ee_len) > + - le16_to_cpu(end_ext->ee_len), 0); > + free_end--; > + } > + > + /* dest |---| > + * org |-| > + */ > + for (; free_start <= free_end; free_start++) { > + ext4_free_blocks(handle, inode, > + ext_pblock(free_start), > + le32_to_cpu(free_s
[RFC][PATCH 2/3] Move the file data to the new blocks
Move the blocks on the temporary inode to the original inode by a page. 1. Read the file data from the old blocks to the page 2. Move the block on the temporary inode to the original inode 3. Write the file data on the page into the new blocks Signed-off-by: Takashi Sato <[EMAIL PROTECTED]> --- diff -Nrup -X linux-2.6.19-rc6-org/Documentation/dontdiff linux-2.6.19-rc6-1/fs/ext4/extents.c linux-2.6.19-rc6-FULL/fs/ext4/extents.c --- linux-2.6.19-rc6-1/fs/ext4/extents.c2007-01-16 09:46:13.0 +0900 +++ linux-2.6.19-rc6-FULL/fs/ext4/extents.c 2007-01-16 09:43:45.0 +0900 @@ -2539,6 +2539,470 @@ ext4_ext_next_extent(struct inode *inode } /** + * ext4_ext_merge_extents - merge new extent to the extent block + * + * @handle journal handle + * @inode target file's inode + * @org_pathpath indicates first extent to be defraged + * @o_start first original extent to be defraged + * @o_end last original extent to be defraged + * @start_ext first new extent to be merged + * @new_ext middle of new extent to be merged + * @end_ext last new extent to be merged + * @replacedthe number of blocks which will be replaced with new_ext + * + * This function returns 0 if succeed, otherwise returns -1. + */ +static int +ext4_ext_merge_extents(handle_t *handle, struct inode *inode, + struct ext4_ext_path *org_path, + struct ext4_extent *o_start, struct ext4_extent *o_end, + struct ext4_extent *start_ext, struct ext4_extent *new_ext, + struct ext4_extent *end_ext, unsigned long replaced) +{ + int i = 0; + unsigned need_slots; + unsigned slots_range, len; + int range_to_move; + struct ext4_extent_header * eh; + struct ext4_extent *free_start, *free_end; + int depth; + + /* The extents need to be inserted +* start_extent + new_extent + end_extent +*/ + need_slots = (le16_to_cpu(start_ext->ee_len) ? 1 : 0) + + (le16_to_cpu(end_ext->ee_len) ? 1 : 0) + + (le16_to_cpu(new_ext->ee_len) ? 1 : 0); + + /* The number of slots between start and end */ + slots_range = o_end - o_start + 1; + + /* Range to move the end of extent */ + range_to_move = need_slots - slots_range; + depth = org_path->p_depth; + org_path += depth; + eh = org_path->p_hdr; + /* expansion */ + if (range_to_move > 0) { + if (range_to_move > le16_to_cpu(eh->eh_max) + - le16_to_cpu(eh->eh_entries)) { + printk("Cannot merge extents due to no space\n"); + return -1; + } + } + if (depth) { + /* Register to journal */ + if (ext4_journal_get_write_access(handle, org_path->p_bh)) { + return -1; + } + } + + /* Free old blocks +* dest|---| +* org |---| +*/ + free_start = o_start; + free_end = o_end; + if (le16_to_cpu(start_ext->ee_len)) { + if (le16_to_cpu(o_start->ee_len) + > le16_to_cpu(start_ext->ee_len)) { + ext4_free_blocks(handle, inode, + ext_pblock(o_start) + + le16_to_cpu(start_ext->ee_len), + le16_to_cpu(o_start->ee_len) + - le16_to_cpu(start_ext->ee_len), 0); + } + free_start++; + } + + /* dest || +* org |---| +*/ + if (le16_to_cpu(end_ext->ee_len)) { + ext4_free_blocks(handle, inode, ext_pblock(o_end), + le16_to_cpu(o_end->ee_len) + - le16_to_cpu(end_ext->ee_len), 0); + free_end--; + } + + /* dest |---| +* org |-| +*/ + for (; free_start <= free_end; free_start++) { + ext4_free_blocks(handle, inode, + ext_pblock(free_start), + le32_to_cpu(free_start->ee_len), 0); + } + + /* Move the existing extents */ + if (range_to_move && o_end < EXT_LAST_EXTENT(eh)) { + len = EXT_LAST_EXTENT(eh) - (o_end + 1) + 1; + len = len * sizeof(struct ext4_extent); + memmove(o_end + 1 + range_to_move, o_end + 1, len); + } + + /* Insert start entry */ + if (le16_to_cpu(start_ext->ee_len)) { + o_start[i++].ee_len = start_ext->ee_len; + } + + /* Insert new entry */ + if (le16_to_cpu(new_ext->ee_len)) { + o_start[i].ee_block = new_ext->ee_block; + o_start[i].ee_len = cpu_to_le16(replaced); + ext4_
[RFC][PATCH 2/3] Move the file data to the new blocks
Move the blocks on the temporary inode to the original inode by a page. 1. Read the file data from the old blocks to the page 2. Move the block on the temporary inode to the original inode 3. Write the file data on the page into the new blocks Signed-off-by: Takashi Sato <[EMAIL PROTECTED]> --- diff -Nrup -X linux-2.6.19-rc6-1/Documentation/dontdiff linux-2.6.19-rc6-1/fs/ext4/extents.c linux-2.6.19-rc6-full/fs/ext4/extents.c --- linux-2.6.19-rc6-1/fs/ext4/extents.c2006-12-22 14:41:49.0 +0900 +++ linux-2.6.19-rc6-full/fs/ext4/extents.c 2006-12-22 14:56:17.0 +0900 @@ -2518,6 +2518,470 @@ ext4_ext_next_extent(struct inode *inode } /** + * ext4_ext_merge_extents - merge new extent to the extent block + * + * @handle journal handle + * @inode target file's inode + * @org_pathpath indicates first extent to be defraged + * @o_start first original extent to be defraged + * @o_end last original extent to be defraged + * @start_ext first new extent to be merged + * @new_ext middle of new extent to be merged + * @end_ext last new extent to be merged + * @replacedthe number of blocks which will be replaced with new_ext + * + * This function returns 0 if succeed, otherwise returns -1. + */ +static int +ext4_ext_merge_extents(handle_t *handle, struct inode *inode, + struct ext4_ext_path *org_path, + struct ext4_extent *o_start, struct ext4_extent *o_end, + struct ext4_extent *start_ext, struct ext4_extent *new_ext, + struct ext4_extent *end_ext, unsigned long replaced) +{ + int i = 0; + unsigned need_slots; + unsigned slots_range, len; + int range_to_move; + struct ext4_extent_header * eh; + struct ext4_extent *free_start, *free_end; + int depth; + + /* The extents need to be inserted +* start_extent + new_extent + end_extent +*/ + need_slots = (le16_to_cpu(start_ext->ee_len) ? 1 : 0) + + (le16_to_cpu(end_ext->ee_len) ? 1 : 0) + + (le16_to_cpu(new_ext->ee_len) ? 1 : 0); + + /* The number of slots between start and end */ + slots_range = o_end - o_start + 1; + + /* Range to move the end of extent */ + range_to_move = need_slots - slots_range; + depth = org_path->p_depth; + org_path += depth; + eh = org_path->p_hdr; + /* expansion */ + if (range_to_move > 0) { + if (range_to_move > le16_to_cpu(eh->eh_max) + - le16_to_cpu(eh->eh_entries)) { + printk("Cannot merge extents due to no space\n"); + return -1; + } + } + if (depth) { + /* Register to journal */ + if (ext4_journal_get_write_access(handle, org_path->p_bh)) { + return -1; + } + } + + /* Free old blocks +* dest|---| +* org |---| +*/ + free_start = o_start; + free_end = o_end; + if (le16_to_cpu(start_ext->ee_len)) { + if (le16_to_cpu(o_start->ee_len) + > le16_to_cpu(start_ext->ee_len)) { + ext4_free_blocks(handle, inode, + ext_pblock(o_start) + + le16_to_cpu(start_ext->ee_len), + le16_to_cpu(o_start->ee_len) + - le16_to_cpu(start_ext->ee_len), 0); + } + free_start++; + } + + /* dest || +* org |---| +*/ + if (le16_to_cpu(end_ext->ee_len)) { + ext4_free_blocks(handle, inode, ext_pblock(o_end), + le16_to_cpu(o_end->ee_len) + - le16_to_cpu(end_ext->ee_len), 0); + free_end--; + } + + /* dest |---| +* org |-| +*/ + for (; free_start <= free_end; free_start++) { + ext4_free_blocks(handle, inode, + ext_pblock(free_start), + le32_to_cpu(free_start->ee_len), 0); + } + + /* Move the existing extents */ + if (range_to_move && o_end < EXT_LAST_EXTENT(eh)) { + len = EXT_LAST_EXTENT(eh) - (o_end + 1) + 1; + len = len * sizeof(struct ext4_extent); + memmove(o_end + 1 + range_to_move, o_end + 1, len); + } + + /* Insert start entry */ + if (le16_to_cpu(start_ext->ee_len)) { + o_start[i++].ee_len = start_ext->ee_len; + } + + /* Insert new entry */ + if (le16_to_cpu(new_ext->ee_len)) { + o_start[i].ee_block = new_ext->ee_block; + o_start[i].ee_len = cpu_to_le16(replaced); + ext4_ex
[RFC][PATCH 2/3] Move the file data to the new blocks
Move the blocks on the temporary inode to the original inode by a page. 1. Read the file data from the old blocks to the page 2. Move the block on the temporary inode to the original inode 3. Write the file data on the page into the new blocks Signed-off-by: Takashi Sato <[EMAIL PROTECTED]> --- diff -upNr -X linux-2.6.16.8-tmp1/Documentation/dontdiff linux-2.6.16.8-tmp1/fs/ext3/extents.c linux-2.6.16.8-tmp2/fs/ext3/extents.c --- linux-2.6.16.8-tmp1/fs/ext3/extents.c 2006-11-08 22:12:19.0 +0900 +++ linux-2.6.16.8-tmp2/fs/ext3/extents.c 2006-11-08 22:25:07.0 +0900 @@ -2424,6 +2424,421 @@ ext3_ext_next_extent(struct inode *inode } /** + * ext3_ext_merge_extents - merge new extent to the extent block + * + * @handle journal handle + * @inode target file's inode + * @org_path path indicates first extent to be defraged + * @o_startfirst original extent to be defraged + * @o_end last original extent to be defraged + * @start_ext first new extent to be merged + * @new_extmiddle of new extent to be merged + * @end_extlast new extent to be merged + * @replaced the number of blocks which will be replaced with new_ext + * + * This function returns 0 if succeed, otherwise returns -1. + */ +static int +ext3_ext_merge_extents(handle_t *handle, struct inode *inode, + struct ext3_ext_path *org_path, + struct ext3_extent *o_start, struct ext3_extent *o_end, + struct ext3_extent *start_ext, struct ext3_extent *new_ext, + struct ext3_extent *end_ext, unsigned long replaced) +{ + int i = 0; + unsigned need_slots; + unsigned slots_range, len; + int range_to_move; + struct ext3_extent_header * eh; + struct ext3_extent *free_start, *free_end; + int depth; + + /* The extents need to be inserted +* start_extent + new_extent + end_extent +*/ + need_slots = (le16_to_cpu(start_ext->ee_len) ? 1 : 0) + +(le16_to_cpu(end_ext->ee_len) ? 1 : 0) + +(le16_to_cpu(new_ext->ee_len) ? 1 : 0); + + /* The number of slots between start and end */ + slots_range = o_end - o_start + 1; + + /* Range to move the end of extent */ + range_to_move = need_slots - slots_range; + depth = org_path->p_depth; + org_path += depth; + eh = org_path->p_hdr; + /* expansion */ + if (range_to_move > 0) { + if (range_to_move > le16_to_cpu(eh->eh_max) + - le16_to_cpu(eh->eh_entries)) { + printk("Cannot merge extents due to no space\n"); + return -1; + } + } + if (depth) { + /* Register to journal */ + if (ext3_journal_get_write_access(handle, org_path->p_bh)) { + return -1; + } + } + + /* Free old blocks +* dest|---| +* org |---| +*/ + free_start = o_start; + free_end = o_end; + if (le16_to_cpu(start_ext->ee_len)) { + if (le16_to_cpu(o_start->ee_len) + > le16_to_cpu(start_ext->ee_len)) { + ext3_free_blocks(handle, inode, + le32_to_cpu(o_start->ee_start) ++ le16_to_cpu(start_ext->ee_len), + le16_to_cpu(o_start->ee_len) + - le16_to_cpu(start_ext->ee_len), 0); + } + free_start++; + } + + /* dest || +* org |---| +*/ + if (le16_to_cpu(end_ext->ee_len)) { + ext3_free_blocks(handle, inode, le32_to_cpu(o_end->ee_start), + le16_to_cpu(o_end->ee_len) +- le16_to_cpu(end_ext->ee_len), 0); + free_end--; + } + + /* dest |---| +* org |-| +*/ + for (; free_start <= free_end; free_start++) { + ext3_free_blocks(handle, inode, + le32_to_cpu(free_start->ee_start), + le32_to_cpu(free_start->ee_len), 0); + } + + /* Move the existing extents */ + if (range_to_move && o_end < EXT_LAST_EXTENT(eh)) { + len = EXT_LAST_EXTENT(eh) - (o_end + 1) + 1; + len = len * sizeof(struct ext3_extent); + memmove(o_end + 1 + range_to_move, o_end + 1, len); + } + + /* Insert start entry */ + if (le16_to_cpu(start_ext->ee_len)) { + o_start[i++].ee_len = start_ext->ee_len; + } + + /* Insert new entry */ + if (le16_to_cpu(new_ext->ee_len)) { + o_start[i].ee_block = new_ext->ee_block; + o_start[i].ee_len = cpu_to_le16(replac