Re: [PATCH 1/12] ext3 balloc: reset windowsz when full
On Tue, 28 Nov 2006, Mingming Cao wrote: > Port a series ext2 balloc patches from Hugh to ext3/4. The first 6 > patches are against ext3, and the rest are aginst ext4. Thanks for all that, Mingming: whichever is appropriate, all twelve Acked-by: Hugh Dickins <[EMAIL PROTECTED]> or Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> I'll think about your other mails, those that need further thought, later on: I need to pin down more accurately the repetitious sequence of reservations in the mistaken case - maybe it indicated further issues, maybe not; and I need to consider our different views of the my_rsv find_next_usable_block. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 9/12] ext4 balloc: fix off-by-one against rsv_end
-- Subject: ext2 balloc: fix off-by-one against rsv_end From: Hugh Dickins <[EMAIL PROTECTED]> rsv_end is the last block within the reservation, so alloc_new_reservation should accept start_block == rsv_end as success. -- Sync up a ext2 reservation fix in ext4 Signed-Off-By: Mingming Cao <[EMAIL PROTECTED]> --- linux-2.6.19-rc5-cmm/fs/ext4/balloc.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN fs/ext4/balloc.c~ext4-balloc-fix-off-by-one-against-rsv_end fs/ext4/balloc.c --- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4-balloc-fix-off-by-one-against-rsv_end 2006-11-28 19:37:15.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:37:15.0 -0800 @@ -1165,7 +1165,7 @@ retry: * check if the first free block is within the * free space we just reserved */ - if (start_block >= my_rsv->rsv_start && start_block < my_rsv->rsv_end) + if (start_block >= my_rsv->rsv_start && start_block <= my_rsv->rsv_end) return 0; /* success */ /* * if the first free bit we found is out of the reservable space _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/12] ext3 balloc: fix off-by-one against grp_goal
-- Subject: ext2 balloc: fix off-by-one against grp_goal From: Hugh Dickins <[EMAIL PROTECTED]> grp_goal 0 is a genuine goal (unlike -1), so ext2_try_to_allocate_with_rsv should treat it as such. -- Sync up with ext2 reservation fix in ext3 Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- --- linux-2.6.19-rc5-cmm/fs/ext3/balloc.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff -puN fs/ext3/balloc.c~ext3-balloc-fix-off-by-one-against-grp_goal fs/ext3/balloc.c --- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3-balloc-fix-off-by-one-against-grp_goal 2006-11-28 19:36:48.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:36:48.0 -0800 @@ -1271,7 +1271,7 @@ ext3_try_to_allocate_with_rsv(struct sup } /* * grp_goal is a group relative block number (if there is a goal) -* 0 < grp_goal < EXT3_BLOCKS_PER_GROUP(sb) +* 0 <= grp_goal < EXT3_BLOCKS_PER_GROUP(sb) * first block is a filesystem wide block number * first block is the block number of the first block in this group */ @@ -1307,7 +1307,7 @@ ext3_try_to_allocate_with_rsv(struct sup if (!goal_in_my_reservation(&my_rsv->rsv_window, grp_goal, group, sb)) grp_goal = -1; - } else if (grp_goal > 0) { + } else if (grp_goal >= 0) { int curr = my_rsv->rsv_end - (grp_goal + group_first_block) + 1; _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/12] ext2 balloc: fix _with_rsv freeze
Sync up a reservation fix from ext2 in ext3 -- Subject: ext2 balloc: fix _with_rsv freeze From: Hugh Dickins <[EMAIL PROTECTED]> After several days of testing ext2 with reservations, it got caught inside ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding on the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to find the free block guaranteed to be included (unless there's contention). Fix the range to find_next_usable_block's memscan: the scan from "here" (0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes not 2 (the relevant bytes of bitmap in this case being f7 df ff - none 00, but the premature cutoff implying that the last was found 00). Is this a problem for mainline ext2? No, because the "size" in its memscan is always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a multiple of 8. Is this a problem for ext3 or ext4? No, because they have an additional extN_test_allocatable test which rescues them from the error. -- Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- linux-2.6.19-rc5-cmm/fs/ext3/balloc.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN fs/ext3/balloc.c~ext3-balloc-fix-_with_rsv-freeze fs/ext3/balloc.c --- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3-balloc-fix-_with_rsv-freeze 2006-11-28 19:36:55.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:36:55.0 -0800 @@ -730,7 +730,7 @@ find_next_usable_block(ext3_grpblk_t sta here = 0; p = ((char *)bh->b_data) + (here >> 3); - r = memscan(p, 0, (maxblocks - here + 7) >> 3); + r = memscan(p, 0, ((maxblocks + 7) >> 3) - (here >> 3)); next = (r - ((char *)bh->b_data)) << 3; if (next < maxblocks && next >= start && ext3_test_allocatable(next, bh)) _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 12/12] ext3 balloc: fix _with_rsv freeze
-- Subject: ext2 balloc: fix _with_rsv freeze From: Hugh Dickins <[EMAIL PROTECTED]> After several days of testing ext2 with reservations, it got caught inside ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding on the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to find the free block guaranteed to be included (unless there's contention). Fix the range to find_next_usable_block's memscan: the scan from "here" (0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes not 2 (the relevant bytes of bitmap in this case being f7 df ff - none 00, but the premature cutoff implying that the last was found 00). Is this a problem for mainline ext2? No, because the "size" in its memscan is always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a multiple of 8. Is this a problem for ext3 or ext4? No, because they have an additional extN_test_allocatable test which rescues them from the error. -- Sync up a reservation fix from ext2 in ext4 Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- linux-2.6.19-rc5-cmm/fs/ext4/balloc.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN fs/ext4/balloc.c~ext4-balloc-fix-_with_rsv-freeze fs/ext4/balloc.c --- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4-balloc-fix-_with_rsv-freeze 2006-11-28 19:37:12.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:37:12.0 -0800 @@ -747,7 +747,7 @@ find_next_usable_block(ext4_grpblk_t sta here = 0; p = ((char *)bh->b_data) + (here >> 3); - r = memscan(p, 0, (maxblocks - here + 7) >> 3); + r = memscan(p, 0, ((maxblocks + 7) >> 3 - (here >> 3)); next = (r - ((char *)bh->b_data)) << 3; if (next < maxblocks && next >= start && ext4_test_allocatable(next, bh)) _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 8/12] ext4 balloc: fix off-by-one against grp_goal
Subject: ext2 balloc: fix off-by-one against grp_goal From: Hugh Dickins <[EMAIL PROTECTED]> grp_goal 0 is a genuine goal (unlike -1), so ext2_try_to_allocate_with_rsv should treat it as such. -- Sync up with ext2 reservation fix in ext4 Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- --- linux-2.6.19-rc5-cmm/fs/ext4/balloc.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff -puN fs/ext4/balloc.c~ext4-balloc-fix-off-by-one-against-grp_goal fs/ext4/balloc.c --- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4-balloc-fix-off-by-one-against-grp_goal 2006-11-28 19:37:05.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:37:05.0 -0800 @@ -1288,7 +1288,7 @@ ext4_try_to_allocate_with_rsv(struct sup } /* * grp_goal is a group relative block number (if there is a goal) -* 0 < grp_goal < EXT4_BLOCKS_PER_GROUP(sb) +* 0 <= grp_goal < EXT4_BLOCKS_PER_GROUP(sb) * first block is a filesystem wide block number * first block is the block number of the first block in this group */ @@ -1324,7 +1324,7 @@ ext4_try_to_allocate_with_rsv(struct sup if (!goal_in_my_reservation(&my_rsv->rsv_window, grp_goal, group, sb)) grp_goal = -1; - } else if (grp_goal > 0) { + } else if (grp_goal >= 0) { int curr = my_rsv->rsv_end - (grp_goal + group_first_block) + 1; _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/12] ext4 balloc: use io_error label
-- Subject: ext2 balloc: use io_error label From: Hugh Dickins <[EMAIL PROTECTED]> ext2_new_blocks has a nice io_error label for setting -EIO, so goto that in the one place that doesn't already use it. -- Fix it in ext4_new_blocks. Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- linux-2.6.19-rc5-cmm/fs/ext4/balloc.c |6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff -puN fs/ext4/balloc.c~ext4-balloc-use-io_error-label fs/ext4/balloc.c --- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4-balloc-use-io_error-label 2006-11-28 19:42:45.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:43:21.0 -0800 @@ -1529,10 +1529,8 @@ retry_alloc: if (group_no >= ngroups) group_no = 0; gdp = ext4_get_group_desc(sb, group_no, &gdp_bh); - if (!gdp) { - *errp = -EIO; - goto out; - } + if (!gdp) + goto io_error; free_blocks = le16_to_cpu(gdp->bg_free_blocks_count); /* * skip this group if the number of _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 10/12] ext4 balloc: say rb_entry not list_entry
-- Subject: ext2 balloc: say rb_entry not list_entry From: Hugh Dickins <[EMAIL PROTECTED]> The reservations tree is an rb_tree not a list, so it's less confusing to use rb_entry() than list_entry() - though they're both just container_of(). -- Sync up this fix in ext4 Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- --- linux-2.6.19-rc5-cmm/fs/ext4/balloc.c |6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff -puN fs/ext4/balloc.c~ext4-balloc-say-rb_entry-not-list_entry fs/ext4/balloc.c --- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4-balloc-say-rb_entry-not-list_entry 2006-11-28 19:37:08.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:37:08.0 -0800 @@ -165,7 +165,7 @@ restart: printk("Block Allocation Reservation Windows Map (%s):\n", fn); while (n) { - rsv = list_entry(n, struct ext4_reserve_window_node, rsv_node); + rsv = rb_entry(n, struct ext4_reserve_window_node, rsv_node); if (verbose) printk("reservation window 0x%p " "start: %llu, end: %llu\n", @@ -966,7 +966,7 @@ static int find_next_reservable_window( prev = rsv; next = rb_next(&rsv->rsv_node); - rsv = list_entry(next,struct ext4_reserve_window_node,rsv_node); + rsv = rb_entry(next,struct ext4_reserve_window_node,rsv_node); /* * Reached the last reservation, we can just append to the @@ -1210,7 +1210,7 @@ static void try_to_extend_reservation(st if (!next) my_rsv->rsv_end += size; else { - next_rsv = list_entry(next, struct ext4_reserve_window_node, rsv_node); + next_rsv = rb_entry(next, struct ext4_reserve_window_node, rsv_node); if ((next_rsv->rsv_start - my_rsv->rsv_end - 1) >= size) my_rsv->rsv_end += size; _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/12] ext4 balloc: reset windowsz when full
-- Subject: ext2 balloc: reset windowsz when full From: Hugh Dickins <[EMAIL PROTECTED]> ext2_new_blocks should reset the reservation window size to 0 when squeezing the last blocks out of an almost full filesystem, so the retry doesn't skip any groups with less than half that free, reporting ENOSPC too soon. -- Sync up reservation fix from ext2 in ext4 Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- --- linux-2.6.19-rc5-cmm/fs/ext4/balloc.c |1 + 1 file changed, 1 insertion(+) diff -puN fs/ext4/balloc.c~ext4_reset_windowsz_in_full_fs fs/ext4/balloc.c --- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4_reset_windowsz_in_full_fs 2006-11-28 19:37:01.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:37:01.0 -0800 @@ -1566,6 +1566,7 @@ retry_alloc: */ if (my_rsv) { my_rsv = NULL; + windowsz = 0; group_no = goal_group; goto retry_alloc; } _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/12] ext3 balloc: use io_error label
-- Subject: ext2 balloc: use io_error label From: Hugh Dickins <[EMAIL PROTECTED]> ext2_new_blocks has a nice io_error label for setting -EIO, so goto that in the one place that doesn't already use it. -- Fix it in ext3_new_blocks. Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- linux-2.6.19-rc5-cmm/fs/ext3/balloc.c |6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff -puN fs/ext3/balloc.c~ext3-balloc-use-io_error-label fs/ext3/balloc.c --- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3-balloc-use-io_error-label 2006-11-28 19:45:51.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:45:51.0 -0800 @@ -1515,10 +1515,8 @@ retry_alloc: if (group_no >= ngroups) group_no = 0; gdp = ext3_get_group_desc(sb, group_no, &gdp_bh); - if (!gdp) { - *errp = -EIO; - goto out; - } + if (!gdp) + goto io_error; free_blocks = le16_to_cpu(gdp->bg_free_blocks_count); /* * skip this group if the number of _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/12] ext3 balloc: fix off-by-one against rsv_end
-- Subject: ext2 balloc: fix off-by-one against rsv_end From: Hugh Dickins <[EMAIL PROTECTED]> rsv_end is the last block within the reservation, so alloc_new_reservation should accept start_block == rsv_end as success. -- Sync up a ext2 reservation fix in ext3 Signed-Off-By: Mingming Cao <[EMAIL PROTECTED]> --- linux-2.6.19-rc5-cmm/fs/ext3/balloc.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN fs/ext3/balloc.c~ext3-balloc-fix-off-by-one-against-rsv_end fs/ext3/balloc.c --- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3-balloc-fix-off-by-one-against-rsv_end 2006-11-28 19:36:58.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:36:58.0 -0800 @@ -1148,7 +1148,7 @@ retry: * check if the first free block is within the * free space we just reserved */ - if (start_block >= my_rsv->rsv_start && start_block < my_rsv->rsv_end) + if (start_block >= my_rsv->rsv_start && start_block <= my_rsv->rsv_end) return 0; /* success */ /* * if the first free bit we found is out of the reservable space _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/12] ext3 balloc: say rb_entry not list_entry
-- Subject: ext2 balloc: say rb_entry not list_entry From: Hugh Dickins <[EMAIL PROTECTED]> The reservations tree is an rb_tree not a list, so it's less confusing to use rb_entry() than list_entry() - though they're both just container_of(). -- Sync up this fix in ext3 Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- --- linux-2.6.19-rc5-cmm/fs/ext3/balloc.c |6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff -puN fs/ext3/balloc.c~ext3-balloc-say-rb_entry-not-list_entry fs/ext3/balloc.c --- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3-balloc-say-rb_entry-not-list_entry 2006-11-28 19:36:52.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:36:52.0 -0800 @@ -144,7 +144,7 @@ restart: printk("Block Allocation Reservation Windows Map (%s):\n", fn); while (n) { - rsv = list_entry(n, struct ext3_reserve_window_node, rsv_node); + rsv = rb_entry(n, struct ext3_reserve_window_node, rsv_node); if (verbose) printk("reservation window 0x%p " "start: %lu, end: %lu\n", @@ -949,7 +949,7 @@ static int find_next_reservable_window( prev = rsv; next = rb_next(&rsv->rsv_node); - rsv = list_entry(next,struct ext3_reserve_window_node,rsv_node); + rsv = rb_entry(next,struct ext3_reserve_window_node,rsv_node); /* * Reached the last reservation, we can just append to the @@ -1193,7 +1193,7 @@ static void try_to_extend_reservation(st if (!next) my_rsv->rsv_end += size; else { - next_rsv = list_entry(next, struct ext3_reserve_window_node, rsv_node); + next_rsv = rb_entry(next, struct ext3_reserve_window_node, rsv_node); if ((next_rsv->rsv_start - my_rsv->rsv_end - 1) >= size) my_rsv->rsv_end += size; _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/12] ext3 balloc: reset windowsz when full
Port a series ext2 balloc patches from Hugh to ext3/4. The first 6 patches are against ext3, and the rest are aginst ext4. -- Subject: ext2 balloc: reset windowsz when full From: Hugh Dickins <[EMAIL PROTECTED]> ext2_new_blocks should reset the reservation window size to 0 when squeezing the last blocks out of an almost full filesystem, so the retry doesn't skip any groups with less than half that free, reporting ENOSPC too soon. -- Sync up reservation fix from ext2 Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- --- linux-2.6.19-rc5-cmm/fs/ext3/balloc.c |1 + 1 file changed, 1 insertion(+) diff -puN fs/ext3/balloc.c~ext3_reset_windowsz_in_full_fs fs/ext3/balloc.c --- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3_reset_windowsz_in_full_fs 2006-11-28 19:36:41.0 -0800 +++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:36:41.0 -0800 @@ -1552,6 +1552,7 @@ retry_alloc: */ if (my_rsv) { my_rsv = NULL; + windowsz = 0; group_no = goal_group; goto retry_alloc; } _ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: a question about ext4_fsblk_t
On Sun, 2006-10-22 at 16:50 +0800, guomingyang wrote: > >I also find many places where the block number type is not changed in > namei.c and dir.c. Why? They are all file logical blocks, ext4_fsblk_t is for on disk blocks. Takashi Sato has a patch to define all file logical blocks as ext3_fileblk_t, it did not make to mainline before ext4 was forked. Probably we should bring the to ext3/4 to clarify the confusions. Mingming >Thank you to anyone who offer help > >Hello everyone: > > > > I am a student interesting in linux filesystem, I have a > > problem about the replace of block number type to ext4_fsblk_t. Is it > > complete? Or has it passed all the test? Because I have found a place like > > this in linux-2.6.19-rc2/fs/ext4/inode.c > > > > > > > >+struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode, > >+ int block, int create, int *err) > > > > > >the block's type is not changed. Although I think it will bring no mistake, > >I doubt about why not change it. Thank you ! > > > > > >guomingyang > > > >- > >To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > >the body of a message to [EMAIL PROTECTED] > >More majordomo info at http://vger.kernel.org/majordomo-info.html > > guomingyang > > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] ext2 balloc: fix _with_rsv freeze
On Tue, 2006-11-28 at 20:07 +, Hugh Dickins wrote: > On Tue, 28 Nov 2006, Mingming Cao wrote: > > On Tue, 2006-11-28 at 17:40 +, Hugh Dickins wrote: > > > After several days of testing ext2 with reservations, it got caught inside > > > ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding > > > on the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to > > > find the free block guaranteed to be included (unless there's contention). > > > > > > > Hmm, I suspect there is other issue: alloc_new_reservation should not > > repeatedly allocating the same window, if ext2_try_to_allocate > > repeatedly fails to find a free block in that window. > > find_next_reservable_window() takes my_rsv (the old window that he > > thinks there is no free block) as a guide to find a window "after" the > > end block of my_rsv, so how could this happen? > > Hmmm. I haven't studied that part of the code, but what you say sounds > sensible: that would leave more to be explained, yes. I guess it would > happen if all the rest of the bitmap were either allocated or reserved, But bitmap_search_next_usable_block() will fail in the case the rest of bitmap were allocated, and find_next_reservable_space() will fail in the case the rest of group were all reserved. alloc_new_reservation() should not create a new window in this case. > but I don't believe that was the case here: I have noted that the map > was all 00s from offset 0x1ae onwards, plenty unallocated; I've not > recorded the following reservations, but it seems unlikely they covered > the remaining free area (and still covered it even when the remaining > tasks got to the point of just waiting for this one). > > > > > > Fix the range to find_next_usable_block's memscan: the scan from "here" > > > (0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes > > > not 2 (the relevant bytes of bitmap in this case being f7 df ff - none > > > 00, but the premature cutoff implying that the last was found 00). > > > > > > > alloc_new_reservation() reserved a window with free block, when come to > > the time to claim it, it scans the window again. So it seems that the > > range of the the scan is too small: > > The range of the scan is 1 byte too small in this case, yes. > > > > > p = ((char *)bh->b_data) + (here >> 3); > > r = memscan(p, 0, (maxblocks - here + 7) >> 3); > > next = (r - ((char *)bh->b_data)) << 3; > > > > -> next is -1 > > I don't understand you: next was not -1, it was 0xd08. > > > if (next < maxblocks && next >= here) > > return next; > > > > --> falls to false branch > > No, it passed the "next < maxblocks && next >= here" test > (maxblocks being 0xd0e and here being 0xcfe), so returned > pointing to an allocated block - then the caller finds it > cannot set the bit. > Apologies for the confusion. I thought ext2_try_to_allocate() failed because we could not find a free block in the reserved window (i.e., find_next_usable_block() failed) It seems in this case, find_next_usable_block() incorrectly returns a bit it *thinks* free, but ext2_try_to_allocate() fails to claim it as it's being marked as used. So yes, Acked this fix. Thanks. > > > > here = bitmap_search_next_usable_block(here, bh, maxblocks); > > return here; > > > > So we failed to find a free byte in the range. That's seems fine to me. > > It's only a nice thing to have -- try to allocate a block in a place > > where it's neighbors are all free also. If it fails, it will search the > > window bit by bit. So I don't understand why it is not being recovered > > by bitmap_search_next_usable_block(), which test the bitmap bit by bit? > > It already returned, it doesn't reach that line. > Yep. > > > > > Is this a problem for mainline ext2? No, because the "size" in its > > > memscan > > > is always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a > > > multiple of 8. Is this a problem for ext3 or ext4? No, because they have > > > an additional extN_test_allocatable test which rescues them from the > > > error. > > > > > Hmm, if the error is it prematurely think there is no free block in the > > range (bitmap on disk), then even in ext3/4, it will not bother checking > > the jbd copy of the bitmap. I am not sure this is the cause that ext3/4 > > may not has the problem. > > In the ext3/4 case, it indeed won't bother to check the jbd copy > (having found this bitmap bit set), it'll fall through to the > bitmap_search_next_usable_block you indicated above, > and that should do the right thing, finding the first > free bit in the area originally reserved. > Make sense. > > > > > But the bigger question is, why does the my_rsv case come here to > > > find_next_usable_block at all? > > > > Because grp_goal is -1? > > Well, yes, but my point is that we've got a reservation, and we're > hoping to al
Re: Boot failure with ext2 and initrds
On Tue, 2006-11-28 at 14:33 -0800, Andrew Morton wrote: > On Tue, 28 Nov 2006 13:04:53 -0800 > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > Thanks, I have acked most of them, and will port them to ext3/4 soon. > > You've acked #2 and #3. #4, #5 and #6 remain un-commented-upon and #1 is > unclear? sorry, just acked #4, #5 and #6. I mean to do that before I said so but they did not go out of my outbox till now ( I was out for a few hours). #1 looks correct to me now, just thought there are other issues. I will comment on that one in that thread. Mingming - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/6] ext2 balloc: say rb_entry not list_entry
On Tue, 2006-11-28 at 17:43 +, Hugh Dickins wrote: > The reservations tree is an rb_tree not a list, so it's less confusing to > use rb_entry() than list_entry() - though they're both just container_of(). > > Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> > --- > Acked. Thanks, Mingming > fs/ext2/balloc.c |6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.0 + > +++ linux/fs/ext2/balloc.c2006-11-27 19:28:41.0 + > @@ -156,7 +156,7 @@ restart: > > printk("Block Allocation Reservation Windows Map (%s):\n", fn); > while (n) { > - rsv = list_entry(n, struct ext2_reserve_window_node, rsv_node); > + rsv = rb_entry(n, struct ext2_reserve_window_node, rsv_node); > if (verbose) > printk("reservation window 0x%p " > "start: %lu, end: %lu\n", > @@ -753,7 +753,7 @@ static int find_next_reservable_window( > > prev = rsv; > next = rb_next(&rsv->rsv_node); > - rsv = list_entry(next,struct ext2_reserve_window_node,rsv_node); > + rsv = rb_entry(next,struct ext2_reserve_window_node,rsv_node); > > /* >* Reached the last reservation, we can just append to the > @@ -995,7 +995,7 @@ static void try_to_extend_reservation(st > if (!next) > my_rsv->rsv_end += size; > else { > - next_rsv = list_entry(next, struct ext2_reserve_window_node, > rsv_node); > + next_rsv = rb_entry(next, struct ext2_reserve_window_node, > rsv_node); > > if ((next_rsv->rsv_start - my_rsv->rsv_end - 1) >= size) > my_rsv->rsv_end += size; - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/6] ext2 balloc: use io_error label
On Tue, 2006-11-28 at 17:44 +, Hugh Dickins wrote: > ext2_new_blocks has a nice io_error label for setting -EIO, > so goto that in the one place that doesn't already use it. > > Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> > --- > Acked, Mingming > fs/ext2/balloc.c |7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.0 + > +++ linux/fs/ext2/balloc.c2006-11-27 19:28:41.0 + > @@ -1258,10 +1258,9 @@ retry_alloc: > if (group_no >= ngroups) > group_no = 0; > gdp = ext2_get_group_desc(sb, group_no, &gdp_bh); > - if (!gdp) { > - *errp = -EIO; > - goto out; > - } > + if (!gdp) > + goto io_error; > + > free_blocks = le16_to_cpu(gdp->bg_free_blocks_count); > /* >* skip this group if the number of > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] ext2 balloc: fix off-by-one against grp_goal
On Tue, 2006-11-28 at 17:42 +, Hugh Dickins wrote: > grp_goal 0 is a genuine goal (unlike -1), > so ext2_try_to_allocate_with_rsv should treat it as such. > > Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> Acked. Thanks, Mingming > --- > > fs/ext2/balloc.c |4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.0 + > +++ linux/fs/ext2/balloc.c2006-11-27 19:28:41.0 + > @@ -1053,7 +1053,7 @@ ext2_try_to_allocate_with_rsv(struct sup > } > /* >* grp_goal is a group relative block number (if there is a goal) > - * 0 < grp_goal < EXT2_BLOCKS_PER_GROUP(sb) > + * 0 <= grp_goal < EXT2_BLOCKS_PER_GROUP(sb) >* first block is a filesystem wide block number >* first block is the block number of the first block in this group >*/ > @@ -1089,7 +1089,7 @@ ext2_try_to_allocate_with_rsv(struct sup > if (!goal_in_my_reservation(&my_rsv->rsv_window, > grp_goal, group, sb)) > grp_goal = -1; > - } else if (grp_goal > 0) { > + } else if (grp_goal >= 0) { > int curr = my_rsv->rsv_end - > (grp_goal + group_first_block) + 1; > - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Boot failure with ext2 and initrds
On Tue, 28 Nov 2006 13:04:53 -0800 Mingming Cao <[EMAIL PROTECTED]> wrote: > Thanks, I have acked most of them, and will port them to ext3/4 soon. You've acked #2 and #3. #4, #5 and #6 remain un-commented-upon and #1 is unclear? - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC][PATCH 1/1] Comprehensive Secure Deletion for ext4
Hello All, As promised a few weeks ago, we have composed a complete patch to add secure deletion functionality in ext4. With this patch, we can ensure that data, meta-data and the filename of any file marked with the special attribute "s", are overwritten right after it is deleted. The patch performs N overwrites with configurable patterns. It can handle separate overwrite configurations for each of the data, meta-data and directory entry. In order to handle partial file truncations, our patch will intercept truncate requests in addition to unlink requests (something that user-mode programs such as shred cannot handle). Moreover, to make the data overwrites and file truncation an atomic operation, we add the file to an orphan list right before we begin to overwrite and remove it from the orphan list after truncation has finished. The patch works in all journaling modes. As far as the journal is concerned, we leave it untouched. There are a few reasons behind this approach. First, performance will be noticeably hindered. This is because the journal will have to postpone all further transactions until the necessary blocks are overwritten - which will take time. Secondly, by the time we want to overwrite the necessary blocks, the journal will have already done this for us by committing all further transactions, especially in full journaling mode. That said, if the community will appear to also want secure-deletion of journal records, then we can add that (probably as a separate feature one can turn on independently). Secure deletion arguments during mount time are of the form: mount -t ext4dev -o secdel="args" [dev] [mnt] where args is a 6-way list delimited by colons as such: secdel="N1:S1:N2:S2:N3:S3" N1,N2,N3 denote the number of overwrites for data,meta-data and the filename respectively. S1,S2,S3 denote the character sequence used to overwrite data,meta-data and the filename respectively. Random character sequences are denoted by 'R'. Any item that is left empty is not securely deleted. In addition, if one wants to secure-delete all portions of a file, an abbreviated format can be used as such: secdel="N:S" N denotes the number of overwrites chosen S denotes the character sequence to use for overwriting For example, the mount option: secdel="2:Rc0:::1:\0" Means that data will be overwritten once using a random character, then once using the ASCII character 'c', and finally once more using the ASCII character 0. Finally this whole sequence will be written 2 times. Meta-data will not be touched and the filename will be overwritten 1 time with the value zero. Any comments, help or feedback will be greatly appreciated. Thank you, Harry Papaxenopoulos, Nikolai Joukov, and Erez Zadok -- File systems and Storage Laboratory Stony Brook University -- Signed-off-by: Harry Papaxenopoulos <[EMAIL PROTECTED]> Signed-off-by: Nikolai Joukov <[EMAIL PROTECTED]> Signed-off-by: Erez Zadok <[EMAIL PROTECTED]> Signed-off-by: Jeff Sipek <[EMAIL PROTECTED]> diff -Naur 2.6.19-rc6/fs/ext4/inode.c 2.6.19-rc6sd/fs/ext4/inode.c --- 2.6.19-rc6/fs/ext4/inode.c 2006-11-28 16:08:08.0 -0500 +++ 2.6.19-rc6sd/fs/ext4/inode.c2006-11-28 16:09:10.0 -0500 @@ -37,8 +37,10 @@ #include #include #include +#include #include "xattr.h" #include "acl.h" +#include "secrm.h" /* * Test whether an inode is a fast symlink. @@ -201,8 +203,19 @@ if (IS_SYNC(inode)) handle->h_sync = 1; inode->i_size = 0; +#ifndef CONFIG_EXT4DEV_FS_SECRM if (inode->i_blocks) ext4_truncate(inode); +#else + if (EXT4_I(inode)->i_flags & EXT4_SECRM_FL) + ext4_journal_stop(handle); + if (inode->i_blocks) + ext4_truncate(inode); + if (EXT4_I(inode)->i_flags & EXT4_SECRM_FL) { + ext4_secrm_metadata(inode); + handle = start_transaction(inode); + } +#endif /* * Kill off the orphan record which ext4_truncate created. * AKPM: I think this can be inside the above `if'. @@ -1082,6 +1095,272 @@ return NULL; } +#ifdef CONFIG_EXT4DEV_FS_SECRM +static int ext4_ovwt_blocks(handle_t *handle, struct inode *inode, char param, + sector_t first, sector_t last) +{ + int retval = 0; + sector_t sector; + sector_t batch = DIO_CREDITS; + struct buffer_head *bh = NULL; + struct buffer_head tmp; + struct super_block *sb; + J_ASSERT(handle != NULL); + + tmp.b_state = 0; + tmp.b_blocknr = 0; + sb = inode->i_sb; + + J_ASSERT( last >= first ); + + if (IS_ERR(handle)) { + retval = PTR_ERR(handle); + goto out; + } + + for (sector = first; sector <= last; sector++) { + if (handle->h_buffer_credits <= EXT4_RESERVE_TRANS_BLOCKS) { +
Re: Boot failure with ext2 and initrds
On Tue, 2006-11-28 at 17:38 +, Hugh Dickins wrote: > > > > And could you check the start and end block for every rsv_window_add and > > rsv_window_remove, to see if it was keep creating and removing the same > > window in the same block group? > > The same every time it settled on a usable reservation, but a lot of > wasted adds and removes as it works across a fully allocated area of > the bitmap. Not very efficient, all those rb_tree insertions and > removals until it gets to a free area; but I can't judge from this > example how common that would be, may not be worth bothering about. > Yeah, it's not so efficient to add a window before knowing it has a free block, as rb tree insertion is quit expensive. Actually it used to be this way: only insert a window to the rb tree when there is a free bit inside of it. However we need to hold the per-fs reservation lock while scanning the bitmap:( This badly hurt the performance in some case and the real time folks complained about it. So we changed the code to the current way. I think it was not that inefficient in the case it works across a fully allocated/reserved area, since bitmap_search_next_usable_block() will skip the fully allocated area fairly quickly, as it searchs from the beginning of the window till the last block of the block group (not just the end of the window) > > > > And it might be useful to dump the whole filesystem block reservation > > tree as well. > > Not necessary, I've put just the relevant numbers in the patch comment, > it helped me a lot to see the actual numbers and work it out from there. > > > > > Not sure if it worth the effort to test it on ext3. The ext2 block > > reservation code in 2.6.19-rc5-mm2 looks pretty much the same as ext3/4, > > except the missing truncate_mutex. I understand this might take a few > > days to reproduce, so this might not needed now. > > Turns out vanilla-ext2 and ext3 and ext4 are safe: > ext3 and ext4 slightly wrong in the same way, but safe. > Good to know. > I'll continue this thread with the bugfix patch 1/6; and five more > (roughly descending order of importance, the latter just cosmetic) > little fs/ext2/balloc.c patches to things I noticed on the way. > Nothing very worrying. Easier to send patches than ask questions: > please check, perhaps my "off-by-one" accusations are wrong, for > example. If you're satisfied they're right, please apply the > same to ext3 and ext4. > Thanks, I have acked most of them, and will port them to ext3/4 soon. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
xfs preallocation writeup, for comparison
As promised, here is a writeup of xfs preallocation routines. I don't hold these up as the perfect or best way to do this task, but it is worth looking at what has been done before, to get ideas, find better ways, and avoid pitfalls for ext4. XFS preallocation interfaces. = The xfs preallocation interfaces are described in the xfsctl(3) manpage. It's not the best doc, so I'll summarize: XFS has these ioctl calls for space managment of files: XFS_IOC_ALLOCSP XFS_IOC_FREESP XFS_IOC_RESVSP XFS_IOC_UNRESVSP All of these interfaces take an flock-style argument, and you use it to specify the range of bytes in the file which should be preallocated, essentially with an offset and a length. The real work for all of this is done in xfs_change_file_space() in xfs_vnodeops.c The main difference between resvsp and allocsp is that resvsp marks the blocks as "unwritten" meaning that they are allocated but not yet written to, and if they are read, they will return zeros. allocsp actually writes zeros into the allocated blocks. We can use the xfs_io tool to demonstrate. resvsp example: == [EMAIL PROTECTED] test]# touch resvsp [EMAIL PROTECTED] test]# xfs_io resvsp xfs_io> resvsp 0 10g xfs_io> bmap -vp resvsp: EXT: FILE-OFFSET BLOCK-RANGEAG AG-OFFSET TOTAL FLAGS 0: [0..16657327]:16657456..33314783 1 (64..16657391) 16657328 1 1: [16657328..20971519]: 96..4314287 0 (96..4314287) 4314192 1 so we got 2 extents for this 10g file - those are actual filesystem blocks allocated. The file is 0 length, but is using 10g of blocks: [EMAIL PROTECTED] test]# ls -lh resvsp -rw-r--r-- 1 root root 0 Nov 28 14:11 resvsp [EMAIL PROTECTED] test]# du -hc resvsp 10G resvsp 10G total The extents are simply flagged as unwritten (0x1 above), so very little IO occurs and the space reservation is fast.. allocsp example: === (note there's a bit of a buglet in xfs_io, hence the swapped arguments) [EMAIL PROTECTED] test]# touch allocsp [EMAIL PROTECTED] test]# xfs_io allocsp xfs_io> allocsp 10g 0 xfs_io> bmap -vp allocsp: EXT: FILE-OFFSET BLOCK-RANGEAG AG-OFFSET TOTAL 0: [0..16657327]:33314848..49972175 2 (64..16657391) 16657328 1: [16657328..20971519]: 4314288..86284790 (4314288..8628479) 4314192 We also got 2 extents here, but they are not flagged as unwritten - those filesystem blocks were all actually filled with zeros. [EMAIL PROTECTED] test]# ls -lh allocsp -rw-r--r-- 1 root root 10G Nov 28 14:19 allocsp [EMAIL PROTECTED] test]# du -hc allocsp 10G allocsp 10G total It would be very nice to see posix_fallocate hooked up to the underlying filesystem, so that it can make smart decisions about how to efficiently reserve space... -Eric - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
xfs defragmentation writeup, for comparison
As promised, here is a writeup of xfs defragmentation routines. I don't hold these up as the perfect or best way to do this task, but it is worth looking at what has been done before, to get ideas, find better ways, and avoid pitfalls for ext4. XFS defragmentation interface. = xfs uses the xfs_fsr tool, found in the xfsdump (!) package, to defragment files on the filesystem. It has a few features for defragmenting a whole filesystem, starting/stopping, etc, but I'll just sketch out how it defragments a single file. The xfs preallocation routines are central to this; fsr uses preallocation to create the less-fragmented space for the file in question. The 10,000 foot overview is: 1. create a new temporary file - open & unlink 2. preallocate space for the new file to match the file to be defragmented 3. see if we got fewer extents than the original 4. do an O_DIRECT data copy into the new extents 5. call the kernel to swap the extents between the two, with sanity checks 6. close unlinked temporary file which now contains the fragmented extents. userspace work is done in xfsdump/fsr/xfs_fsr.c kernelspace work is done in fs/xfs/xfs_dfrag.c In more detail... in userspace, fsrfile_common() / packfile(): check for mandatory locks, skip this file if present make sure there is room to copy the file get inode attributes (ext2-style, append-only, immutable, etc) skip if immutable, append-only, or no-defrag set get the current extent layout of the file (XFS_IOC_GETBMAP) stop if already best nr. of extents open the temp file and immediately unlink it set extended attributes on the temp file set other extended inode flags on the temp file set up buffers for direct IO loop through original block map, preallocating extents for tmp file (this preserves holes as well) double check that we have fewer extents now loop through the block map, copying into temp file via O_DIRECT truncate temp file to proper size (O_DIRECT alignment may have made it larger) switch to file owner's UID/GID to preserve quota information set up swapext ioctl to swap extents call kernel to swap extents between original & temp files: typedef struct xfs_swapext { __int64_t sx_version; /* swapext version */ __int64_t sx_fdtarget;/* fd of target file */ __int64_t sx_fdtmp; /* fd of tmp file */ xfs_off_t sx_offset; /* offset into file */ xfs_off_t sx_length; /* leng from offset */ charsx_pad[16]; /* pad space, unused */ xfs_bstat_t sx_stat;/* stat of target b4 copy */ } xfs_swapext_t; now in kernelspace, xfs_swapext() / xfs_swap_extents(): verify both files on same filesystem verify that inode numbers differ abort if filesystem is shut down lock the inodes (ilock, ilock...) check permissions on the files verify that they both have the same format/type (S_IFMT, realtime, etc) if temp file is cached, flush it verify size of both files match verify both files have extended attributes (or not) compare change & modify times with what was passed in abort if they differ, file was changed before locking abort if the original file is memory-mapped set up transaction swap the data forks of the inodes, fix up on-disk inode values commit the transaction unlock the inodes -Eric - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Boot failure with ext2 and initrds
On Tue, 21 Nov 2006, Mingming Cao wrote: > On Mon, 2006-11-20 at 16:19 +, Hugh Dickins wrote: > > After four days of running, the EM64T has at last reproduced the same > > hang as it did in an hour before: stuck in > > ext2_try_to_allocate_with_rsv, > > repeatedly ext2_rsv_window_add, ext2_try_to_allocate, > > rsv_window_remove > > (perhaps not in that order). > > At last it did happen again, on both x86_64 and ppc64. > > Don't have much clue, still...:( Don't worry, KDB helped me work it out, patch follows in a moment. > > The logic of the while loop in ext2_try_to_allocate_with_rsv() looks > like: Thanks for your clarifications and tips. > > > A bit confused, did the whole system hang or just the "ld" writer hang > in ext2 block allocation? And what other writers were waiting for? Were > they trying to write to the same file? The system was pingable, but couldn't do much else. Only the one "ld" was active on the ext2 filesystem by this time, other tasks of the make just waiting on it, nothing else was trying to write there. 4 cpus, well, 2*HT: why couldn't I ssh or login? I don't know, something else to investigate, but that can be quite separate: very unlikely to be related to the particular ext2 bug which showed it (that filesystem was just for the test, it's not my root). Probably stupidly obvious once I've worked it out. > > It might be helpful to check the return value of ext2_try_to_allocate(), > to see if it fails. It would be nice if you could check if there any > free bit inside the reservation window. After screwing up last time, I was ultra-paranoid this time, and did not dare to set any breakpoints: proceeded by repeatedly breaking in from the keyboard, and the time I happened to catch it on return from memscan() was revealing. > > And could you check the start and end block for every rsv_window_add and > rsv_window_remove, to see if it was keep creating and removing the same > window in the same block group? The same every time it settled on a usable reservation, but a lot of wasted adds and removes as it works across a fully allocated area of the bitmap. Not very efficient, all those rb_tree insertions and removals until it gets to a free area; but I can't judge from this example how common that would be, may not be worth bothering about. > > And it might be useful to dump the whole filesystem block reservation > tree as well. Not necessary, I've put just the relevant numbers in the patch comment, it helped me a lot to see the actual numbers and work it out from there. > > Not sure if it worth the effort to test it on ext3. The ext2 block > reservation code in 2.6.19-rc5-mm2 looks pretty much the same as ext3/4, > except the missing truncate_mutex. I understand this might take a few > days to reproduce, so this might not needed now. Turns out vanilla-ext2 and ext3 and ext4 are safe: ext3 and ext4 slightly wrong in the same way, but safe. I'll continue this thread with the bugfix patch 1/6; and five more (roughly descending order of importance, the latter just cosmetic) little fs/ext2/balloc.c patches to things I noticed on the way. Nothing very worrying. Easier to send patches than ask questions: please check, perhaps my "off-by-one" accusations are wrong, for example. If you're satisfied they're right, please apply the same to ext3 and ext4. Although 1/6 does (I believe: too early for my tests to tell) fix the bug in a minimal way, I rather think that area needs cleanup - further comments below my Signed-off-by in 1/6. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] ext2 balloc: fix _with_rsv freeze
On Tue, 28 Nov 2006, Mingming Cao wrote: > On Tue, 2006-11-28 at 17:40 +, Hugh Dickins wrote: > > After several days of testing ext2 with reservations, it got caught inside > > ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding > > on the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to > > find the free block guaranteed to be included (unless there's contention). > > > > Hmm, I suspect there is other issue: alloc_new_reservation should not > repeatedly allocating the same window, if ext2_try_to_allocate > repeatedly fails to find a free block in that window. > find_next_reservable_window() takes my_rsv (the old window that he > thinks there is no free block) as a guide to find a window "after" the > end block of my_rsv, so how could this happen? Hmmm. I haven't studied that part of the code, but what you say sounds sensible: that would leave more to be explained, yes. I guess it would happen if all the rest of the bitmap were either allocated or reserved, but I don't believe that was the case here: I have noted that the map was all 00s from offset 0x1ae onwards, plenty unallocated; I've not recorded the following reservations, but it seems unlikely they covered the remaining free area (and still covered it even when the remaining tasks got to the point of just waiting for this one). > > > Fix the range to find_next_usable_block's memscan: the scan from "here" > > (0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes > > not 2 (the relevant bytes of bitmap in this case being f7 df ff - none > > 00, but the premature cutoff implying that the last was found 00). > > > > alloc_new_reservation() reserved a window with free block, when come to > the time to claim it, it scans the window again. So it seems that the > range of the the scan is too small: The range of the scan is 1 byte too small in this case, yes. > > p = ((char *)bh->b_data) + (here >> 3); > r = memscan(p, 0, (maxblocks - here + 7) >> 3); > next = (r - ((char *)bh->b_data)) << 3; > > -> next is -1 I don't understand you: next was not -1, it was 0xd08. > if (next < maxblocks && next >= here) > return next; > > --> falls to false branch No, it passed the "next < maxblocks && next >= here" test (maxblocks being 0xd0e and here being 0xcfe), so returned pointing to an allocated block - then the caller finds it cannot set the bit. > > here = bitmap_search_next_usable_block(here, bh, maxblocks); > return here; > > So we failed to find a free byte in the range. That's seems fine to me. > It's only a nice thing to have -- try to allocate a block in a place > where it's neighbors are all free also. If it fails, it will search the > window bit by bit. So I don't understand why it is not being recovered > by bitmap_search_next_usable_block(), which test the bitmap bit by bit? It already returned, it doesn't reach that line. > > > Is this a problem for mainline ext2? No, because the "size" in its memscan > > is always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a > > multiple of 8. Is this a problem for ext3 or ext4? No, because they have > > an additional extN_test_allocatable test which rescues them from the error. > > > Hmm, if the error is it prematurely think there is no free block in the > range (bitmap on disk), then even in ext3/4, it will not bother checking > the jbd copy of the bitmap. I am not sure this is the cause that ext3/4 > may not has the problem. In the ext3/4 case, it indeed won't bother to check the jbd copy (having found this bitmap bit set), it'll fall through to the bitmap_search_next_usable_block you indicated above, and that should do the right thing, finding the first free bit in the area originally reserved. > > > But the bigger question is, why does the my_rsv case come here to > > find_next_usable_block at all? > > Because grp_goal is -1? Well, yes, but my point is that we've got a reservation, and we're hoping to allocate from it (even though we've given up on the "goal"), but find_next_usable_block is not respecting it at all - liable to be allocating out of others' reservations instead. > > > Doesn't its 64-bit boundary limit, and its > > memscan, blithely ignore what the reservation prepared? > > I agree with you that the double check is urgly. But it's necessary:( If > there to prevent contention: other file make steal that free block we > reserved for this file, in the case filesystem is full of reservation... I agree it's necessary to recheck the allocation; I disagree that the 64-bit boundary limit and memscan are appropriate when my_rsv is set. > > > It's messy too, > > the complement of the memscan being that "i < 7" loop over in > > ext2_try_to_allocate. I think this ought to be cleaned up, > > in ext2+reservations and ext3 and ext4. > > > The "i<7" loop there is for
Re: [PATCH 3/6] ext2 balloc: fix off-by-one against rsv_end
On Tue, 2006-11-28 at 17:41 +, Hugh Dickins wrote: > rsv_end is the last block within the reservation, > so alloc_new_reservation should accept start_block == rsv_end as success. > > Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> > --- > Thanks, Acked. This is not a problem for now, as the default window size is 8 blocks, and we never shrink the window size. But it could be a issue in the future, if the reservation window could be dynamically shrink, when it keep failing to create a new(large) reservation window, or we keep throw away a just-allocated-window as the application is doing very seeky random write. > fs/ext2/balloc.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.0 + > +++ linux/fs/ext2/balloc.c2006-11-27 19:28:41.0 + > @@ -950,7 +950,7 @@ retry: >* check if the first free block is within the >* free space we just reserved >*/ > - if (start_block >= my_rsv->rsv_start && start_block < my_rsv->rsv_end) > + if (start_block >= my_rsv->rsv_start && start_block <= my_rsv->rsv_end) > return 0; /* success */ > /* >* if the first free bit we found is out of the reservable space - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/6] ext2 balloc: reset windowsz when full
On Tue, 2006-11-28 at 17:40 +, Hugh Dickins wrote: > ext2_new_blocks should reset the reservation window size to 0 when squeezing > the last blocks out of an almost full filesystem, so the retry doesn't skip > any groups with less than half that free, reporting ENOSPC too soon. > I realize this is a bug when I was looking at the code as well. I was thinking we should not check for whether the window size is less than the free blocks counter at all, when the allocation is fall back to non reservation allocation. But your fix works as well. Mingming - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] ext2 balloc: fix _with_rsv freeze
On Tue, 2006-11-28 at 17:40 +, Hugh Dickins wrote: > After several days of testing ext2 with reservations, it got caught inside > ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding > on the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to > find the free block guaranteed to be included (unless there's contention). > Hmm, I suspect there is other issue: alloc_new_reservation should not repeatedly allocating the same window, if ext2_try_to_allocate repeatedly fails to find a free block in that window. find_next_reservable_window() takes my_rsv (the old window that he thinks there is no free block) as a guide to find a window "after" the end block of my_rsv, so how could this happen? > Fix the range to find_next_usable_block's memscan: the scan from "here" > (0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes > not 2 (the relevant bytes of bitmap in this case being f7 df ff - none > 00, but the premature cutoff implying that the last was found 00). > alloc_new_reservation() reserved a window with free block, when come to the time to claim it, it scans the window again. So it seems that the range of the the scan is too small: p = ((char *)bh->b_data) + (here >> 3); r = memscan(p, 0, (maxblocks - here + 7) >> 3); next = (r - ((char *)bh->b_data)) << 3; -> next is -1 if (next < maxblocks && next >= here) return next; --> falls to false branch here = bitmap_search_next_usable_block(here, bh, maxblocks); return here; So we failed to find a free byte in the range. That's seems fine to me. It's only a nice thing to have -- try to allocate a block in a place where it's neighbors are all free also. If it fails, it will search the window bit by bit. So I don't understand why it is not being recovered by bitmap_search_next_usable_block(), which test the bitmap bit by bit? > Is this a problem for mainline ext2? No, because the "size" in its memscan > is always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a > multiple of 8. Is this a problem for ext3 or ext4? No, because they have > an additional extN_test_allocatable test which rescues them from the error. > Hmm, if the error is it prematurely think there is no free block in the range (bitmap on disk), then even in ext3/4, it will not bother checking the jbd copy of the bitmap. I am not sure this is the cause that ext3/4 may not has the problem. > But the bigger question is, why does the my_rsv case come here to > find_next_usable_block at all? Because grp_goal is -1? > Doesn't its 64-bit boundary limit, and its > memscan, blithely ignore what the reservation prepared? I agree with you that the double check is urgly. But it's necessary:( If there to prevent contention: other file make steal that free block we reserved for this file, in the case filesystem is full of reservation... > It's messy too, > the complement of the memscan being that "i < 7" loop over in > ext2_try_to_allocate. I think this ought to be cleaned up, > in ext2+reservations and ext3 and ext4. > The "i<7" loop there is for non reservation case. Since find_next_usable_block() could find a free byte, it's trying to avoid filesystem holes by shifting the start of the free block for at most 7 times. Thanks! Mingming > fs/ext2/balloc.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.0 + > +++ linux/fs/ext2/balloc.c2006-11-27 19:28:41.0 + > @@ -570,7 +570,7 @@ find_next_usable_block(int start, struct > here = 0; > > p = ((char *)bh->b_data) + (here >> 3); > - r = memscan(p, 0, (maxblocks - here + 7) >> 3); > + r = memscan(p, 0, ((maxblocks + 7) >> 3) - (here >> 3)); > next = (r - ((char *)bh->b_data)) << 3; > > if (next < maxblocks && next >= here) > - > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] ext2 balloc: fix off-by-one against grp_goal
grp_goal 0 is a genuine goal (unlike -1), so ext2_try_to_allocate_with_rsv should treat it as such. Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> --- fs/ext2/balloc.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.0 + +++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.0 + @@ -1053,7 +1053,7 @@ ext2_try_to_allocate_with_rsv(struct sup } /* * grp_goal is a group relative block number (if there is a goal) -* 0 < grp_goal < EXT2_BLOCKS_PER_GROUP(sb) +* 0 <= grp_goal < EXT2_BLOCKS_PER_GROUP(sb) * first block is a filesystem wide block number * first block is the block number of the first block in this group */ @@ -1089,7 +1089,7 @@ ext2_try_to_allocate_with_rsv(struct sup if (!goal_in_my_reservation(&my_rsv->rsv_window, grp_goal, group, sb)) grp_goal = -1; - } else if (grp_goal > 0) { + } else if (grp_goal >= 0) { int curr = my_rsv->rsv_end - (grp_goal + group_first_block) + 1; - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/6] ext2 balloc: say rb_entry not list_entry
The reservations tree is an rb_tree not a list, so it's less confusing to use rb_entry() than list_entry() - though they're both just container_of(). Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> --- fs/ext2/balloc.c |6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.0 + +++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.0 + @@ -156,7 +156,7 @@ restart: printk("Block Allocation Reservation Windows Map (%s):\n", fn); while (n) { - rsv = list_entry(n, struct ext2_reserve_window_node, rsv_node); + rsv = rb_entry(n, struct ext2_reserve_window_node, rsv_node); if (verbose) printk("reservation window 0x%p " "start: %lu, end: %lu\n", @@ -753,7 +753,7 @@ static int find_next_reservable_window( prev = rsv; next = rb_next(&rsv->rsv_node); - rsv = list_entry(next,struct ext2_reserve_window_node,rsv_node); + rsv = rb_entry(next,struct ext2_reserve_window_node,rsv_node); /* * Reached the last reservation, we can just append to the @@ -995,7 +995,7 @@ static void try_to_extend_reservation(st if (!next) my_rsv->rsv_end += size; else { - next_rsv = list_entry(next, struct ext2_reserve_window_node, rsv_node); + next_rsv = rb_entry(next, struct ext2_reserve_window_node, rsv_node); if ((next_rsv->rsv_start - my_rsv->rsv_end - 1) >= size) my_rsv->rsv_end += size; - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/6] ext2 balloc: fix _with_rsv freeze
After several days of testing ext2 with reservations, it got caught inside ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding on the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to find the free block guaranteed to be included (unless there's contention). Fix the range to find_next_usable_block's memscan: the scan from "here" (0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes not 2 (the relevant bytes of bitmap in this case being f7 df ff - none 00, but the premature cutoff implying that the last was found 00). Is this a problem for mainline ext2? No, because the "size" in its memscan is always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a multiple of 8. Is this a problem for ext3 or ext4? No, because they have an additional extN_test_allocatable test which rescues them from the error. Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> --- But the bigger question is, why does the my_rsv case come here to find_next_usable_block at all? Doesn't its 64-bit boundary limit, and its memscan, blithely ignore what the reservation prepared? It's messy too, the complement of the memscan being that "i < 7" loop over in ext2_try_to_allocate. I think this ought to be cleaned up, in ext2+reservations and ext3 and ext4. fs/ext2/balloc.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.0 + +++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.0 + @@ -570,7 +570,7 @@ find_next_usable_block(int start, struct here = 0; p = ((char *)bh->b_data) + (here >> 3); - r = memscan(p, 0, (maxblocks - here + 7) >> 3); + r = memscan(p, 0, ((maxblocks + 7) >> 3) - (here >> 3)); next = (r - ((char *)bh->b_data)) << 3; if (next < maxblocks && next >= here) - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/6] ext2 balloc: fix off-by-one against rsv_end
rsv_end is the last block within the reservation, so alloc_new_reservation should accept start_block == rsv_end as success. Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> --- fs/ext2/balloc.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.0 + +++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.0 + @@ -950,7 +950,7 @@ retry: * check if the first free block is within the * free space we just reserved */ - if (start_block >= my_rsv->rsv_start && start_block < my_rsv->rsv_end) + if (start_block >= my_rsv->rsv_start && start_block <= my_rsv->rsv_end) return 0; /* success */ /* * if the first free bit we found is out of the reservable space - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/6] ext2 balloc: use io_error label
ext2_new_blocks has a nice io_error label for setting -EIO, so goto that in the one place that doesn't already use it. Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> --- fs/ext2/balloc.c |7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.0 + +++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.0 + @@ -1258,10 +1258,9 @@ retry_alloc: if (group_no >= ngroups) group_no = 0; gdp = ext2_get_group_desc(sb, group_no, &gdp_bh); - if (!gdp) { - *errp = -EIO; - goto out; - } + if (!gdp) + goto io_error; + free_blocks = le16_to_cpu(gdp->bg_free_blocks_count); /* * skip this group if the number of - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/6] ext2 balloc: reset windowsz when full
ext2_new_blocks should reset the reservation window size to 0 when squeezing the last blocks out of an almost full filesystem, so the retry doesn't skip any groups with less than half that free, reporting ENOSPC too soon. Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> --- fs/ext2/balloc.c |1 + 1 file changed, 1 insertion(+) --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.0 + +++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.0 + @@ -1292,6 +1292,7 @@ retry_alloc: */ if (my_rsv) { my_rsv = NULL; + windowsz = 0; group_no = goal_group; goto retry_alloc; } - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html