Re: [2/3] 2.6.23-rc6: known regressions v2
FS Subject : hanging ext3 dbench tests References : http://lkml.org/lkml/2007/9/11/176 Last known good : ? Submitter : Andy Whitcroft [EMAIL PROTECTED] Caused-By : ? Handled-By : ? Status : under test -- unreproducible at present Yep... Hard to do anything until Andy is able to reproduce it at least once more to gather needed info. Subject : umount triggers a warning in jfs and takes almost a minute References : http://lkml.org/lkml/2007/9/4/73 Last known good : ? Submitter : Oliver Neukum [EMAIL PROTECTED] Caused-By : ? Handled-By : ? Status : unknown I thought Shaggy asked Oliver about some details (and he did not answer so far) so I'd assume Shaggy is handling this. Honza -- Jan Kara [EMAIL PROTECTED] SuSE CR Labs - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote: On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote: Here is the incremental small cleanup patch. Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc. Shouldn't we kill jbd_kmalloc instead? It seems useful to me to keep jbd_kmalloc/jbd_free. They are central places to handle memory (de)allocation(page size) via kmalloc/kfree, so in the future if we need to change memory allocation in jbd(e.g. not using kmalloc or using different flag), we don't need to touch every place in the jbd code calling jbd_kmalloc. Regards, Mingming - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Ext4 devel interlock meeting minutes (September 17, 2007)
Ext4 Developer Interlock Call: September 17, 2007 meeting minutes Attendees: Mingming Cao, Andreas Dilger, Eric Sandeen, Dave Kleikamp, Ted Ts'o, Avantika Mathur Minutes can be accessed at: http://ext4.wiki.kernel.org/index.php/Ext4_Developer%27s_Conference_Call PATCH QUEUE: Reviewed patches in the patch queue, identifying those that are ready to be merged with 2.6.24-rc1 - Journal_Checksum: Not yet ready for mainline, Avantika and Girish are looking into kernel oops on fsstress. - Uninitialized Block Groups: Ready to be pushed upstream, Avantika will post this patch to lkml - Large Block Support: Mingming tested these patches on various architectures. There should be new testcases added to e2fsprogs to test large block sizes. Andreas also described a possible error case with empty directories. Otherwise the feature is ready to be pushed to lkml - JBD Cleanups: Ready to be pushed upstream - Flex_BG Patches: should be pushed upstream asap. The required flag in e2fsprogs is already reserved for this feature. - Delalloc: Patch should be sent to lkml for detailed review, since not only ext4 code is being touched. Delalloc will be pushed with the other patches that are being pushed upstream now. - Shaggy will look into whether the delalloc framework can be used for JFS - We will also try porting the patch to ext3, with a search/replace, and try testing on ext3. - Mballoc: Not ready for 2.6.23. We will wait for the next release, as it hasn't been in the patch queue for very long. The main issue is to break the patch into smaller pieces making it easier to understand. This is tricky to break such a large patch into smaller patches. - Mingming will re-order the patch queue, putting the patches that are ready to be pushed upstream at the top. E2FSPROGS: - Ted sent out his initial extents support patches to the mailing list; he is continuing on ToDo items which includes a lot of error checking, and waiting for feedback on the patches. In his patches, any knowledge of the on-disk format of the filesystem is isolated to a small part of the library, making it easier to change format in the future. - 64-bit support in e2fsprogs is background priority, while adding support for other ext4 features (e.g. unintialized block groups) is higher priority right now. Some of Valerie's pre-work/cleanup patches are ready to be merged now. - Avantika and Aneesh are working on automated testing of e2fsprogs. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext34: ensure do_split leaves enough free space in both blocks
Hi, On Mon, 2007-09-17 at 11:06 -0500, Eric Sandeen wrote: The do_split() function for htree dir blocks is intended to split a leaf block to make room for a new entry. It sorts the entries in the original block by hash value, then moves the last half of the entries to the new block - without accounting for how much space this actually moves. Nasty. Also add a few comments to the functions involved. A big improvement. :) + /* Split the existing block in the middle, size-wise */ + size = 0; + move = 0; + for (i = count-1; i = 0; i--) { + /* is more than half of this entry in 2nd half of the block? */ + if (size + map[i].size/2 blocksize/2) + break; + size += map[i].size; + move++; + } + /* map index at which we will split */ + split = count - move; This is the guts of it, and I spent a while looking just to convince myself it was getting things right in the corner case and was splitting at _exactly_ the right point. If we have large dirents and 1k blocksize, even getting the split point off-by-one will still leave us vulnerable to the bug. But it all looks fine. My only other comment would be that move is redundant, you could lose it completely and just do split = i+1; but I think the logic's easier to understand the way it is. Nice catch. --Stephen - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote: On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote: On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote: Here is the incremental small cleanup patch. Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc. Shouldn't we kill jbd_kmalloc instead? It seems useful to me to keep jbd_kmalloc/jbd_free. They are central places to handle memory (de)allocation(page size) via kmalloc/kfree, so in the future if we need to change memory allocation in jbd(e.g. not using kmalloc or using different flag), we don't need to touch every place in the jbd code calling jbd_kmalloc. I disagree. Why would jbd need to globally change the way it allocates memory? It currently uses kmalloc (and jbd_kmalloc) for allocating a variety of structures. Having to change one particular instance won't necessarily mean we want to change all of them. Adding unnecessary wrappers only obfuscates the code making it harder to understand. You wouldn't want every subsystem to have it's own *_kmalloc() that took different arguments. Besides, there aren't that many calls to kmalloc and kfree in the jbd code, so there wouldn't be much pain in changing GFP flags or whatever, if it ever needed to be done. Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 13:04 -0500, Dave Kleikamp wrote: On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote: On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote: On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote: Here is the incremental small cleanup patch. Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc. Shouldn't we kill jbd_kmalloc instead? It seems useful to me to keep jbd_kmalloc/jbd_free. They are central places to handle memory (de)allocation(page size) via kmalloc/kfree, so in the future if we need to change memory allocation in jbd(e.g. not using kmalloc or using different flag), we don't need to touch every place in the jbd code calling jbd_kmalloc. I disagree. Why would jbd need to globally change the way it allocates memory? It currently uses kmalloc (and jbd_kmalloc) for allocating a variety of structures. Having to change one particular instance won't necessarily mean we want to change all of them. Adding unnecessary wrappers only obfuscates the code making it harder to understand. You wouldn't want every subsystem to have it's own *_kmalloc() that took different arguments. Besides, there aren't that many calls to kmalloc and kfree in the jbd code, so there wouldn't be much pain in changing GFP flags or whatever, if it ever needed to be done. Shaggy Okay, Points taken, Here is the updated patch to get rid of slab management and jbd_kmalloc from jbd totally. This patch is intend to replace the patch in mm tree, Andrew, could you pick up this one instead? Thanks, Mingming jbd/jbd2: JBD memory allocation cleanups From: Christoph Lameter [EMAIL PROTECTED] JBD: Replace slab allocations with page cache allocations JBD allocate memory for committed_data and frozen_data from slab. However JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly Signed-off-by: Christoph Lameter [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd/commit.c |6 +-- fs/jbd/journal.c | 99 ++ fs/jbd/transaction.c | 12 +++--- fs/jbd2/commit.c |6 +-- fs/jbd2/journal.c | 99 ++ fs/jbd2/transaction.c | 18 - include/linux/jbd.h | 18 + include/linux/jbd2.h | 21 +- 8 files changed, 52 insertions(+), 227 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c === --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-18 17:19:01.0 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-18 17:51:21.0 -0700 @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit); static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *); static void __journal_abort_soft (journal_t *journal, int errno); -static int journal_create_jbd_slab(size_t slab_size); /* * Helper function used to manage commit timeouts @@ -334,10 +333,10 @@ repeat: char *tmp; jbd_unlock_bh_state(bh_in); - tmp = jbd_slab_alloc(bh_in-b_size, GFP_NOFS); + tmp = jbd_alloc(bh_in-b_size, GFP_NOFS); jbd_lock_bh_state(bh_in); if (jh_in-b_frozen_data) { - jbd_slab_free(tmp, bh_in-b_size); + jbd_free(tmp, bh_in-b_size); goto repeat; } @@ -654,7 +653,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal) } } - /* -* Create a slab for this blocksize -*/ - err = journal_create_jbd_slab(be32_to_cpu(sb-s_blocksize)); - if (err) - return err; - /* Let the recovery code check whether it needs to recover any * data from the journal. */ if (journal_recover(journal)) @@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode } /* - * Simple support for retrying memory allocations. Introduced to help to - * debug different VM deadlock avoidance strategies. - */ -void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry) -{ - return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0)); -} - -/* - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed - * and allocate frozen and commit buffers from these slabs. - * - * Reason for doing this is to avoid, SLAB_DEBUG - since it could
Re: [PATCH] JBD slab cleanups
On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao [EMAIL PROTECTED] wrote: JBD: Replace slab allocations with page cache allocations JBD allocate memory for committed_data and frozen_data from slab. However JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly __GFP_NOFAIL should only be used when we have no way of recovering from failure. The allocation in journal_init_common() (at least) _can_ recover and hence really shouldn't be using __GFP_NOFAIL. (Actually, nothing in the kernel should be using __GFP_NOFAIL. It is there as a marker which says we really shouldn't be doing this but we don't know how to fix it). So sometime it'd be good if you could review all the __GFP_NOFAILs in there and see if we can remove some, thanks. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Ext4: Uninitialized Block Groups
On Tue, 18 Sep 2007 17:25:31 -0700 Avantika Mathur [EMAIL PROTECTED] wrote: + +__u16 crc16(__u16 crc, __u8 const *buffer, size_t len) And is we really really have to do this, then the ext4-private crc16() should have static scope. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 16/22] ext34: ensure do_split leaves enough free space in both blocks
From: Eric Sandeen [EMAIL PROTECTED] The do_split() function for htree dir blocks is intended to split a leaf block to make room for a new entry. It sorts the entries in the original block by hash value, then moves the last half of the entries to the new block - without accounting for how much space this actually moves. (IOW, it moves half of the entry *count* not half of the entry *space*). If by chance we have both large small entries, and we move only the smallest entries, and we have a large new entry to insert, we may not have created enough space for it. The patch below stores each record size when calculating the dx_map, and then walks the hash-sorted dx_map, calculating how many entries must be moved to more evenly split the existing entries between the old block and the new block, guaranteeing enough space for the new entry. The dx_map offs member is reduced to u16 so that the overall map size does not change - it is temporarily stored at the end of the new block, and if it grows too large it may be overwritten. By making offs and size both u16, we won't grow the map size. Also add a few comments to the functions involved. This fixes the testcase reported by [EMAIL PROTECTED] on the linux-ext4 list, ext3 dir_index causes an error Thanks to Andreas Dilger for discussing the problem solution with me. Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Tested-by: Junjiro Okajima [EMAIL PROTECTED] Cc: Theodore Ts'o [EMAIL PROTECTED] Cc: linux-ext4@vger.kernel.org Cc: [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- fs/ext3/namei.c | 39 +++ fs/ext4/namei.c | 39 +++ 2 files changed, 70 insertions(+), 8 deletions(-) diff -puN fs/ext3/namei.c~ext34-ensure-do_split-leaves-enough-free-space-in-both-blocks fs/ext3/namei.c --- a/fs/ext3/namei.c~ext34-ensure-do_split-leaves-enough-free-space-in-both-blocks +++ a/fs/ext3/namei.c @@ -140,7 +140,8 @@ struct dx_frame struct dx_map_entry { u32 hash; - u32 offs; + u16 offs; + u16 size; }; #ifdef CONFIG_EXT3_INDEX @@ -697,6 +698,10 @@ errout: * Directory block splitting, compacting */ +/* + * Create map of hash values, offsets, and sizes, stored at end of block. + * Returns number of entries mapped. + */ static int dx_make_map (struct ext3_dir_entry_2 *de, int size, struct dx_hash_info *hinfo, struct dx_map_entry *map_tail) { @@ -710,7 +715,8 @@ static int dx_make_map (struct ext3_dir_ ext3fs_dirhash(de-name, de-name_len, h); map_tail--; map_tail-hash = h.hash; - map_tail-offs = (u32) ((char *) de - base); + map_tail-offs = (u16) ((char *) de - base); + map_tail-size = le16_to_cpu(de-rec_len); count++; cond_resched(); } @@ -720,6 +726,7 @@ static int dx_make_map (struct ext3_dir_ return count; } +/* Sort map by hash value */ static void dx_sort_map (struct dx_map_entry *map, unsigned count) { struct dx_map_entry *p, *q, *top = map + count - 1; @@ -1117,6 +1124,10 @@ static inline void ext3_set_de_type(stru } #ifdef CONFIG_EXT3_INDEX +/* + * Move count entries from end of map between two memory locations. + * Returns pointer to last entry moved. + */ static struct ext3_dir_entry_2 * dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count) { @@ -1135,6 +1146,10 @@ dx_move_dirents(char *from, char *to, st return (struct ext3_dir_entry_2 *) (to - rec_len); } +/* + * Compact each dir entry in the range to the minimal rec_len. + * Returns pointer to last entry in range. + */ static struct ext3_dir_entry_2* dx_pack_dirents(char *base, int size) { struct ext3_dir_entry_2 *next, *to, *prev, *de = (struct ext3_dir_entry_2 *) base; @@ -1157,6 +1172,11 @@ static struct ext3_dir_entry_2* dx_pack_ return prev; } +/* + * Split a full leaf block to make room for a new dir entry. + * Allocate a new block, and move entries so that they are approx. equally full. + * Returns pointer to de in block into which the new entry will be inserted. + */ static struct ext3_dir_entry_2 *do_split(handle_t *handle, struct inode *dir, struct buffer_head **bh,struct dx_frame *frame, struct dx_hash_info *hinfo, int *error) @@ -1168,7 +1188,7 @@ static struct ext3_dir_entry_2 *do_split u32 hash2; struct dx_map_entry *map; char *data1 = (*bh)-b_data, *data2; - unsigned split; + unsigned split, move, size, i; struct ext3_dir_entry_2 *de = NULL, *de2; int err = 0; @@ -1196,8 +1216,19 @@ static struct ext3_dir_entry_2 *do_split count = dx_make_map ((struct ext3_dir_entry_2 *) data1,