Re: 2.6.23-rc6: hanging ext3 dbench tests
On Mon, 2007-09-24 at 13:04 -0700, Linus Torvalds wrote: > > On Mon, 24 Sep 2007, Badari Pulavarty wrote: > > > > Whats happening on my machine is .. > > > > dbench forks of 4 children and sends them a signal to start the work. > > 3 out of 4 children gets the signal and does the work. One of the child > > never gets the signal so, it waits forever in pause(). So, parent waits > > for a longtime to kill it. > > Since this *seems* to have nothing to do with the filesystem, and since it > *seems* to have been introduced between -rc3 and -rc4, I did > > gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/ > > to see what has changed. One of the commits was signal-related, and that > one doesn't look like it could possibly matter. > > The rest were scheduler-related, which doesn't surprise me. In fact, even > before I looked, my reaction to your bug report was "That sounds like an > application race condition". > > Applications shouldn't use "pause()" for waiting for a signal. It's a > fundamentally racy interface - the signal could have happened just > *before* calling pause. So it's almost always a bug to use pause(), and > any users should be fixed to use "sigsuspend()" instead, which can > atomically (and correctly) pause for a signal while the process has masked > it outside of the system call. > > Now, I took a look at the dbench sources, and I have to say that the race > looks *very* unlikely (there's quite a small window in which it does > > children[i].status = getpid(); > ** race window here ** > pause(); > > and it would require *just* the right timing so that the parent doesn't > end up doing the "sleep(1)" (which would make the window even less likely > to be hit), but there does seem to be a race condition there. And it > *could* be that you just happen to hit it on your hw setup. > > So before you do anything else, does this patch (TOTALLY UNTESTED! DONE > ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU > BAD NAMES!) make any difference? > > (patch against unmodified dbench-2.0) > > Linus > > --- > diff --git a/dbench.c b/dbench.c > index ccf5624..4be5712 100644 > --- a/dbench.c > +++ b/dbench.c > @@ -91,10 +91,15 @@ static double create_procs(int nprocs, void (*fn)(struct > child_struct * )) > > for (i=0;i if (fork() == 0) { > + sigset_t old, blocked; > + > + sigemptyset(&blocked); > + sigaddset(&blocked, SIGCONT); > + sigprocmask(SIG_BLOCK, &blocked, &old); > setbuffer(stdout, NULL, 0); > nb_setup(&children[i]); > children[i].status = getpid(); > - pause(); > + sigsuspend(&old); > fn(&children[i]); > _exit(0); > } With the modified dbench, I couldn't reproduce the problem so far. I will let it run through the night (just to be sure). For now, we can treat it as a tool/App issue :) Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Patch format in ext4 patch queue.
On Fri, Sep 21, 2007 at 04:03:26PM +0530, Aneesh Kumar K.V wrote: > First line should be > : A single line description of the patch. > This appear in the subject line of the mail that results from different > commit list. BTW, we use this subject format for util-linux-ng and I have a simple python script that generates nice changelogs (for release notes) from git log. IMHO a changelog [1] that is sorted and grouped by is better for end-users that output from git-shortlog. Karel [1] see Changelog section: ftp://ftp.kernel.org/pub/linux/utils/util-linux-ng/v2.13/v2.13-ReleaseNotes -- Karel Zak <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc6: hanging ext3 dbench tests
On Mon, 2007-09-24 at 13:04 -0700, Linus Torvalds wrote: > > On Mon, 24 Sep 2007, Badari Pulavarty wrote: > > > > Whats happening on my machine is .. > > > > dbench forks of 4 children and sends them a signal to start the work. > > 3 out of 4 children gets the signal and does the work. One of the child > > never gets the signal so, it waits forever in pause(). So, parent waits > > for a longtime to kill it. > > Since this *seems* to have nothing to do with the filesystem, and since it > *seems* to have been introduced between -rc3 and -rc4, I did > > gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/ I was wrong. I managed to reproduce on 2.6.23-rc3, but it took a long time. But I never reproduced it on 2.6.22. Ran test for a day. > > to see what has changed. One of the commits was signal-related, and that > one doesn't look like it could possibly matter. > > The rest were scheduler-related, which doesn't surprise me. In fact, even > before I looked, my reaction to your bug report was "That sounds like an > application race condition". > > Applications shouldn't use "pause()" for waiting for a signal. It's a > fundamentally racy interface - the signal could have happened just > *before* calling pause. So it's almost always a bug to use pause(), and > any users should be fixed to use "sigsuspend()" instead, which can > atomically (and correctly) pause for a signal while the process has masked > it outside of the system call. > > Now, I took a look at the dbench sources, and I have to say that the race > looks *very* unlikely (there's quite a small window in which it does > > children[i].status = getpid(); > ** race window here ** > pause(); > > and it would require *just* the right timing so that the parent doesn't > end up doing the "sleep(1)" (which would make the window even less likely > to be hit), but there does seem to be a race condition there. And it > *could* be that you just happen to hit it on your hw setup. > > So before you do anything else, does this patch (TOTALLY UNTESTED! DONE > ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU > BAD NAMES!) make any difference? > > (patch against unmodified dbench-2.0) I am testing the updated version of dbench now. Normally, it takes 30min-1hour to reproduce the problem (when I do infinite "dbench 4"). I will post the results soon. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc6: hanging ext3 dbench tests
On Mon, 24 Sep 2007, Badari Pulavarty wrote: > > Whats happening on my machine is .. > > dbench forks of 4 children and sends them a signal to start the work. > 3 out of 4 children gets the signal and does the work. One of the child > never gets the signal so, it waits forever in pause(). So, parent waits > for a longtime to kill it. Since this *seems* to have nothing to do with the filesystem, and since it *seems* to have been introduced between -rc3 and -rc4, I did gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/ to see what has changed. One of the commits was signal-related, and that one doesn't look like it could possibly matter. The rest were scheduler-related, which doesn't surprise me. In fact, even before I looked, my reaction to your bug report was "That sounds like an application race condition". Applications shouldn't use "pause()" for waiting for a signal. It's a fundamentally racy interface - the signal could have happened just *before* calling pause. So it's almost always a bug to use pause(), and any users should be fixed to use "sigsuspend()" instead, which can atomically (and correctly) pause for a signal while the process has masked it outside of the system call. Now, I took a look at the dbench sources, and I have to say that the race looks *very* unlikely (there's quite a small window in which it does children[i].status = getpid(); ** race window here ** pause(); and it would require *just* the right timing so that the parent doesn't end up doing the "sleep(1)" (which would make the window even less likely to be hit), but there does seem to be a race condition there. And it *could* be that you just happen to hit it on your hw setup. So before you do anything else, does this patch (TOTALLY UNTESTED! DONE ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU BAD NAMES!) make any difference? (patch against unmodified dbench-2.0) Linus --- diff --git a/dbench.c b/dbench.c index ccf5624..4be5712 100644 --- a/dbench.c +++ b/dbench.c @@ -91,10 +91,15 @@ static double create_procs(int nprocs, void (*fn)(struct child_struct * )) for (i=0;ihttp://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc6: hanging ext3 dbench tests
On Mon, 24 Sep 2007 11:01:58 -0700 Badari Pulavarty <[EMAIL PROTECTED]> wrote: > Hi Andy, > > I managed to reproduce the dbench problem. (not sure if its the same > thing or not - but symptoms are same). My problem has nothing to do > with ext3. I can produce it on ext2, jfs also. > > Whats happening on my machine is .. > > dbench forks of 4 children and sends them a signal to start the work. > 3 out of 4 children gets the signal and does the work. One of the child > never gets the signal so, it waits forever in pause(). So, parent waits > for a longtime to kill it. > > BTW, I was trying to find out when this problem started showing up. > So far, I managed to track it to 2.6.23-rc4. (2.6.23-rc3 doesn't seem > to have this problem). I am going to do bi-sect and find out which > patch caused this. > > I am using dbench-2.0 which consistently reproduces the problem on > my x86-64 box. Did you find anything new with your setup ? > Thanks, Badari. That sounds like a bit of a showstopper for 2.6.23. Michal, do you have this regression in the dirt file? - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc6: hanging ext3 dbench tests
Hi Andy, I managed to reproduce the dbench problem. (not sure if its the same thing or not - but symptoms are same). My problem has nothing to do with ext3. I can produce it on ext2, jfs also. Whats happening on my machine is .. dbench forks of 4 children and sends them a signal to start the work. 3 out of 4 children gets the signal and does the work. One of the child never gets the signal so, it waits forever in pause(). So, parent waits for a longtime to kill it. BTW, I was trying to find out when this problem started showing up. So far, I managed to track it to 2.6.23-rc4. (2.6.23-rc3 doesn't seem to have this problem). I am going to do bi-sect and find out which patch caused this. I am using dbench-2.0 which consistently reproduces the problem on my x86-64 box. Did you find anything new with your setup ? Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] Avoid rec_len overflow with 64KB block size
With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. The patch also converts some places to use ext4_next_entry() when we are changing them anyway. Signed-off-by: Jan Kara <[EMAIL PROTECTED]> diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/dir.c linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c --- linux-2.6.23-rc6/fs/ext4/dir.c 2007-09-18 19:22:28.0 +0200 +++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c 2007-09-24 18:30:13.0 +0200 @@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * f unsigned long offset) { const char * error_msg = NULL; - const int rlen = le16_to_cpu(de->rec_len); + const int rlen = ext4_rec_len_from_disk(de->rec_len); if (rlen < EXT4_DIR_REC_LEN(1)) error_msg = "rec_len is smaller than minimal"; @@ -176,10 +176,10 @@ revalidate: * least that it is non-zero. A * failure will be detected in the * dirent test below. */ - if (le16_to_cpu(de->rec_len) < - EXT4_DIR_REC_LEN(1)) + if (ext4_rec_len_from_disk(de->rec_len) + < EXT4_DIR_REC_LEN(1)) break; - i += le16_to_cpu(de->rec_len); + i += ext4_rec_len_from_disk(de->rec_len); } offset = i; filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1)) @@ -201,7 +201,7 @@ revalidate: ret = stored; goto out; } - offset += le16_to_cpu(de->rec_len); + offset += ext4_rec_len_from_disk(de->rec_len); if (le32_to_cpu(de->inode)) { /* We might block in the next section * if the data destination is @@ -223,7 +223,7 @@ revalidate: goto revalidate; stored ++; } - filp->f_pos += le16_to_cpu(de->rec_len); + filp->f_pos += ext4_rec_len_from_disk(de->rec_len); } offset = 0; brelse (bh); diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/namei.c linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c --- linux-2.6.23-rc6/fs/ext4/namei.c2007-09-18 19:22:28.0 +0200 +++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c 2007-09-24 18:35:34.0 +0200 @@ -280,7 +280,7 @@ static struct stats dx_show_leaf(struct space += EXT4_DIR_REC_LEN(de->name_len); names++; } - de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len)); + de = ext4_next_entry(de); } printk("(%i)\n", names); return (struct stats) { names, space, 1 }; @@ -525,7 +525,8 @@ static int ext4_htree_next_block(struct */ static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 *p) { - return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len)); + return (struct ext4_dir_entry_2 *)((char*)p + + ext4_rec_len_from_disk(p->rec_len)); } /* @@ -689,7 +690,7 @@ static int dx_make_map (struct ext4_dir_ cond_resched(); } /* XXX: do we need to check rec_len == 0 case? -Chris */ - de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len)); + de = ext4_next_entry(de); } return count; } @@ -790,7 +791,7 @@ static inline int search_dirblock(struct return 1; } /* prevent looping on a bad block */ - de_len = le16_to_cpu(de->rec_len); + de_len = ext4_rec_len_from_disk(de->rec_len); if (de_len <= 0) return -1; offset += de_len; @@ -1099,7 +1100,7 @@ dx_move_dirents(char *from, char *to, st rec_len = EXT4_DIR_REC_LEN(de->name_len); memcpy (to, de, rec_len); ((struct ext4_dir_entry_2 *) to)->rec_len = - cpu_to_le16(rec_len); + ext4_rec_len_to_disk(rec_len); de->inode = 0; map++; to += rec_len; @@ -1114,13 +1115,12 @@ static struct ext4_dir_entry_2* dx_pack_ prev = to = de; while ((char*)de < b
Re: [PATCH 2/3] Avoid rec_len overflow with 64KB block size
With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. The patch also converts some places to use ext3_next_entry() when we are changing them anyway. Signed-off-by: Jan Kara <[EMAIL PROTECTED]> diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext3/dir.c linux-2.6.23-rc6-1-ext3_64k_blocksize/fs/ext3/dir.c --- linux-2.6.23-rc6/fs/ext3/dir.c 2007-09-18 19:22:28.0 +0200 +++ linux-2.6.23-rc6-1-ext3_64k_blocksize/fs/ext3/dir.c 2007-09-24 18:54:56.0 +0200 @@ -69,7 +69,7 @@ int ext3_check_dir_entry (const char * f unsigned long offset) { const char * error_msg = NULL; - const int rlen = le16_to_cpu(de->rec_len); + const int rlen = ext3_rec_len_from_disk(de->rec_len); if (rlen < EXT3_DIR_REC_LEN(1)) error_msg = "rec_len is smaller than minimal"; @@ -177,10 +177,10 @@ revalidate: * least that it is non-zero. A * failure will be detected in the * dirent test below. */ - if (le16_to_cpu(de->rec_len) < + if (ext3_rec_len_from_disk(de->rec_len) < EXT3_DIR_REC_LEN(1)) break; - i += le16_to_cpu(de->rec_len); + i += ext3_rec_len_from_disk(de->rec_len); } offset = i; filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1)) @@ -201,7 +201,7 @@ revalidate: ret = stored; goto out; } - offset += le16_to_cpu(de->rec_len); + offset += ext3_rec_len_from_disk(de->rec_len); if (le32_to_cpu(de->inode)) { /* We might block in the next section * if the data destination is @@ -223,7 +223,7 @@ revalidate: goto revalidate; stored ++; } - filp->f_pos += le16_to_cpu(de->rec_len); + filp->f_pos += ext3_rec_len_from_disk(de->rec_len); } offset = 0; brelse (bh); diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext3/namei.c linux-2.6.23-rc6-1-ext3_64k_blocksize/fs/ext3/namei.c --- linux-2.6.23-rc6/fs/ext3/namei.c2007-09-18 19:22:28.0 +0200 +++ linux-2.6.23-rc6-1-ext3_64k_blocksize/fs/ext3/namei.c 2007-09-24 19:07:51.0 +0200 @@ -143,6 +143,15 @@ struct dx_map_entry u32 offs; }; +/* + * p is at least 6 bytes before the end of page + */ +static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p) +{ + return (struct ext3_dir_entry_2 *)((char*)p + + ext3_rec_len_from_disk(p->rec_len)); +} + #ifdef CONFIG_EXT3_INDEX static inline unsigned dx_get_block (struct dx_entry *entry); static void dx_set_block (struct dx_entry *entry, unsigned value); @@ -280,7 +289,7 @@ static struct stats dx_show_leaf(struct space += EXT3_DIR_REC_LEN(de->name_len); names++; } - de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len)); + de = ext3_next_entry(de); } printk("(%i)\n", names); return (struct stats) { names, space, 1 }; @@ -521,14 +530,6 @@ static int ext3_htree_next_block(struct /* - * p is at least 6 bytes before the end of page - */ -static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p) -{ - return (struct ext3_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len)); -} - -/* * This function fills a red-black tree with information from a * directory block. It returns the number directory entries loaded * into the tree. If there is an error it is returned in err. @@ -689,7 +690,7 @@ static int dx_make_map (struct ext3_dir_ cond_resched(); } /* XXX: do we need to check rec_len == 0 case? -Chris */ - de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len)); + de = ext3_next_entry(de); } return count; } @@ -792,7 +793,7 @@ static inline int search_dirblock(struct return 1; } /* prevent looping on a bad block */ - de_len = le16_to_cpu(de->rec_len); + de_len = ext3_rec_len_from_disk(de->rec_len); if (de_len <= 0) return -1; of
Re: [PATCH 1/3] Avoid rec_len overflow with 64KB block size
With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry lenght. So we store 0x instead and convert value when read from / written to disk. Signed-off-by: Jan Kara <[EMAIL PROTECTED]> diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext2/dir.c linux-2.6.23-rc6-1-ext2_64k_blocksize/fs/ext2/dir.c --- linux-2.6.23-rc6/fs/ext2/dir.c 2007-07-16 17:47:21.0 +0200 +++ linux-2.6.23-rc6-1-ext2_64k_blocksize/fs/ext2/dir.c 2007-09-24 19:26:13.0 +0200 @@ -26,6 +26,24 @@ typedef struct ext2_dir_entry_2 ext2_dirent; +static inline unsigned ext2_rec_len_from_disk(__le16 dlen) +{ + unsigned len = le16_to_cpu(dlen); + + if (len == EXT2_MAX_REC_LEN) + return 1 << 16; + return len; +} + +static inline __le16 ext2_rec_len_to_disk(unsigned len) +{ + if (len == (1 << 16)) + return cpu_to_le16(EXT2_MAX_REC_LEN); + else if (len > (1 << 16)) + BUG(); + return cpu_to_le16(len); +} + /* * ext2 uses block-sized chunks. Arguably, sector-sized ones would be * more robust, but we have what we have @@ -95,7 +113,7 @@ static void ext2_check_page(struct page } for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) { p = (ext2_dirent *)(kaddr + offs); - rec_len = le16_to_cpu(p->rec_len); + rec_len = ext2_rec_len_from_disk(p->rec_len); if (rec_len < EXT2_DIR_REC_LEN(1)) goto Eshort; @@ -193,7 +211,8 @@ static inline int ext2_match (int len, c */ static inline ext2_dirent *ext2_next_entry(ext2_dirent *p) { - return (ext2_dirent *)((char*)p + le16_to_cpu(p->rec_len)); + return (ext2_dirent *)((char*)p + + ext2_rec_len_from_disk(p->rec_len)); } static inline unsigned @@ -305,7 +324,7 @@ ext2_readdir (struct file * filp, void * return 0; } } - filp->f_pos += le16_to_cpu(de->rec_len); + filp->f_pos += ext2_rec_len_from_disk(de->rec_len); } ext2_put_page(page); } @@ -413,7 +432,7 @@ void ext2_set_link(struct inode *dir, st struct page *page, struct inode *inode) { unsigned from = (char *) de - (char *) page_address(page); - unsigned to = from + le16_to_cpu(de->rec_len); + unsigned to = from + ext2_rec_len_from_disk(de->rec_len); int err; lock_page(page); @@ -469,7 +488,7 @@ int ext2_add_link (struct dentry *dentry /* We hit i_size */ name_len = 0; rec_len = chunk_size; - de->rec_len = cpu_to_le16(chunk_size); + de->rec_len = ext2_rec_len_to_disk(chunk_size); de->inode = 0; goto got_it; } @@ -483,7 +502,7 @@ int ext2_add_link (struct dentry *dentry if (ext2_match (namelen, name, de)) goto out_unlock; name_len = EXT2_DIR_REC_LEN(de->name_len); - rec_len = le16_to_cpu(de->rec_len); + rec_len = ext2_rec_len_from_disk(de->rec_len); if (!de->inode && rec_len >= reclen) goto got_it; if (rec_len >= name_len + reclen) @@ -504,8 +523,8 @@ got_it: goto out_unlock; if (de->inode) { ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len); - de1->rec_len = cpu_to_le16(rec_len - name_len); - de->rec_len = cpu_to_le16(name_len); + de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len); + de->rec_len = ext2_rec_len_to_disk(name_len); de = de1; } de->name_len = namelen; @@ -536,7 +555,7 @@ int ext2_delete_entry (struct ext2_dir_e struct inode *inode = mapping->host; char *kaddr = page_address(page); unsigned from = ((char*)dir - kaddr) & ~(ext2_chunk_size(inode)-1); - unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir->rec_len); + unsigned to = ((char*)dir - kaddr) + ext2_rec_len_from_disk(dir->rec_len); ext2_dirent * pde = NULL; ext2_dirent * de = (ext2_dirent *) (kaddr + from); int err; @@ -557,7 +576,7 @@ int ext2_delete_entry (struct ext2_dir_e err = mapping->a_ops->prepare_write(NULL, page, from, to); BUG_ON(err); if (pde) - pde->rec_len = cpu_to_le16(to-from); + pde->rec_len = ext2_rec_len_to_disk(to-from); dir->inode = 0; err = ext2_commit_chunk(page, from, to); inode->i_ctime = ino
[PATCH 0/3] Avoid rec_len overflow with 64KB block size
Hi, In the following series I'm sending patches to ext2, ext3 and ext4 solving possible overflow of rec_len when using 64KB blocksize. Minming, would you add these to the patch queue? Thanks. Honza -- Jan Kara <[EMAIL PROTECTED]> SUSE Labs, CR - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[27/50] ext34: ensure do_split leaves enough free space in both blocks
From: Eric Sandeen <[EMAIL PROTECTED]> commit ef2b02d3e617cb0400eedf2668f86215e1b0e6af in mainline. The do_split() function for htree dir blocks is intended to split a leaf block to make room for a new entry. It sorts the entries in the original block by hash value, then moves the last half of the entries to the new block - without accounting for how much space this actually moves. (IOW, it moves half of the entry *count* not half of the entry *space*). If by chance we have both large & small entries, and we move only the smallest entries, and we have a large new entry to insert, we may not have created enough space for it. The patch below stores each record size when calculating the dx_map, and then walks the hash-sorted dx_map, calculating how many entries must be moved to more evenly split the existing entries between the old block and the new block, guaranteeing enough space for the new entry. The dx_map "offs" member is reduced to u16 so that the overall map size does not change - it is temporarily stored at the end of the new block, and if it grows too large it may be overwritten. By making offs and size both u16, we won't grow the map size. Also add a few comments to the functions involved. This fixes the testcase reported by [EMAIL PROTECTED] on the linux-ext4 list, "ext3 dir_index causes an error" Thanks to Andreas Dilger for discussing the problem & solution with me. Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]> Tested-by: Junjiro Okajima <[EMAIL PROTECTED]> Cc: Theodore Ts'o <[EMAIL PROTECTED]> Cc: ext4 Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]> Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]> --- fs/ext3/namei.c | 39 +++ fs/ext4/namei.c | 39 +++ 2 files changed, 70 insertions(+), 8 deletions(-) --- a/fs/ext3/namei.c +++ b/fs/ext3/namei.c @@ -140,7 +140,8 @@ struct dx_frame struct dx_map_entry { u32 hash; - u32 offs; + u16 offs; + u16 size; }; #ifdef CONFIG_EXT3_INDEX @@ -671,6 +672,10 @@ errout: * Directory block splitting, compacting */ +/* + * Create map of hash values, offsets, and sizes, stored at end of block. + * Returns number of entries mapped. + */ static int dx_make_map (struct ext3_dir_entry_2 *de, int size, struct dx_hash_info *hinfo, struct dx_map_entry *map_tail) { @@ -684,7 +689,8 @@ static int dx_make_map (struct ext3_dir_ ext3fs_dirhash(de->name, de->name_len, &h); map_tail--; map_tail->hash = h.hash; - map_tail->offs = (u32) ((char *) de - base); + map_tail->offs = (u16) ((char *) de - base); + map_tail->size = le16_to_cpu(de->rec_len); count++; cond_resched(); } @@ -694,6 +700,7 @@ static int dx_make_map (struct ext3_dir_ return count; } +/* Sort map by hash value */ static void dx_sort_map (struct dx_map_entry *map, unsigned count) { struct dx_map_entry *p, *q, *top = map + count - 1; @@ -1081,6 +1088,10 @@ static inline void ext3_set_de_type(stru } #ifdef CONFIG_EXT3_INDEX +/* + * Move count entries from end of map between two memory locations. + * Returns pointer to last entry moved. + */ static struct ext3_dir_entry_2 * dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count) { @@ -1099,6 +1110,10 @@ dx_move_dirents(char *from, char *to, st return (struct ext3_dir_entry_2 *) (to - rec_len); } +/* + * Compact each dir entry in the range to the minimal rec_len. + * Returns pointer to last entry in range. + */ static struct ext3_dir_entry_2* dx_pack_dirents(char *base, int size) { struct ext3_dir_entry_2 *next, *to, *prev, *de = (struct ext3_dir_entry_2 *) base; @@ -1121,6 +1136,11 @@ static struct ext3_dir_entry_2* dx_pack_ return prev; } +/* + * Split a full leaf block to make room for a new dir entry. + * Allocate a new block, and move entries so that they are approx. equally full. + * Returns pointer to de in block into which the new entry will be inserted. + */ static struct ext3_dir_entry_2 *do_split(handle_t *handle, struct inode *dir, struct buffer_head **bh,struct dx_frame *frame, struct dx_hash_info *hinfo, int *error) @@ -1132,7 +1152,7 @@ static struct ext3_dir_entry_2 *do_split u32 hash2; struct dx_map_entry *map; char *data1 = (*bh)->b_data, *data2; - unsigned split; + unsigned split, move, size, i; struct ext3_dir_entry_2 *de = NULL, *de2; int err = 0; @@ -1160,8 +1180,19 @@ static struct ext3_dir_entry_2 *do_split count = dx_make_map ((struct ext3_dir_entry_2 *) data1, bloc
The jbd_GFP_NORFAIL patch caused the jbd_kmalloc_kzalloc.patch fail
While putting together 2.6.23-rc7-ext4-1 patchset, I noticed that the jbd_GFP_NOFAIL patch caused a patch conflict with the kmalloc/kzalloc patch. I also fixed up the patch comments to make the git log look prettier. - Ted Fix up jbd k[mz]alloc patch so it applies and fix up GFP_NOFAIL comments diff --git a/jbd_GFP_NOFAIL_flag_cleanup.patch b/jbd_GFP_NOFAIL_flag_cleanup.patch index 8dec157..da02815 100644 --- a/jbd_GFP_NOFAIL_flag_cleanup.patch +++ b/jbd_GFP_NOFAIL_flag_cleanup.patch @@ -1,3 +1,8 @@ +jbd/jbd2: Journal initialization doesn't need __GFP_NOFAIL + +Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> +Signed-off-by: "Theodore Ts'o" <[EMAIL PROTECTED]> + --- fs/jbd/journal.c |2 +- fs/jbd2/journal.c |2 +- diff --git a/jbd_kmalloc_kzalloc.patch b/jbd_kmalloc_kzalloc.patch index a426d0e..8ca0469 100644 --- a/jbd_kmalloc_kzalloc.patch +++ b/jbd_kmalloc_kzalloc.patch @@ -22,7 +22,7 @@ Index: linux-2.6.23-rc6/fs/jbd/journal.c journal_t *journal; int err; -- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL); +- journal = kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL); if (!journal) goto fail; @@ -51,7 +51,7 @@ Index: linux-2.6.23-rc6/fs/jbd2/journal.c journal_t *journal; int err; -- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL); +- journal = kmalloc(sizeof(*journal), GFP_KERNEL); + journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL); if (!journal) goto fail; - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
mballoc-core
FYI, I updated the mballoc-core patch in the ext4-patch-queue to not try to pull in both ext4_jbd2.h and jbd.h; this was causing duplicate definition, so I removed the #include of jbd.h - Ted Remove #include of linux/jbd.h from mballoc-core so it will compile diff --git a/mballoc-core.patch b/mballoc-core.patch index b8839b4..1079692 100644 --- a/mballoc-core.patch +++ b/mballoc-core.patch @@ -357,7 +357,7 @@ Index: linux-2.6.23-rc6/fs/ext4/mballoc.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.23-rc6/fs/ext4/mballoc.c 2007-09-20 17:26:22.0 -0700 -@@ -0,0 +1,4460 @@ +@@ -0,0 +1,4459 @@ +/* + * Copyright (c) 2003-2006, Cluster File Systems, Inc, [EMAIL PROTECTED] + * Written by Alex Tomas <[EMAIL PROTECTED]> @@ -385,7 +385,6 @@ Index: linux-2.6.23-rc6/fs/ext4/mballoc.c +#include +#include +#include -+#include +#include +#include +#include - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
JBD2 naming cleanups patch
The jbd_to_jbd2_naming_cleanups.patch blew up if compiled with CONFIG_JBD2_DEBUG, because JBD2_POISON_FREE wasn't defined in poison.h. So I fixed up this patch as follows. - Ted jbd2: JBD_XXX to JBD2_XXX naming cleanup From: Mingming Cao <[EMAIL PROTECTED]> change JBD_XXX macros to JBD2_XXX in JBD2/Ext4 Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> Signed-off-by: "Theodore Ts'o" <[EMAIL PROTECTED]> --- diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 78beb09..2be404f 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -33,7 +33,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 3c1397f..42cbdb5 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -966,7 +966,7 @@ static int parse_options (char *options, struct super_block *sb, if (option < 0) return 0; if (option == 0) - option = JBD_DEFAULT_MAX_COMMIT_AGE; + option = JBD2_DEFAULT_MAX_COMMIT_AGE; sbi->s_commit_interval = HZ * option; break; case Opt_data_journal: diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index 2cac34a..b898ee4 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -278,7 +278,7 @@ static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag, unsigned long long block) { tag->t_blocknr = cpu_to_be32(block & (u32)~0); - if (tag_bytes > JBD_TAG_SIZE32) + if (tag_bytes > JBD2_TAG_SIZE32) tag->t_blocknr_high = cpu_to_be32((block >> 31) >> 1); } diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index f12c65b..ff07bff 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -670,7 +670,7 @@ static journal_t * journal_init_common (void) spin_lock_init(&journal->j_list_lock); spin_lock_init(&journal->j_state_lock); - journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE); + journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE); /* The journal is marked for error until we succeed with recovery! */ journal->j_flags = JBD2_ABORT; @@ -1612,9 +1612,9 @@ int jbd2_journal_blocks_per_page(struct inode *inode) size_t journal_tag_bytes(journal_t *journal) { if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT)) - return JBD_TAG_SIZE64; + return JBD2_TAG_SIZE64; else - return JBD_TAG_SIZE32; + return JBD2_TAG_SIZE32; } /* @@ -1681,7 +1681,7 @@ static void journal_free_journal_head(struct journal_head *jh) { #ifdef CONFIG_JBD2_DEBUG atomic_dec(&nr_journal_heads); - memset(jh, JBD_POISON_FREE, sizeof(*jh)); + memset(jh, JBD2_POISON_FREE, sizeof(*jh)); #endif kmem_cache_free(jbd2_journal_head_cache, jh); } diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c index b50be8a..d0ce627 100644 --- a/fs/jbd2/recovery.c +++ b/fs/jbd2/recovery.c @@ -311,7 +311,7 @@ int jbd2_journal_skip_recovery(journal_t *journal) static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag_t *tag) { unsigned long long block = be32_to_cpu(tag->t_blocknr); - if (tag_bytes > JBD_TAG_SIZE32) + if (tag_bytes > JBD2_TAG_SIZE32) block |= (u64)be32_to_cpu(tag->t_blocknr_high) << 32; return block; } diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c index 01d8897..3595fd4 100644 --- a/fs/jbd2/revoke.c +++ b/fs/jbd2/revoke.c @@ -352,7 +352,7 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr, if (bh) BUFFER_TRACE(bh, "found on hash"); } -#ifdef JBD_EXPENSIVE_CHECKING +#ifdef JBD2_EXPENSIVE_CHECKING else { struct buffer_head *bh2; @@ -453,7 +453,7 @@ int jbd2_journal_cancel_revoke(handle_t *handle, struct journal_head *jh) } } -#ifdef JBD_EXPENSIVE_CHECKING +#ifdef JBD2_EXPENSIVE_CHECKING /* There better not be one left behind by now! */ record = find_revoke_record(journal, bh->b_blocknr); J_ASSERT_JH(jh, record == NULL); diff --git a/include/linux/ext4_jbd2.h b/include/linux/ext4_jbd2.h index d716e63..38c71d3 100644 --- a/include/linux/ext4_jbd2.h +++ b/include/linux/ext4_jbd2.h @@ -12,8 +12,8 @@ * Ext4-specific journaling extensions. */ -#ifndef _LINUX_EXT4_JBD_H -#define _LINUX_EXT4_JBD_H +#ifndef _LINUX_EXT4_JBD2_H +#define _LINUX_EXT4_JBD2_H #include #include @@ -228,4 +228,4 @@ static inline int ext4_should_writeback_data(struct inode *inode) return 0; } -#endif /* _LINUX_EXT4_JBD_H */ +#endif /* _LINUX_EXT4_JBD2_H */ diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 5f8b876..06ef114 100644 --
Re: [PATCH] obsolete libcom-err for SuSE e2fsprogs
Theodore Tso wrote: > On Mon, Sep 24, 2007 at 11:25:39AM +0200, Karel Zak wrote: >>> Nope, there isn't much rationale unless we also split out fsck, so >> util-linux-ng is open for fsck :-) > > It's something I'm considering, but at the moment I've got some higher > priority (ext4) work that I'm focusing on. > > What would you think about moving libblkid and the libdisk from > xfsprogs into util-linux-ng? That's things that I've been batting > around doing, although that means we end up moving more libraries that > distros would probably want separated into different packages into > util-linux-ng Heh, sounds a little like a prisoner exchange... "we'll give up libblkid & libcom_err if you give up libdisk" :) Seriously, though, that all sounds like a pretty good idea to me, especially since e2fsprogs has been making noise about utilizing some of the functionality in xfsprogs (stripe discovery etc). -Eric - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
2.6.23-rc7-ext4-1 pushed out
I've pushed out a new ext4 patchset, based off of 2.6.23-rc7. I had to fix a number of patch application errors and compile-time errors in the ext4 patch queue; I'll send mail separately about each of those fix ups. It's available in the standard place: git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 2.6.23-rc7-ext4-1 and ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/2.6.23-rc7-ext4-1 - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] obsolete libcom-err for SuSE e2fsprogs
On Mon, Sep 24, 2007 at 11:25:39AM +0200, Karel Zak wrote: > > > > Nope, there isn't much rationale unless we also split out fsck, so > > util-linux-ng is open for fsck :-) It's something I'm considering, but at the moment I've got some higher priority (ext4) work that I'm focusing on. What would you think about moving libblkid and the libdisk from xfsprogs into util-linux-ng? That's things that I've been batting around doing, although that means we end up moving more libraries that distros would probably want separated into different packages into util-linux-ng - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] obsolete libcom-err for SuSE e2fsprogs
On Thu, Sep 20, 2007 at 05:54:27PM -0400, Theodore Tso wrote: > On Thu, Sep 20, 2007 at 03:22:16PM -0500, Eric Sandeen wrote: > > I'd do this, my rpm-fu is still reasonably strong, though - I'm curious, > > is there a compelling reason to split out just libcom-err? what about > > libuuid? libblkid? e2fsprogs is a bit of a grab bag of things. What's > > the rationale for the split? > > Nope, there isn't much rationale unless we also split out fsck, so util-linux-ng is open for fsck :-) Seriously, when we move fsck(8) to util-linux-ng we can compile this util against libblkid **or** libvolume_id (udev). For example Suse prefers libvolume_id, but RHEL/Fedora prefers libblkid, ..etc. The util-linux-ng already has fsprobe API for both libraries -- port fsck(8) to this API is 10mins work... > that people who don't want to use any ext 2/3/4 filesystems don't need > to install programs like e2fsck, mke2fs, libext2fs.so, etc. mount(8) dependences on libblkid (or libvolume_id from udev). I think sane downstream distributions are using separate binary package(s) (RPM calls it subpackages) for libraries from e2fsprogs. My $0.02... Karel PS. my workstation with Fedora 7: # rpm -q --whatrequires libcom_err.so.2 libuuid.so.1 libblkid.so.1 | sort -u apr-1.2.8-6 apr-util-1.2.8-7 cryptsetup-luks-1.0.5-4.fc7.1 curl-7.16.2-1.fc7 e2fsprogs-libs-1.40.2-2.fc7 evolution-data-server-1.10.3.1-2.fc7 gnome-vfs2-2.18.1-4.fc7 krb5-devel-1.6.1-4.fc7 krb5-libs-1.6.1-4.fc7 libpurple-2.1.1-1.fc7 neon-0.25.5-6 openssl-0.9.8b-14.fc7 pam_krb5-2.2.11-1 parted-1.8.6-4.fc7 subversion-1.4.3-4 -- Karel Zak <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html