Re: 2.6.23-rc6: hanging ext3 dbench tests

2007-09-24 Thread Badari Pulavarty
On Mon, 2007-09-24 at 13:04 -0700, Linus Torvalds wrote:
> 
> On Mon, 24 Sep 2007, Badari Pulavarty wrote:
> > 
> > Whats happening on my machine is ..
> > 
> > dbench forks of 4 children and sends them a signal to start the work.
> > 3 out of 4 children gets the signal and does the work. One of the child
> > never gets the signal so, it waits forever in pause(). So, parent waits
> > for a longtime to kill it.
> 
> Since this *seems* to have nothing to do with the filesystem, and since it 
> *seems* to have been introduced between -rc3 and -rc4, I did
> 
>   gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/
> 
> to see what has changed. One of the commits was signal-related, and that 
> one doesn't look like it could possibly matter.
> 
> The rest were scheduler-related, which doesn't surprise me. In fact, even 
> before I looked, my reaction to your bug report was "That sounds like an 
> application race condition".
> 
> Applications shouldn't use "pause()" for waiting for a signal. It's a 
> fundamentally racy interface - the signal could have happened just 
> *before* calling pause. So it's almost always a bug to use pause(), and 
> any users should be fixed to use "sigsuspend()" instead, which can 
> atomically (and correctly) pause for a signal while the process has masked 
> it outside of the system call.
> 
> Now, I took a look at the dbench sources, and I have to say that the race 
> looks *very* unlikely (there's quite a small window in which it does
> 
>   children[i].status = getpid();
>   ** race window here **
>   pause();
> 
> and it would require *just* the right timing so that the parent doesn't 
> end up doing the "sleep(1)" (which would make the window even less likely 
> to be hit), but there does seem to be a race condition there. And it 
> *could* be that you just happen to hit it on your hw setup.
> 
> So before you do anything else, does this patch (TOTALLY UNTESTED! DONE 
> ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU 
> BAD NAMES!) make any difference?
> 
> (patch against unmodified dbench-2.0)
> 
>   Linus
> 
> ---
> diff --git a/dbench.c b/dbench.c
> index ccf5624..4be5712 100644
> --- a/dbench.c
> +++ b/dbench.c
> @@ -91,10 +91,15 @@ static double create_procs(int nprocs, void (*fn)(struct 
> child_struct * ))
>  
>   for (i=0;i   if (fork() == 0) {
> + sigset_t old, blocked;
> +
> + sigemptyset(&blocked);
> + sigaddset(&blocked, SIGCONT);
> + sigprocmask(SIG_BLOCK, &blocked, &old);
>   setbuffer(stdout, NULL, 0);
>   nb_setup(&children[i]);
>   children[i].status = getpid();
> - pause();
> + sigsuspend(&old);
>   fn(&children[i]);
>   _exit(0);
>   }

With the modified dbench, I couldn't reproduce the problem so far.
I will let it run through the night (just to be sure). 

For now, we can treat it as a tool/App issue :)

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Patch format in ext4 patch queue.

2007-09-24 Thread Karel Zak
On Fri, Sep 21, 2007 at 04:03:26PM +0530, Aneesh Kumar K.V wrote:
> First line should be
> : A single line description of the patch.
> This appear in the subject line of the mail that results from different
> commit list.

 BTW, we use this subject format for util-linux-ng and I have a simple
 python script that generates nice changelogs (for release notes) from
 git log.

 IMHO a changelog [1] that is sorted and grouped by 
 is better for end-users that output from git-shortlog.

Karel


 [1] see Changelog section:
ftp://ftp.kernel.org/pub/linux/utils/util-linux-ng/v2.13/v2.13-ReleaseNotes

-- 
 Karel Zak  <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.23-rc6: hanging ext3 dbench tests

2007-09-24 Thread Badari Pulavarty
On Mon, 2007-09-24 at 13:04 -0700, Linus Torvalds wrote:
> 
> On Mon, 24 Sep 2007, Badari Pulavarty wrote:
> > 
> > Whats happening on my machine is ..
> > 
> > dbench forks of 4 children and sends them a signal to start the work.
> > 3 out of 4 children gets the signal and does the work. One of the child
> > never gets the signal so, it waits forever in pause(). So, parent waits
> > for a longtime to kill it.
> 
> Since this *seems* to have nothing to do with the filesystem, and since it 
> *seems* to have been introduced between -rc3 and -rc4, I did
> 
>   gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/

I was wrong. I managed to reproduce on 2.6.23-rc3, but it took a long
time. But I never reproduced it on 2.6.22. Ran test for a day.

> 
> to see what has changed. One of the commits was signal-related, and that 
> one doesn't look like it could possibly matter.
> 
> The rest were scheduler-related, which doesn't surprise me. In fact, even 
> before I looked, my reaction to your bug report was "That sounds like an 
> application race condition".
> 
> Applications shouldn't use "pause()" for waiting for a signal. It's a 
> fundamentally racy interface - the signal could have happened just 
> *before* calling pause. So it's almost always a bug to use pause(), and 
> any users should be fixed to use "sigsuspend()" instead, which can 
> atomically (and correctly) pause for a signal while the process has masked 
> it outside of the system call.
> 
> Now, I took a look at the dbench sources, and I have to say that the race 
> looks *very* unlikely (there's quite a small window in which it does
> 
>   children[i].status = getpid();
>   ** race window here **
>   pause();
> 
> and it would require *just* the right timing so that the parent doesn't 
> end up doing the "sleep(1)" (which would make the window even less likely 
> to be hit), but there does seem to be a race condition there. And it 
> *could* be that you just happen to hit it on your hw setup.
> 
> So before you do anything else, does this patch (TOTALLY UNTESTED! DONE 
> ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU 
> BAD NAMES!) make any difference?
> 
> (patch against unmodified dbench-2.0)

I am testing the updated version of dbench now. Normally, it takes
30min-1hour to reproduce the problem (when I do infinite "dbench 4").
I will post the results soon.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.23-rc6: hanging ext3 dbench tests

2007-09-24 Thread Linus Torvalds


On Mon, 24 Sep 2007, Badari Pulavarty wrote:
> 
> Whats happening on my machine is ..
> 
> dbench forks of 4 children and sends them a signal to start the work.
> 3 out of 4 children gets the signal and does the work. One of the child
> never gets the signal so, it waits forever in pause(). So, parent waits
> for a longtime to kill it.

Since this *seems* to have nothing to do with the filesystem, and since it 
*seems* to have been introduced between -rc3 and -rc4, I did

gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/

to see what has changed. One of the commits was signal-related, and that 
one doesn't look like it could possibly matter.

The rest were scheduler-related, which doesn't surprise me. In fact, even 
before I looked, my reaction to your bug report was "That sounds like an 
application race condition".

Applications shouldn't use "pause()" for waiting for a signal. It's a 
fundamentally racy interface - the signal could have happened just 
*before* calling pause. So it's almost always a bug to use pause(), and 
any users should be fixed to use "sigsuspend()" instead, which can 
atomically (and correctly) pause for a signal while the process has masked 
it outside of the system call.

Now, I took a look at the dbench sources, and I have to say that the race 
looks *very* unlikely (there's quite a small window in which it does

children[i].status = getpid();
** race window here **
pause();

and it would require *just* the right timing so that the parent doesn't 
end up doing the "sleep(1)" (which would make the window even less likely 
to be hit), but there does seem to be a race condition there. And it 
*could* be that you just happen to hit it on your hw setup.

So before you do anything else, does this patch (TOTALLY UNTESTED! DONE 
ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU 
BAD NAMES!) make any difference?

(patch against unmodified dbench-2.0)

Linus

---
diff --git a/dbench.c b/dbench.c
index ccf5624..4be5712 100644
--- a/dbench.c
+++ b/dbench.c
@@ -91,10 +91,15 @@ static double create_procs(int nprocs, void (*fn)(struct 
child_struct * ))
 
for (i=0;ihttp://vger.kernel.org/majordomo-info.html


Re: 2.6.23-rc6: hanging ext3 dbench tests

2007-09-24 Thread Andrew Morton
On Mon, 24 Sep 2007 11:01:58 -0700
Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> Hi Andy,
> 
> I managed to reproduce the dbench problem. (not sure if its the same
> thing or not - but symptoms are same). My problem has nothing to do 
> with ext3. I can produce it on ext2, jfs also.
> 
> Whats happening on my machine is ..
> 
> dbench forks of 4 children and sends them a signal to start the work.
> 3 out of 4 children gets the signal and does the work. One of the child
> never gets the signal so, it waits forever in pause(). So, parent waits
> for a longtime to kill it.
> 
> BTW, I was trying to find out when this problem started showing up.
> So far, I managed to track it to 2.6.23-rc4. (2.6.23-rc3 doesn't seem
> to have this problem). I am going to do bi-sect and find out which
> patch caused this.
> 
> I am using dbench-2.0 which consistently reproduces the problem on
> my x86-64 box. Did you find anything new with your setup ?
> 

Thanks, Badari.  That sounds like a bit of a showstopper for 2.6.23.
Michal, do you have this regression in the dirt file?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.23-rc6: hanging ext3 dbench tests

2007-09-24 Thread Badari Pulavarty
Hi Andy,

I managed to reproduce the dbench problem. (not sure if its the same
thing or not - but symptoms are same). My problem has nothing to do 
with ext3. I can produce it on ext2, jfs also.

Whats happening on my machine is ..

dbench forks of 4 children and sends them a signal to start the work.
3 out of 4 children gets the signal and does the work. One of the child
never gets the signal so, it waits forever in pause(). So, parent waits
for a longtime to kill it.

BTW, I was trying to find out when this problem started showing up.
So far, I managed to track it to 2.6.23-rc4. (2.6.23-rc3 doesn't seem
to have this problem). I am going to do bi-sect and find out which
patch caused this.

I am using dbench-2.0 which consistently reproduces the problem on
my x86-64 box. Did you find anything new with your setup ?

Thanks,
Badari



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Avoid rec_len overflow with 64KB block size

2007-09-24 Thread Jan Kara

With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0x instead and convert
value when read from / written to disk. The patch also converts some places
to use ext4_next_entry() when we are changing them anyway.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/dir.c 
linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c
--- linux-2.6.23-rc6/fs/ext4/dir.c  2007-09-18 19:22:28.0 +0200
+++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/dir.c 2007-09-24 
18:30:13.0 +0200
@@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * f
  unsigned long offset)
 {
const char * error_msg = NULL;
-   const int rlen = le16_to_cpu(de->rec_len);
+   const int rlen = ext4_rec_len_from_disk(de->rec_len);
 
if (rlen < EXT4_DIR_REC_LEN(1))
error_msg = "rec_len is smaller than minimal";
@@ -176,10 +176,10 @@ revalidate:
 * least that it is non-zero.  A
 * failure will be detected in the
 * dirent test below. */
-   if (le16_to_cpu(de->rec_len) <
-   EXT4_DIR_REC_LEN(1))
+   if (ext4_rec_len_from_disk(de->rec_len)
+   < EXT4_DIR_REC_LEN(1))
break;
-   i += le16_to_cpu(de->rec_len);
+   i += ext4_rec_len_from_disk(de->rec_len);
}
offset = i;
filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
ret = stored;
goto out;
}
-   offset += le16_to_cpu(de->rec_len);
+   offset += ext4_rec_len_from_disk(de->rec_len);
if (le32_to_cpu(de->inode)) {
/* We might block in the next section
 * if the data destination is
@@ -223,7 +223,7 @@ revalidate:
goto revalidate;
stored ++;
}
-   filp->f_pos += le16_to_cpu(de->rec_len);
+   filp->f_pos += ext4_rec_len_from_disk(de->rec_len);
}
offset = 0;
brelse (bh);
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext4/namei.c 
linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c
--- linux-2.6.23-rc6/fs/ext4/namei.c2007-09-18 19:22:28.0 +0200
+++ linux-2.6.23-rc6-1-ext4_64k_blocksize/fs/ext4/namei.c   2007-09-24 
18:35:34.0 +0200
@@ -280,7 +280,7 @@ static struct stats dx_show_leaf(struct 
space += EXT4_DIR_REC_LEN(de->name_len);
names++;
}
-   de = (struct ext4_dir_entry_2 *) ((char *) de + 
le16_to_cpu(de->rec_len));
+   de = ext4_next_entry(de);
}
printk("(%i)\n", names);
return (struct stats) { names, space, 1 };
@@ -525,7 +525,8 @@ static int ext4_htree_next_block(struct 
  */
 static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 
*p)
 {
-   return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len));
+   return (struct ext4_dir_entry_2 *)((char*)p +
+   ext4_rec_len_from_disk(p->rec_len));
 }
 
 /*
@@ -689,7 +690,7 @@ static int dx_make_map (struct ext4_dir_
cond_resched();
}
/* XXX: do we need to check rec_len == 0 case? -Chris */
-   de = (struct ext4_dir_entry_2 *) ((char *) de + 
le16_to_cpu(de->rec_len));
+   de = ext4_next_entry(de);
}
return count;
 }
@@ -790,7 +791,7 @@ static inline int search_dirblock(struct
return 1;
}
/* prevent looping on a bad block */
-   de_len = le16_to_cpu(de->rec_len);
+   de_len = ext4_rec_len_from_disk(de->rec_len);
if (de_len <= 0)
return -1;
offset += de_len;
@@ -1099,7 +1100,7 @@ dx_move_dirents(char *from, char *to, st
rec_len = EXT4_DIR_REC_LEN(de->name_len);
memcpy (to, de, rec_len);
((struct ext4_dir_entry_2 *) to)->rec_len =
-   cpu_to_le16(rec_len);
+   ext4_rec_len_to_disk(rec_len);
de->inode = 0;
map++;
to += rec_len;
@@ -1114,13 +1115,12 @@ static struct ext4_dir_entry_2* dx_pack_
 
prev = to = de;
while ((char*)de < b

Re: [PATCH 2/3] Avoid rec_len overflow with 64KB block size

2007-09-24 Thread Jan Kara

With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0x instead and convert
value when read from / written to disk. The patch also converts some places
to use ext3_next_entry() when we are changing them anyway.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext3/dir.c 
linux-2.6.23-rc6-1-ext3_64k_blocksize/fs/ext3/dir.c
--- linux-2.6.23-rc6/fs/ext3/dir.c  2007-09-18 19:22:28.0 +0200
+++ linux-2.6.23-rc6-1-ext3_64k_blocksize/fs/ext3/dir.c 2007-09-24 
18:54:56.0 +0200
@@ -69,7 +69,7 @@ int ext3_check_dir_entry (const char * f
  unsigned long offset)
 {
const char * error_msg = NULL;
-   const int rlen = le16_to_cpu(de->rec_len);
+   const int rlen = ext3_rec_len_from_disk(de->rec_len);
 
if (rlen < EXT3_DIR_REC_LEN(1))
error_msg = "rec_len is smaller than minimal";
@@ -177,10 +177,10 @@ revalidate:
 * least that it is non-zero.  A
 * failure will be detected in the
 * dirent test below. */
-   if (le16_to_cpu(de->rec_len) <
+   if (ext3_rec_len_from_disk(de->rec_len) <
EXT3_DIR_REC_LEN(1))
break;
-   i += le16_to_cpu(de->rec_len);
+   i += ext3_rec_len_from_disk(de->rec_len);
}
offset = i;
filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
ret = stored;
goto out;
}
-   offset += le16_to_cpu(de->rec_len);
+   offset += ext3_rec_len_from_disk(de->rec_len);
if (le32_to_cpu(de->inode)) {
/* We might block in the next section
 * if the data destination is
@@ -223,7 +223,7 @@ revalidate:
goto revalidate;
stored ++;
}
-   filp->f_pos += le16_to_cpu(de->rec_len);
+   filp->f_pos += ext3_rec_len_from_disk(de->rec_len);
}
offset = 0;
brelse (bh);
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext3/namei.c 
linux-2.6.23-rc6-1-ext3_64k_blocksize/fs/ext3/namei.c
--- linux-2.6.23-rc6/fs/ext3/namei.c2007-09-18 19:22:28.0 +0200
+++ linux-2.6.23-rc6-1-ext3_64k_blocksize/fs/ext3/namei.c   2007-09-24 
19:07:51.0 +0200
@@ -143,6 +143,15 @@ struct dx_map_entry
u32 offs;
 };
 
+/*
+ * p is at least 6 bytes before the end of page
+ */
+static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 
*p)
+{
+   return (struct ext3_dir_entry_2 *)((char*)p +
+   ext3_rec_len_from_disk(p->rec_len));
+}
+
 #ifdef CONFIG_EXT3_INDEX
 static inline unsigned dx_get_block (struct dx_entry *entry);
 static void dx_set_block (struct dx_entry *entry, unsigned value);
@@ -280,7 +289,7 @@ static struct stats dx_show_leaf(struct 
space += EXT3_DIR_REC_LEN(de->name_len);
names++;
}
-   de = (struct ext3_dir_entry_2 *) ((char *) de + 
le16_to_cpu(de->rec_len));
+   de = ext3_next_entry(de);
}
printk("(%i)\n", names);
return (struct stats) { names, space, 1 };
@@ -521,14 +530,6 @@ static int ext3_htree_next_block(struct 
 
 
 /*
- * p is at least 6 bytes before the end of page
- */
-static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 
*p)
-{
-   return (struct ext3_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len));
-}
-
-/*
  * This function fills a red-black tree with information from a
  * directory block.  It returns the number directory entries loaded
  * into the tree.  If there is an error it is returned in err.
@@ -689,7 +690,7 @@ static int dx_make_map (struct ext3_dir_
cond_resched();
}
/* XXX: do we need to check rec_len == 0 case? -Chris */
-   de = (struct ext3_dir_entry_2 *) ((char *) de + 
le16_to_cpu(de->rec_len));
+   de = ext3_next_entry(de);
}
return count;
 }
@@ -792,7 +793,7 @@ static inline int search_dirblock(struct
return 1;
}
/* prevent looping on a bad block */
-   de_len = le16_to_cpu(de->rec_len);
+   de_len = ext3_rec_len_from_disk(de->rec_len);
if (de_len <= 0)
return -1;
of

Re: [PATCH 1/3] Avoid rec_len overflow with 64KB block size

2007-09-24 Thread Jan Kara

With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0x instead and convert
value when read from / written to disk.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-rc6/fs/ext2/dir.c 
linux-2.6.23-rc6-1-ext2_64k_blocksize/fs/ext2/dir.c
--- linux-2.6.23-rc6/fs/ext2/dir.c  2007-07-16 17:47:21.0 +0200
+++ linux-2.6.23-rc6-1-ext2_64k_blocksize/fs/ext2/dir.c 2007-09-24 
19:26:13.0 +0200
@@ -26,6 +26,24 @@
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
 
+static inline unsigned ext2_rec_len_from_disk(__le16 dlen)
+{
+   unsigned len = le16_to_cpu(dlen);
+
+   if (len == EXT2_MAX_REC_LEN)
+   return 1 << 16;
+   return len;
+}
+
+static inline __le16 ext2_rec_len_to_disk(unsigned len)
+{
+   if (len == (1 << 16))
+   return cpu_to_le16(EXT2_MAX_REC_LEN);
+   else if (len > (1 << 16))
+   BUG();
+   return cpu_to_le16(len);
+}
+
 /*
  * ext2 uses block-sized chunks. Arguably, sector-sized ones would be
  * more robust, but we have what we have
@@ -95,7 +113,7 @@ static void ext2_check_page(struct page 
}
for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
p = (ext2_dirent *)(kaddr + offs);
-   rec_len = le16_to_cpu(p->rec_len);
+   rec_len = ext2_rec_len_from_disk(p->rec_len);
 
if (rec_len < EXT2_DIR_REC_LEN(1))
goto Eshort;
@@ -193,7 +211,8 @@ static inline int ext2_match (int len, c
  */
 static inline ext2_dirent *ext2_next_entry(ext2_dirent *p)
 {
-   return (ext2_dirent *)((char*)p + le16_to_cpu(p->rec_len));
+   return (ext2_dirent *)((char*)p +
+   ext2_rec_len_from_disk(p->rec_len));
 }
 
 static inline unsigned 
@@ -305,7 +324,7 @@ ext2_readdir (struct file * filp, void *
return 0;
}
}
-   filp->f_pos += le16_to_cpu(de->rec_len);
+   filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
}
ext2_put_page(page);
}
@@ -413,7 +432,7 @@ void ext2_set_link(struct inode *dir, st
struct page *page, struct inode *inode)
 {
unsigned from = (char *) de - (char *) page_address(page);
-   unsigned to = from + le16_to_cpu(de->rec_len);
+   unsigned to = from + ext2_rec_len_from_disk(de->rec_len);
int err;
 
lock_page(page);
@@ -469,7 +488,7 @@ int ext2_add_link (struct dentry *dentry
/* We hit i_size */
name_len = 0;
rec_len = chunk_size;
-   de->rec_len = cpu_to_le16(chunk_size);
+   de->rec_len = ext2_rec_len_to_disk(chunk_size);
de->inode = 0;
goto got_it;
}
@@ -483,7 +502,7 @@ int ext2_add_link (struct dentry *dentry
if (ext2_match (namelen, name, de))
goto out_unlock;
name_len = EXT2_DIR_REC_LEN(de->name_len);
-   rec_len = le16_to_cpu(de->rec_len);
+   rec_len = ext2_rec_len_from_disk(de->rec_len);
if (!de->inode && rec_len >= reclen)
goto got_it;
if (rec_len >= name_len + reclen)
@@ -504,8 +523,8 @@ got_it:
goto out_unlock;
if (de->inode) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
-   de1->rec_len = cpu_to_le16(rec_len - name_len);
-   de->rec_len = cpu_to_le16(name_len);
+   de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+   de->rec_len = ext2_rec_len_to_disk(name_len);
de = de1;
}
de->name_len = namelen;
@@ -536,7 +555,7 @@ int ext2_delete_entry (struct ext2_dir_e
struct inode *inode = mapping->host;
char *kaddr = page_address(page);
unsigned from = ((char*)dir - kaddr) & ~(ext2_chunk_size(inode)-1);
-   unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir->rec_len);
+   unsigned to = ((char*)dir - kaddr) + 
ext2_rec_len_from_disk(dir->rec_len);
ext2_dirent * pde = NULL;
ext2_dirent * de = (ext2_dirent *) (kaddr + from);
int err;
@@ -557,7 +576,7 @@ int ext2_delete_entry (struct ext2_dir_e
err = mapping->a_ops->prepare_write(NULL, page, from, to);
BUG_ON(err);
if (pde)
-   pde->rec_len = cpu_to_le16(to-from);
+   pde->rec_len = ext2_rec_len_to_disk(to-from);
dir->inode = 0;
err = ext2_commit_chunk(page, from, to);
inode->i_ctime = ino

[PATCH 0/3] Avoid rec_len overflow with 64KB block size

2007-09-24 Thread Jan Kara
  Hi,

  In the following series I'm sending patches to ext2, ext3 and ext4
solving possible overflow of rec_len when using 64KB blocksize. Minming,
would you add these to the patch queue? Thanks.

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[27/50] ext34: ensure do_split leaves enough free space in both blocks

2007-09-24 Thread Greg KH
From: Eric Sandeen <[EMAIL PROTECTED]>

commit ef2b02d3e617cb0400eedf2668f86215e1b0e6af in mainline.

The do_split() function for htree dir blocks is intended to split a leaf
block to make room for a new entry.  It sorts the entries in the original
block by hash value, then moves the last half of the entries to the new
block - without accounting for how much space this actually moves.  (IOW,
it moves half of the entry *count* not half of the entry *space*).  If by
chance we have both large & small entries, and we move only the smallest
entries, and we have a large new entry to insert, we may not have created
enough space for it.

The patch below stores each record size when calculating the dx_map, and
then walks the hash-sorted dx_map, calculating how many entries must be
moved to more evenly split the existing entries between the old block and
the new block, guaranteeing enough space for the new entry.

The dx_map "offs" member is reduced to u16 so that the overall map size
does not change - it is temporarily stored at the end of the new block, and
if it grows too large it may be overwritten.  By making offs and size both
u16, we won't grow the map size.

Also add a few comments to the functions involved.

This fixes the testcase reported by [EMAIL PROTECTED] on the
linux-ext4 list, "ext3 dir_index causes an error"

Thanks to Andreas Dilger for discussing the problem & solution with me.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
Tested-by: Junjiro Okajima <[EMAIL PROTECTED]>
Cc: Theodore Ts'o <[EMAIL PROTECTED]>
Cc: ext4 
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]>

---
 fs/ext3/namei.c |   39 +++
 fs/ext4/namei.c |   39 +++
 2 files changed, 70 insertions(+), 8 deletions(-)

--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -140,7 +140,8 @@ struct dx_frame
 struct dx_map_entry
 {
u32 hash;
-   u32 offs;
+   u16 offs;
+   u16 size;
 };
 
 #ifdef CONFIG_EXT3_INDEX
@@ -671,6 +672,10 @@ errout:
  * Directory block splitting, compacting
  */
 
+/*
+ * Create map of hash values, offsets, and sizes, stored at end of block.
+ * Returns number of entries mapped.
+ */
 static int dx_make_map (struct ext3_dir_entry_2 *de, int size,
struct dx_hash_info *hinfo, struct dx_map_entry 
*map_tail)
 {
@@ -684,7 +689,8 @@ static int dx_make_map (struct ext3_dir_
ext3fs_dirhash(de->name, de->name_len, &h);
map_tail--;
map_tail->hash = h.hash;
-   map_tail->offs = (u32) ((char *) de - base);
+   map_tail->offs = (u16) ((char *) de - base);
+   map_tail->size = le16_to_cpu(de->rec_len);
count++;
cond_resched();
}
@@ -694,6 +700,7 @@ static int dx_make_map (struct ext3_dir_
return count;
 }
 
+/* Sort map by hash value */
 static void dx_sort_map (struct dx_map_entry *map, unsigned count)
 {
 struct dx_map_entry *p, *q, *top = map + count - 1;
@@ -1081,6 +1088,10 @@ static inline void ext3_set_de_type(stru
 }
 
 #ifdef CONFIG_EXT3_INDEX
+/*
+ * Move count entries from end of map between two memory locations.
+ * Returns pointer to last entry moved.
+ */
 static struct ext3_dir_entry_2 *
 dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count)
 {
@@ -1099,6 +1110,10 @@ dx_move_dirents(char *from, char *to, st
return (struct ext3_dir_entry_2 *) (to - rec_len);
 }
 
+/*
+ * Compact each dir entry in the range to the minimal rec_len.
+ * Returns pointer to last entry in range.
+ */
 static struct ext3_dir_entry_2* dx_pack_dirents(char *base, int size)
 {
struct ext3_dir_entry_2 *next, *to, *prev, *de = (struct 
ext3_dir_entry_2 *) base;
@@ -1121,6 +1136,11 @@ static struct ext3_dir_entry_2* dx_pack_
return prev;
 }
 
+/*
+ * Split a full leaf block to make room for a new dir entry.
+ * Allocate a new block, and move entries so that they are approx. equally 
full.
+ * Returns pointer to de in block into which the new entry will be inserted.
+ */
 static struct ext3_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
struct buffer_head **bh,struct dx_frame *frame,
struct dx_hash_info *hinfo, int *error)
@@ -1132,7 +1152,7 @@ static struct ext3_dir_entry_2 *do_split
u32 hash2;
struct dx_map_entry *map;
char *data1 = (*bh)->b_data, *data2;
-   unsigned split;
+   unsigned split, move, size, i;
struct ext3_dir_entry_2 *de = NULL, *de2;
int err = 0;
 
@@ -1160,8 +1180,19 @@ static struct ext3_dir_entry_2 *do_split
count = dx_make_map ((struct ext3_dir_entry_2 *) data1,
 bloc

The jbd_GFP_NORFAIL patch caused the jbd_kmalloc_kzalloc.patch fail

2007-09-24 Thread Theodore Ts'o

While putting together 2.6.23-rc7-ext4-1 patchset, I noticed that the
jbd_GFP_NOFAIL patch caused a patch conflict with the kmalloc/kzalloc
patch.  I also fixed up the patch comments to make the git log look
prettier. 

- Ted


Fix up jbd k[mz]alloc patch so it applies and fix up GFP_NOFAIL comments

diff --git a/jbd_GFP_NOFAIL_flag_cleanup.patch 
b/jbd_GFP_NOFAIL_flag_cleanup.patch
index 8dec157..da02815 100644
--- a/jbd_GFP_NOFAIL_flag_cleanup.patch
+++ b/jbd_GFP_NOFAIL_flag_cleanup.patch
@@ -1,3 +1,8 @@
+jbd/jbd2: Journal initialization doesn't need __GFP_NOFAIL
+
+Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
+Signed-off-by: "Theodore Ts'o" <[EMAIL PROTECTED]>
+
 ---
  fs/jbd/journal.c  |2 +-
  fs/jbd2/journal.c |2 +-
diff --git a/jbd_kmalloc_kzalloc.patch b/jbd_kmalloc_kzalloc.patch
index a426d0e..8ca0469 100644
--- a/jbd_kmalloc_kzalloc.patch
+++ b/jbd_kmalloc_kzalloc.patch
@@ -22,7 +22,7 @@ Index: linux-2.6.23-rc6/fs/jbd/journal.c
journal_t *journal;
int err;
  
--  journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+-  journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 +  journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
@@ -51,7 +51,7 @@ Index: linux-2.6.23-rc6/fs/jbd2/journal.c
journal_t *journal;
int err;
  
--  journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+-  journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 +  journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


mballoc-core

2007-09-24 Thread Theodore Ts'o
FYI, I updated the mballoc-core patch in the ext4-patch-queue to not try
to pull in both ext4_jbd2.h and jbd.h; this was causing duplicate
definition, so I removed the #include of jbd.h

 - Ted

Remove #include of linux/jbd.h from mballoc-core so it will compile

diff --git a/mballoc-core.patch b/mballoc-core.patch
index b8839b4..1079692 100644
--- a/mballoc-core.patch
+++ b/mballoc-core.patch
@@ -357,7 +357,7 @@ Index: linux-2.6.23-rc6/fs/ext4/mballoc.c
 ===
 --- /dev/null  1970-01-01 00:00:00.0 +
 +++ linux-2.6.23-rc6/fs/ext4/mballoc.c 2007-09-20 17:26:22.0 -0700
-@@ -0,0 +1,4460 @@
+@@ -0,0 +1,4459 @@
 +/*
 + * Copyright (c) 2003-2006, Cluster File Systems, Inc, [EMAIL PROTECTED]
 + * Written by Alex Tomas <[EMAIL PROTECTED]>
@@ -385,7 +385,6 @@ Index: linux-2.6.23-rc6/fs/ext4/mballoc.c
 +#include 
 +#include 
 +#include 
-+#include 
 +#include 
 +#include 
 +#include 


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


JBD2 naming cleanups patch

2007-09-24 Thread Theodore Ts'o

The jbd_to_jbd2_naming_cleanups.patch blew up if compiled with
CONFIG_JBD2_DEBUG, because JBD2_POISON_FREE wasn't defined in poison.h.  
So I fixed up this patch as follows.


- Ted


jbd2: JBD_XXX to JBD2_XXX naming cleanup

From: Mingming Cao <[EMAIL PROTECTED]>

change JBD_XXX macros to JBD2_XXX in JBD2/Ext4

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
Signed-off-by: "Theodore Ts'o" <[EMAIL PROTECTED]>
---
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 78beb09..2be404f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -33,7 +33,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3c1397f..42cbdb5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -966,7 +966,7 @@ static int parse_options (char *options, struct super_block 
*sb,
if (option < 0)
return 0;
if (option == 0)
-   option = JBD_DEFAULT_MAX_COMMIT_AGE;
+   option = JBD2_DEFAULT_MAX_COMMIT_AGE;
sbi->s_commit_interval = HZ * option;
break;
case Opt_data_journal:
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 2cac34a..b898ee4 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -278,7 +278,7 @@ static inline void write_tag_block(int tag_bytes, 
journal_block_tag_t *tag,
   unsigned long long block)
 {
tag->t_blocknr = cpu_to_be32(block & (u32)~0);
-   if (tag_bytes > JBD_TAG_SIZE32)
+   if (tag_bytes > JBD2_TAG_SIZE32)
tag->t_blocknr_high = cpu_to_be32((block >> 31) >> 1);
 }
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index f12c65b..ff07bff 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -670,7 +670,7 @@ static journal_t * journal_init_common (void)
spin_lock_init(&journal->j_list_lock);
spin_lock_init(&journal->j_state_lock);
 
-   journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE);
+   journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE);
 
/* The journal is marked for error until we succeed with recovery! */
journal->j_flags = JBD2_ABORT;
@@ -1612,9 +1612,9 @@ int jbd2_journal_blocks_per_page(struct inode *inode)
 size_t journal_tag_bytes(journal_t *journal)
 {
if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT))
-   return JBD_TAG_SIZE64;
+   return JBD2_TAG_SIZE64;
else
-   return JBD_TAG_SIZE32;
+   return JBD2_TAG_SIZE32;
 }
 
 /*
@@ -1681,7 +1681,7 @@ static void journal_free_journal_head(struct journal_head 
*jh)
 {
 #ifdef CONFIG_JBD2_DEBUG
atomic_dec(&nr_journal_heads);
-   memset(jh, JBD_POISON_FREE, sizeof(*jh));
+   memset(jh, JBD2_POISON_FREE, sizeof(*jh));
 #endif
kmem_cache_free(jbd2_journal_head_cache, jh);
 }
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index b50be8a..d0ce627 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -311,7 +311,7 @@ int jbd2_journal_skip_recovery(journal_t *journal)
 static inline unsigned long long read_tag_block(int tag_bytes, 
journal_block_tag_t *tag)
 {
unsigned long long block = be32_to_cpu(tag->t_blocknr);
-   if (tag_bytes > JBD_TAG_SIZE32)
+   if (tag_bytes > JBD2_TAG_SIZE32)
block |= (u64)be32_to_cpu(tag->t_blocknr_high) << 32;
return block;
 }
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index 01d8897..3595fd4 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -352,7 +352,7 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long 
long blocknr,
if (bh)
BUFFER_TRACE(bh, "found on hash");
}
-#ifdef JBD_EXPENSIVE_CHECKING
+#ifdef JBD2_EXPENSIVE_CHECKING
else {
struct buffer_head *bh2;
 
@@ -453,7 +453,7 @@ int jbd2_journal_cancel_revoke(handle_t *handle, struct 
journal_head *jh)
}
}
 
-#ifdef JBD_EXPENSIVE_CHECKING
+#ifdef JBD2_EXPENSIVE_CHECKING
/* There better not be one left behind by now! */
record = find_revoke_record(journal, bh->b_blocknr);
J_ASSERT_JH(jh, record == NULL);
diff --git a/include/linux/ext4_jbd2.h b/include/linux/ext4_jbd2.h
index d716e63..38c71d3 100644
--- a/include/linux/ext4_jbd2.h
+++ b/include/linux/ext4_jbd2.h
@@ -12,8 +12,8 @@
  * Ext4-specific journaling extensions.
  */
 
-#ifndef _LINUX_EXT4_JBD_H
-#define _LINUX_EXT4_JBD_H
+#ifndef _LINUX_EXT4_JBD2_H
+#define _LINUX_EXT4_JBD2_H
 
 #include 
 #include 
@@ -228,4 +228,4 @@ static inline int ext4_should_writeback_data(struct inode 
*inode)
return 0;
 }
 
-#endif /* _LINUX_EXT4_JBD_H */
+#endif /* _LINUX_EXT4_JBD2_H */
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 5f8b876..06ef114 100644
--

Re: [PATCH] obsolete libcom-err for SuSE e2fsprogs

2007-09-24 Thread Eric Sandeen
Theodore Tso wrote:
> On Mon, Sep 24, 2007 at 11:25:39AM +0200, Karel Zak wrote:
>>> Nope, there isn't much rationale unless we also split out fsck, so
>>  util-linux-ng is open for fsck :-)
> 
> It's something I'm considering, but at the moment I've got some higher
> priority (ext4) work that I'm focusing on.  
> 
> What would you think about moving libblkid and the libdisk from
> xfsprogs into util-linux-ng?  That's things that I've been batting
> around doing, although that means we end up moving more libraries that
> distros would probably want separated into different packages into
> util-linux-ng

Heh, sounds a little like a prisoner exchange... "we'll give up libblkid
& libcom_err if you give up libdisk" :)

Seriously, though, that all sounds like a pretty good idea to me,
especially since e2fsprogs has been making noise about utilizing some of
the functionality in xfsprogs (stripe discovery etc).

-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


2.6.23-rc7-ext4-1 pushed out

2007-09-24 Thread Theodore Ts'o

I've pushed out a new ext4 patchset, based off of 2.6.23-rc7.  I had to
fix a number of patch application errors and compile-time errors in the
ext4 patch queue; I'll send mail separately about each of those fix ups.

It's available in the standard place:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 2.6.23-rc7-ext4-1

and

ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/2.6.23-rc7-ext4-1

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] obsolete libcom-err for SuSE e2fsprogs

2007-09-24 Thread Theodore Tso
On Mon, Sep 24, 2007 at 11:25:39AM +0200, Karel Zak wrote:
> > 
> > Nope, there isn't much rationale unless we also split out fsck, so
> 
>  util-linux-ng is open for fsck :-)

It's something I'm considering, but at the moment I've got some higher
priority (ext4) work that I'm focusing on.  

What would you think about moving libblkid and the libdisk from
xfsprogs into util-linux-ng?  That's things that I've been batting
around doing, although that means we end up moving more libraries that
distros would probably want separated into different packages into
util-linux-ng

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] obsolete libcom-err for SuSE e2fsprogs

2007-09-24 Thread Karel Zak
On Thu, Sep 20, 2007 at 05:54:27PM -0400, Theodore Tso wrote:
> On Thu, Sep 20, 2007 at 03:22:16PM -0500, Eric Sandeen wrote:
> > I'd do this, my rpm-fu is still reasonably strong, though - I'm curious,
> > is there a compelling reason to split out just libcom-err?  what about
> > libuuid?  libblkid?  e2fsprogs is a bit of a grab bag of things.  What's
> > the rationale for the split?
> 
> Nope, there isn't much rationale unless we also split out fsck, so

 util-linux-ng is open for fsck :-)

 Seriously, when we move fsck(8) to util-linux-ng we can compile this
 util against libblkid **or** libvolume_id (udev). For example Suse prefers
 libvolume_id, but RHEL/Fedora prefers libblkid, ..etc. The util-linux-ng
 already has fsprobe API for both libraries -- port fsck(8) to
 this API is 10mins work...

> that people who don't want to use any ext 2/3/4 filesystems don't need
> to install programs like e2fsck, mke2fs, libext2fs.so, etc.

 mount(8) dependences on libblkid (or libvolume_id from udev).

 I think sane downstream distributions are using separate binary
 package(s) (RPM calls it subpackages) for libraries from e2fsprogs.

 My $0.02...

Karel


PS. my workstation with Fedora 7:

# rpm -q --whatrequires libcom_err.so.2 libuuid.so.1 libblkid.so.1 | sort -u
apr-1.2.8-6
apr-util-1.2.8-7
cryptsetup-luks-1.0.5-4.fc7.1
curl-7.16.2-1.fc7
e2fsprogs-libs-1.40.2-2.fc7
evolution-data-server-1.10.3.1-2.fc7
gnome-vfs2-2.18.1-4.fc7
krb5-devel-1.6.1-4.fc7
krb5-libs-1.6.1-4.fc7
libpurple-2.1.1-1.fc7
neon-0.25.5-6
openssl-0.9.8b-14.fc7
pam_krb5-2.2.11-1
parted-1.8.6-4.fc7
subversion-1.4.3-4

-- 
 Karel Zak  <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html