from:"Badari Pulavarty"

Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

2007-11-05 Thread Badari Pulavarty

On Tue, 2007-11-06 at 00:15 +0800, Andreas Dilger wrote:
> On Nov 05, 2007  08:04 -0800, Badari Pulavarty wrote:
> > On Sat, 2007-11-03 at 09:36 +0800, Andreas Dilger wrote:
> > > But...  this implies that every user of bh->b_data needs to kmap, and I
> > > don't see that in the code anywhere else.  That makes me think something
> > > else is going wrong here.
> > 
> > Most cases, this is handled in ll_rw_block() code - when we submit the
> > buffer head for IO. If the page is in highmem, we will end up creating
> > a bounce bufer for it. 
> > 
> > In our case, JBD code is trying to look at the data to do checksum
> > on it. Thats why we have to kmap() the page before looking.
> 
> My point is that there is a LOT of code in ext[234] that dereferences
> bh->b_data without kmap() (e.g. group descriptors, bitmaps, superblock,
> inode tables, etc).  Does that imply that something is forcing those
> bh pages into lowmem, or is the journal bh page in question being
> allocated in some different way that allows it to be in highmem?

Yes. You are right. Its been a while since I had to deal with HIGHMEM.
All the meta-data should be in LOWMEM. I asked Mingming to verify
what the buffer-head is pointing to when it has HIGHMEM page.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

2007-11-05 Thread Badari Pulavarty

On Sat, 2007-11-03 at 09:36 +0800, Andreas Dilger wrote:
> On Nov 02, 2007  08:31 -0800, Badari Pulavarty wrote:
> > On Fri, 2007-11-02 at 13:20 +0800, Andreas Dilger wrote:
> > > On Nov 01, 2007  17:40 -0700, Mingming Cao wrote:
> > > > Current journal checksumming patch failed fsstress test on NUMA. The 
> > > > bh->b_data passed to the crc32_be () function could be NULL pointer, 
> > > > which caused kernel oops immediately when running fsstress with -o 
> > > > journal_checksum. It is because the page is part of highmem on NUMA box.
> > > > We need to kmap the page before access the bh->b_data to calculate
> > > > the checksums.
> > > 
> > > I have no objection to the patch, per-se, but I'm surprised that there
> > > would ever be a buffer head pointing at a page in high memory?  That
> > > seems contrary to what I would expect...
> > 
> > I was surprised to see that too while helping Mingming/Avantika track
> > this issue. I was under impression that we are checksumming only
> > metadata and it should be lowmem. But only "buffer_head"s are in lowmem.
> > Pages that point to can be in Highmem.
> 
> But...  this implies that every user of bh->b_data needs to kmap, and I
> don't see that in the code anywhere else.  That makes me think something
> else is going wrong here.

Most cases, this is handled in ll_rw_block() code - when we submit the
buffer head for IO. If the page is in highmem, we will end up creating
a bounce bufer for it. 

In our case, JBD code is trying to look at the data to do checksum
on it. Thats why we have to kmap() the page before looking.

Thanks,
Badari


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

2007-11-02 Thread Badari Pulavarty

On Fri, 2007-11-02 at 13:20 +0800, Andreas Dilger wrote:
> On Nov 01, 2007  17:40 -0700, Mingming Cao wrote:
> > Current journal checksumming patch failed fsstress test on NUMA. The 
> > bh->b_data passed to the crc32_be () function could be NULL pointer, 
> > which caused kernel oops immediately when running fsstress with -o 
> > journal_checksum. It is because the page is part of highmem on NUMA box.
> > We need to kmap the page before access the bh->b_data to calculate
> > the checksums.
> 
> I have no objection to the patch, per-se, but I'm surprised that there
> would ever be a buffer head pointing at a page in high memory?  That
> seems contrary to what I would expect...

I was surprised to see that too while helping Mingming/Avantika track
this issue. I was under impression that we are checksumming only
metadata and it should be lowmem. But only "buffer_head"s are in lowmem.
Pages that point to can be in Highmem.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/2] add init_ext4_proc() stub for when CONFIG_PROC_FS is not set

2007-10-11 Thread Badari Pulavarty

On Wed, 2007-10-10 at 15:00 -0400, Theodore Tso wrote:
> On Wed, Oct 10, 2007 at 08:30:10AM -0700, Badari Pulavarty wrote:
> > Its a good start. I think there are lots of proc handling routines that
> > can be move into #ifdef CONFIG_PROC_FS also.
> > 
> > All the code around ext4_mb_read_prealloc_table(),
> > ext4_mb_write_prealloc_table(), MB_PROC_VALUE_READ(stats),
> > MB_PROC_VALUE_WRITE(stats), .. can be ifdefed out.
> 
> There's no need to ifdef them out; in include/proc_fs.h there are the
> following convenience #define's if CONFIG_PROC_FS is not defined:

Sorry. If I wasn't clear.. What I meant to say was the above
routines are NOT needed if we don't define CONFIG_PROC_FS. They
are supporting read/write to /proc files. We can #ifdef them
out to reduce the text size.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/2] add init_ext4_proc() stub for when CONFIG_PROC_FS is not set

2007-10-10 Thread Badari Pulavarty

On Tue, 2007-10-09 at 20:22 -0400, Theodore Tso wrote:
> On Tue, Oct 09, 2007 at 01:40:12PM -0400, Theodore Tso wrote:
> > On Tue, Oct 09, 2007 at 10:03:06AM -0700, Mingming Cao wrote:
> > > I guess our testing did not catch this up because we have CONFIG_PROC_FS
> > > enabled all the time. mballoc needs procfs for exporting some stats info
> > > and tunable paramenters to optimize/customize multiple allocation.
> > > 
> > > We could select CONFIG_PROC_FS at kconfig when ext4dev is enabled.
> > 
> > We definitely should be able to compile without CONFIG_PROC_FS; it's a
> > major flaw in the mballoc-core.patch that it doesn't work without it.  
> 
> I will fold the following into mballoc-core.patch.  It's sufficient to
> allow us to compile w/o CONFIG_PROC_FS defined.
> 
>   - Ted
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 4409c0c..2305af4 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2854,9 +2854,11 @@ int __init init_ext4_proc(void)
>   if (ext4_pspace_cachep == NULL)
>   return -ENOMEM;
>  
> +#ifdef CONFIG_PROC_FS
>   proc_root_ext4 = proc_mkdir(EXT4_ROOT, proc_root_fs);
>   if (proc_root_ext4 == NULL)
>   printk(KERN_ERR "EXT4-fs: Unable to create %s\n", EXT4_ROOT);
> +#endif
>  
>   return 0;
>  }
> @@ -2865,7 +2867,9 @@ void exit_ext4_proc(void)
>  {
>   /* XXX: synchronize_rcu(); */
>   kmem_cache_destroy(ext4_pspace_cachep);
> +#ifdef CONFIG_PROC_FS
>   remove_proc_entry(EXT4_ROOT, proc_root_fs);
> +#endif
>  }
>  

Its a good start. I think there are lots of proc handling routines that
can be move into #ifdef CONFIG_PROC_FS also.

All the code around ext4_mb_read_prealloc_table(),
ext4_mb_write_prealloc_table(), MB_PROC_VALUE_READ(stats),
MB_PROC_VALUE_WRITE(stats), .. can be ifdefed out.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/2] add init_ext4_proc() stub for when CONFIG_PROC_FS is not set

2007-10-09 Thread Badari Pulavarty

On Tue, 2007-10-09 at 10:03 -0700, Mingming Cao wrote:
> On Tue, 2007-10-09 at 09:31 -0700, Badari Pulavarty wrote:
> > On Tue, 2007-10-09 at 15:50 +1000, [EMAIL PROTECTED] wrote:
> > > plain text document attachment (ext4-add-init_ext4_proc-stub.patch)
> > > init_ext4_fs() calls init_ext4_proc() so we need a stub init_ext4_proc()
> > > for the case that CONFIG_PROC_FS is not set.
> > > Without the stub we get a build error:
> > > 
> > > fs/ext4/mballoc.c: In function 'init_ext4_proc':
> > > fs/ext4/mballoc.c:2837: error: 'proc_root_fs' undeclared (first use in 
> > > this function)
> > > fs/ext4/mballoc.c:2837: error: (Each undeclared identifier is reported 
> > > only once
> > > fs/ext4/mballoc.c:2837: error: for each function it appears in.)
> > > 
> > > Add a stub init_ext4_proc() function that does nothing but return 0
> > > 
> > > Signed-off-by: Mark Nelson <[EMAIL PROTECTED]>
> > > ---
> > >  fs/ext4/mballoc.c |7 +++
> > >  1 file changed, 7 insertions(+)
> > > 
> > > Index: ext4/fs/ext4/mballoc.c
> > > ===
> > > --- ext4.orig/fs/ext4/mballoc.c
> > > +++ ext4/fs/ext4/mballoc.c
> > > @@ -2825,6 +2825,7 @@ static int ext4_mb_destroy_per_dev_proc(
> > >   return 0;
> > >  }
> > >  
> > > +#ifdef CONFIG_PROC_FS
> > >  int __init init_ext4_proc(void)
> > >  {
> > >   ext4_pspace_cachep =
> > > @@ -2840,6 +2841,12 @@ int __init init_ext4_proc(void)
> > >  
> > >   return 0;
> > >  }
> > > +#else
> > > +int __init init_ext4_proc(void)
> > > +{
> > > + return 0;
> > > +}
> > > +#endif
> > >  
> > >  void exit_ext4_proc(void)
> > >  {
> > 
> > Nope. I don't think we can do this :(
> > 
> > For example, we need to create ext4_pspace_cachep kmem cache 
> > for the pre-allocation to work. We can't ifdef it out.
> > 
> > Mingming/Amit, can you take a look at this ? It looks like
> > we NEED procfs support to make mballoc work. If so, we need
> > to add it to the dependency.
> > 
> > 
> 
> I guess our testing did not catch this up because we have CONFIG_PROC_FS
> enabled all the time. mballoc needs procfs for exporting some stats info
> and tunable paramenters to optimize/customize multiple allocation.
> 
> We could select CONFIG_PROC_FS at kconfig when ext4dev is enabled.

For testing it may be okay, but in the long run we need to make sure
that ext4 doesn't depend on PROC_FS support for operation :( 

It would be nice to make it work without PROC_FS dependency, if you
configure procfs we should get more info/tunables..

ext4_pspace_cachep creation has to be moved to somewhere else ?

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/2] add init_ext4_proc() stub for when CONFIG_PROC_FS is not set

2007-10-09 Thread Badari Pulavarty

On Tue, 2007-10-09 at 15:50 +1000, [EMAIL PROTECTED] wrote:
> plain text document attachment (ext4-add-init_ext4_proc-stub.patch)
> init_ext4_fs() calls init_ext4_proc() so we need a stub init_ext4_proc()
> for the case that CONFIG_PROC_FS is not set.
> Without the stub we get a build error:
> 
> fs/ext4/mballoc.c: In function 'init_ext4_proc':
> fs/ext4/mballoc.c:2837: error: 'proc_root_fs' undeclared (first use in this 
> function)
> fs/ext4/mballoc.c:2837: error: (Each undeclared identifier is reported only 
> once
> fs/ext4/mballoc.c:2837: error: for each function it appears in.)
> 
> Add a stub init_ext4_proc() function that does nothing but return 0
> 
> Signed-off-by: Mark Nelson <[EMAIL PROTECTED]>
> ---
>  fs/ext4/mballoc.c |7 +++
>  1 file changed, 7 insertions(+)
> 
> Index: ext4/fs/ext4/mballoc.c
> ===
> --- ext4.orig/fs/ext4/mballoc.c
> +++ ext4/fs/ext4/mballoc.c
> @@ -2825,6 +2825,7 @@ static int ext4_mb_destroy_per_dev_proc(
>   return 0;
>  }
>  
> +#ifdef CONFIG_PROC_FS
>  int __init init_ext4_proc(void)
>  {
>   ext4_pspace_cachep =
> @@ -2840,6 +2841,12 @@ int __init init_ext4_proc(void)
>  
>   return 0;
>  }
> +#else
> +int __init init_ext4_proc(void)
> +{
> + return 0;
> +}
> +#endif
>  
>  void exit_ext4_proc(void)
>  {

Nope. I don't think we can do this :(

For example, we need to create ext4_pspace_cachep kmem cache 
for the pre-allocation to work. We can't ifdef it out.

Mingming/Amit, can you take a look at this ? It looks like
we NEED procfs support to make mballoc work. If so, we need
to add it to the dependency.


Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 2/2] move init_ext4_proc() last and add cleanup call exit_ext4_proc()

2007-10-09 Thread Badari Pulavarty

On Tue, 2007-10-09 at 15:50 +1000, [EMAIL PROTECTED] wrote:
> plain text document attachment (ext4-move-init_ext4_proc-add-
> cleanup.patch)
> The first problem that is addressed is that the addition of init_ext4_proc()
> means that the code doesn't clean up after itself on a failure of
> init_ext4_xattr(), and the second is that usually init_*_proc() should be
> the last thing called, so we move it to the end and add the cleanup call to
> exit_ext4_proc().
> 
> Signed-off-by: Mark Nelson <[EMAIL PROTECTED]>
> ---
>  fs/ext4/super.c |9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> Index: ext4/fs/ext4/super.c
> ===
> --- ext4.orig/fs/ext4/super.c
> +++ ext4/fs/ext4/super.c
> @@ -2999,10 +2999,6 @@ static int __init init_ext4_fs(void)
>  {
>   int err;
>  
> - err = init_ext4_proc();
> - if (err)
> - return err;
> -
>   err = init_ext4_xattr();
>   if (err)
>   return err;
> @@ -3012,7 +3008,12 @@ static int __init init_ext4_fs(void)
>   err = register_filesystem(&ext4dev_fs_type);
>   if (err)
>   goto out;
> + err = init_ext4_proc();
> + if (err)
> + goto out_proc;
>   return 0;
> +out_proc:
> + exit_ext4_proc();
>  out:
>   destroy_inodecache();
>  out1:

Nope. You can not call exit_ext4_proc() if there is a failure
in init_ext4_proc(). If the kmem_cache_create() for ext4_pspace_cachep
fails, you would end up calling kmem_cache_destory(NULL).

The best way is to make init_ext4_proc() cleanup itself, in case
of an error. Hmm.. we are not handling proc_mkdir(EXT4_ROOT) 
failures.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.23-rc9: Oops in cache_alloc_refill() mm/slab.c

2007-10-05 Thread Badari Pulavarty

On Fri, 2007-10-05 at 15:41 +0200, Valerie Clement wrote:
> Badari Pulavarty wrote:
> > On Thu, 2007-10-04 at 18:13 +0200, Valerie Clement wrote:
> >> While running ffsb tests on my ext4 filesystem, I got an Oops in 
> >> cache_alloc_refill().
> >> I turned on SLAB debugging and here is the message I got:
> >>
> >> slab: Internal list corruption detected in cache 'buffer_head'(30), 
> >> slabp 81007e100100(1515870810). Hexdump:
> > 
> > slabp->inuse = 1515870810 looks bogus. Is this easily reproducible ?
> 
> Hi Badari,
> Thanks for your answer.
> I didn't reproduce it without the latest ext4 patches. So I suspect a 
> bug in one of them.
> But how debugging this?
> Which other debug traces can I turn on?

Let me understand. You applied latest ext4 patchsets ? If so, Mingming
has some slab-cleanup changes in the patchset. You can try backing them
out and see. 

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: kernel Oops in ext3 code

2007-09-28 Thread Badari Pulavarty

On Fri, 2007-09-28 at 06:54 +0200, Norbert Preining wrote:
> Hi Mingming,
> 
> On Do, 27 Sep 2007, Mingming Cao wrote:
> > Could you please sent the objdump of the ext4_discard_reservation
> > function? It doesn't match what I see here.
> 
> I assume you meant ext3_ I made
>   objdump -x -D -s super.o
> (the only place where I found this function in the source code). If you
> want something else, let me know, but a bit more specific. Can I do the
> objdump directly from the kernel image file?
> 

objdump -DlS balloc.o 

would give us ext3_discard_reservation()

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.23-rc6: hanging ext3 dbench tests

2007-09-24 Thread Badari Pulavarty

On Mon, 2007-09-24 at 13:04 -0700, Linus Torvalds wrote:
> 
> On Mon, 24 Sep 2007, Badari Pulavarty wrote:
> > 
> > Whats happening on my machine is ..
> > 
> > dbench forks of 4 children and sends them a signal to start the work.
> > 3 out of 4 children gets the signal and does the work. One of the child
> > never gets the signal so, it waits forever in pause(). So, parent waits
> > for a longtime to kill it.
> 
> Since this *seems* to have nothing to do with the filesystem, and since it 
> *seems* to have been introduced between -rc3 and -rc4, I did
> 
>   gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/
> 
> to see what has changed. One of the commits was signal-related, and that 
> one doesn't look like it could possibly matter.
> 
> The rest were scheduler-related, which doesn't surprise me. In fact, even 
> before I looked, my reaction to your bug report was "That sounds like an 
> application race condition".
> 
> Applications shouldn't use "pause()" for waiting for a signal. It's a 
> fundamentally racy interface - the signal could have happened just 
> *before* calling pause. So it's almost always a bug to use pause(), and 
> any users should be fixed to use "sigsuspend()" instead, which can 
> atomically (and correctly) pause for a signal while the process has masked 
> it outside of the system call.
> 
> Now, I took a look at the dbench sources, and I have to say that the race 
> looks *very* unlikely (there's quite a small window in which it does
> 
>   children[i].status = getpid();
>   ** race window here **
>   pause();
> 
> and it would require *just* the right timing so that the parent doesn't 
> end up doing the "sleep(1)" (which would make the window even less likely 
> to be hit), but there does seem to be a race condition there. And it 
> *could* be that you just happen to hit it on your hw setup.
> 
> So before you do anything else, does this patch (TOTALLY UNTESTED! DONE 
> ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU 
> BAD NAMES!) make any difference?
> 
> (patch against unmodified dbench-2.0)
> 
>   Linus
> 
> ---
> diff --git a/dbench.c b/dbench.c
> index ccf5624..4be5712 100644
> --- a/dbench.c
> +++ b/dbench.c
> @@ -91,10 +91,15 @@ static double create_procs(int nprocs, void (*fn)(struct 
> child_struct * ))
>  
>   for (i=0;i   if (fork() == 0) {
> + sigset_t old, blocked;
> +
> + sigemptyset(&blocked);
> + sigaddset(&blocked, SIGCONT);
> + sigprocmask(SIG_BLOCK, &blocked, &old);
>   setbuffer(stdout, NULL, 0);
>   nb_setup(&children[i]);
>   children[i].status = getpid();
> - pause();
> + sigsuspend(&old);
>   fn(&children[i]);
>   _exit(0);
>   }

With the modified dbench, I couldn't reproduce the problem so far.
I will let it run through the night (just to be sure). 

For now, we can treat it as a tool/App issue :)

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.23-rc6: hanging ext3 dbench tests

2007-09-24 Thread Badari Pulavarty

On Mon, 2007-09-24 at 13:04 -0700, Linus Torvalds wrote:
> 
> On Mon, 24 Sep 2007, Badari Pulavarty wrote:
> > 
> > Whats happening on my machine is ..
> > 
> > dbench forks of 4 children and sends them a signal to start the work.
> > 3 out of 4 children gets the signal and does the work. One of the child
> > never gets the signal so, it waits forever in pause(). So, parent waits
> > for a longtime to kill it.
> 
> Since this *seems* to have nothing to do with the filesystem, and since it 
> *seems* to have been introduced between -rc3 and -rc4, I did
> 
>   gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/

I was wrong. I managed to reproduce on 2.6.23-rc3, but it took a long
time. But I never reproduced it on 2.6.22. Ran test for a day.

> 
> to see what has changed. One of the commits was signal-related, and that 
> one doesn't look like it could possibly matter.
> 
> The rest were scheduler-related, which doesn't surprise me. In fact, even 
> before I looked, my reaction to your bug report was "That sounds like an 
> application race condition".
> 
> Applications shouldn't use "pause()" for waiting for a signal. It's a 
> fundamentally racy interface - the signal could have happened just 
> *before* calling pause. So it's almost always a bug to use pause(), and 
> any users should be fixed to use "sigsuspend()" instead, which can 
> atomically (and correctly) pause for a signal while the process has masked 
> it outside of the system call.
> 
> Now, I took a look at the dbench sources, and I have to say that the race 
> looks *very* unlikely (there's quite a small window in which it does
> 
>   children[i].status = getpid();
>   ** race window here **
>   pause();
> 
> and it would require *just* the right timing so that the parent doesn't 
> end up doing the "sleep(1)" (which would make the window even less likely 
> to be hit), but there does seem to be a race condition there. And it 
> *could* be that you just happen to hit it on your hw setup.
> 
> So before you do anything else, does this patch (TOTALLY UNTESTED! DONE 
> ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU 
> BAD NAMES!) make any difference?
> 
> (patch against unmodified dbench-2.0)

I am testing the updated version of dbench now. Normally, it takes
30min-1hour to reproduce the problem (when I do infinite "dbench 4").
I will post the results soon.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.23-rc6: hanging ext3 dbench tests

2007-09-24 Thread Badari Pulavarty

Hi Andy,

I managed to reproduce the dbench problem. (not sure if its the same
thing or not - but symptoms are same). My problem has nothing to do 
with ext3. I can produce it on ext2, jfs also.

Whats happening on my machine is ..

dbench forks of 4 children and sends them a signal to start the work.
3 out of 4 children gets the signal and does the work. One of the child
never gets the signal so, it waits forever in pause(). So, parent waits
for a longtime to kill it.

BTW, I was trying to find out when this problem started showing up.
So far, I managed to track it to 2.6.23-rc4. (2.6.23-rc3 doesn't seem
to have this problem). I am going to do bi-sect and find out which
patch caused this.

I am using dbench-2.0 which consistently reproduces the problem on
my x86-64 box. Did you find anything new with your setup ?

Thanks,
Badari



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ext4: FLEX_BG Kernel support v2.

2007-09-21 Thread Badari Pulavarty

On Fri, 2007-09-21 at 09:06 -0500, Jose R. Santos wrote:
> From: Jose R. Santos <[EMAIL PROTECTED]>
> 
> ext4: FLEX_BG Kernel support v2.
> 

> @@ -702,13 +702,15 @@ static inline int ext4_valid_inum(struct super_block 
> *sb, unsigned long ino)
>  #define EXT4_FEATURE_INCOMPAT_META_BG0x0010
>  #define EXT4_FEATURE_INCOMPAT_EXTENTS0x0040 /* extents 
> support */
>  #define EXT4_FEATURE_INCOMPAT_64BIT  0x0080
> +#define EXT4_FEATURE_INCOMPAT_FLEX_BG0x0200

Any reason why 0x100 is skipped ?

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] JBD slab cleanups

2007-09-17 Thread Badari Pulavarty

On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
> On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> > jbd/jbd2: Replace slab allocations with page cache allocations
> > 
> > From: Christoph Lameter <[EMAIL PROTECTED]>
> > 
> > JBD should not pass slab pages down to the block layer.
> > Use page allocator pages instead. This will also prepare
> > JBD for the large blocksize patchset.
> > 
> 
> Currently memory allocation for committed_data(and frozen_buffer) for
> bufferhead is done through jbd slab management, as Christoph Hellwig
> pointed out that this is broken as jbd should not pass slab pages down
> to IO layer. and suggested to use get_free_pages() directly.
> 
> The problem with this patch, as Andreas Dilger pointed today in ext4
> interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
> 1/3-1/2 page space. 
> 
> What was the originally intention to set up slabs for committed_data(and
> frozen_buffer) in JBD? Why not using kmalloc?
> 
> Mingming

Looks good. Small suggestion is to get rid of all kmalloc() usages and
consistently use jbd_kmalloc() or jbd2_kmalloc().

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ext2 statfs improvement for block and inode free count

2007-07-19 Thread Badari Pulavarty

On Wed, 2007-07-18 at 20:18 -0700, Andrew Morton wrote:
> On Fri, 13 Jul 2007 18:36:54 -0700 Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> 
> > More statfs() improvements for ext2. ext2 already maintains
> > percpu counters for free blocks and inodes. Derive free
> > block count and inode count by summing up percpu counters,
> > instead of counting up all the groups in the filesystem
> > each time.
> > 
> 
> hm, another speedup patch with no measurements which demonstrate its
> benefit.

In my setups (4 & 8-way), I didn't measure any significant performance
improvements (in any reasonable workload). I see some decent
improvements on cooked-up (1 million stats) tests :(
> 
> > 
> > Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
> > Acked-by: Andreas Dilger <[EMAIL PROTECTED]>
> > 
> >  fs/ext2/super.c |4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6.22/fs/ext2/super.c
> > ===
> > --- linux-2.6.22.orig/fs/ext2/super.c   2007-07-13 20:06:38.0 
> > -0700
> > +++ linux-2.6.22/fs/ext2/super.c2007-07-13 20:06:51.0 -0700
> > @@ -1136,12 +1136,12 @@ static int ext2_statfs (struct dentry * 
> > buf->f_type = EXT2_SUPER_MAGIC;
> > buf->f_bsize = sb->s_blocksize;
> > buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead;
> > -   buf->f_bfree = ext2_count_free_blocks(sb);
> > +   buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
> > buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
> > if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
> > buf->f_bavail = 0;
> > buf->f_files = le32_to_cpu(es->s_inodes_count);
> > -   buf->f_ffree = ext2_count_free_inodes(sb);
> > +   buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
> > buf->f_namelen = EXT2_NAME_LEN;
> > fsid = le64_to_cpup((void *)es->s_uuid) ^
> >le64_to_cpup((void *)es->s_uuid + sizeof(u64));
> > 
> 
> Well there's a tradeoff here.  At large CPU counts, percpu_counter_sum()
> becomes quite expensive - it takes a global lock and then goes off fishing
> in every CPU's percpu_alloced memory.
> 
> So there is some value of (num_online_cpus / sb->s_groups_count) at which
> this change becomes a loss.  Where does that value lie?

Yes. I debated long time whether I should submit this or not - due to
very reason. Old code wasn't holding any locks. I don't have any high
count CPU machine (>8way) with me. I will request for time on one.

> 
> Bear in mind that the global lock in percpu_counter_sum() will tilt the
> scales quite a bit.

Noticed that too. I added WARN_ON() to see if percpu sum doesn't match
computed sum. I saw few stacks in a 24 hour run of fsx runs. 

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] ext2 statfs improvement for block and inode free count

2007-07-13 Thread Badari Pulavarty

Andrew,

Can you include it in -mm ? 

BTW, this patch is against mainline, won't apply cleanly to -mm, due to
other statfs() improvements.

Thanks,
Badari

More statfs() improvements for ext2. ext2 already maintains
percpu counters for free blocks and inodes. Derive free
block count and inode count by summing up percpu counters,
instead of counting up all the groups in the filesystem
each time.


Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
Acked-by: Andreas Dilger <[EMAIL PROTECTED]>

 fs/ext2/super.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.22/fs/ext2/super.c
===
--- linux-2.6.22.orig/fs/ext2/super.c   2007-07-13 20:06:38.0 -0700
+++ linux-2.6.22/fs/ext2/super.c2007-07-13 20:06:51.0 -0700
@@ -1136,12 +1136,12 @@ static int ext2_statfs (struct dentry * 
buf->f_type = EXT2_SUPER_MAGIC;
buf->f_bsize = sb->s_blocksize;
buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead;
-   buf->f_bfree = ext2_count_free_blocks(sb);
+   buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
buf->f_bavail = 0;
buf->f_files = le32_to_cpu(es->s_inodes_count);
-   buf->f_ffree = ext2_count_free_inodes(sb);
+   buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
buf->f_namelen = EXT2_NAME_LEN;
fsid = le64_to_cpup((void *)es->s_uuid) ^
   le64_to_cpup((void *)es->s_uuid + sizeof(u64));


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATHC] Fix for ext2 reservation (Re: -mm merge plans for 2.6.23)

2007-07-10 Thread Badari Pulavarty

On Tue, 2007-07-10 at 16:55 -0700, Andrew Morton wrote:
> On Tue, 10 Jul 2007 15:15:57 -0700
> Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> 
> > Well, I looked at the problem now and here is the fix :)
> 
> whee, thanks.
> 
> > Greg, Please consider this for stable release also.
> 
> No, it is only relevant to -mm's ext2-reservations.patch.

Yes. Sorry. I was looking at -mm tree and assumed that
ext2-reservation code made into mainline also :(

Greg, please ignore the patch for stable.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATHC] Fix for ext2 reservation (Re: -mm merge plans for 2.6.23)

2007-07-10 Thread Badari Pulavarty

On Tue, 2007-07-10 at 12:39 -0700, Martin Bligh wrote:
> Andrew Morton wrote:
> > 
> > Begin forwarded message:
> > 
> > Date: Tue, 10 Jul 2007 21:49:23 +0400
> > From: Alexey Dobriyan <[EMAIL PROTECTED]>
> > To: Andrew Morton <[EMAIL PROTECTED]>
> > Cc: [EMAIL PROTECTED], linux-ext4@vger.kernel.org
> > Subject: ext2 reservations (Re: -mm merge plans for 2.6.23)
> > 
> > 
> >> ext2-reservations.patch
> >>
> >>  Still needs decent testing.
> > 
> > Was this oops silently fixed?
> > http://lkml.org/lkml/2007/3/2/138
> > 2.6.21-rc2-mm1: EIP is at ext2_discard_reservation+0x1c/0x52
> > 
> > I still have that ext2 partition backed up.
> 
> Now I'm confused - I thought there was a latent issue there, and
> then we went back and revisited it, and we decided there wasn't ;-(

Well, I looked at the problem now and here is the fix :)

Greg, Please consider this for stable release also.

Thanks,
Badari

ext2 reservation fix - Alexey Dobriyan reported ext2 discard
reservation panic while ago (http://lkml.org/lkml/2007/3/2/138).
If ext2_new_inode() fails for any reason it would end up calling 
ext2_discard_reservation() (due to last iput). Normally, it does
nothing since we don't have a reservation window structure
allocated. But the NULL pointer check wouldn't work with  slab 
poisioning, and causes oops.

Fix is to initialize i_block_alloc_info to NULL in ext2_alloc_inode()
code instead of assuming that it would be NULL. Same fix already
exists in ext3 and ext4.

Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>

 fs/ext2/super.c |1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6.22/fs/ext2/super.c
===
--- linux-2.6.22.orig/fs/ext2/super.c   2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/fs/ext2/super.c2007-07-10 16:36:42.0 -0700
@@ -147,6 +147,7 @@ static struct inode *ext2_alloc_inode(st
ei->i_acl = EXT2_ACL_NOT_CACHED;
ei->i_default_acl = EXT2_ACL_NOT_CACHED;
 #endif
+   ei->i_block_alloc_info = NULL;
ei->vfs_inode.i_version = 1;
return &ei->vfs_inode;
 }


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: fallocate support for bitmap-based files

2007-07-06 Thread Badari Pulavarty

On Fri, 2007-07-06 at 14:33 -0700, Mike Waychison wrote:
> Badari Pulavarty wrote:
> > On Sat, 2007-06-30 at 10:13 -0400, Mingming Cao wrote:
> >> On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
> >>> Guys, Mike and Sreenivasa at google are looking into implementing
> >>> fallocate() on ext2.  Of course, any such implementation could and should
> >>> also be portable to ext3 and ext4 bitmapped files.
> >>>
> >>> I believe that Sreenivasa will mainly be doing the implementation work.
> >>>
> >>>
> >>> The basic plan is as follows:
> >>>
> >>> - Create (with tune2fs and mke2fs) a hidden file using one of the
> >>>   reserved inode numbers.  That file will be sized to have one bit for 
> >>> each
> >>>   block in the partition.  Let's call this the "unwritten block file".
> >>>
> >>>   The unwritten block file will be initialised with all-zeroes
> >>>
> >>> - at fallocate()-time, allocate the blocks to the user's file (in some
> >>>   yet-to-be-determined fashion) and, for each one which is uninitialised,
> >>>   set its bit in the unwritten block file.  The set bit means "this block
> >>>   is uninitialised and needs to be zeroed out on read".
> >>>
> >>> - truncate() would need to clear out set-bits in the unwritten blocks 
> >>> file.
> >>>
> >>> - When the fs comes to read a block from disk, it will need to consult
> >>>   the unwritten blocks file to see if that block should be zeroed by the
> >>>   CPU.
> >>>
> >>> - When the unwritten-block is written to, its bit in the unwritten blocks
> >>>   file gets zeroed.
> >>>
> >>> - An obvious efficiency concern: if a user file has no unwritten blocks
> >>>   in it, we don't need to consult the unwritten blocks file.
> >>>
> >>>   Need to work out how to do this.  An obvious solution would be to have
> >>>   a number-of-unwritten-blocks counter in the inode.  But do we have space
> >>>   for that?
> >>>
> >>>   (I expect google and others would prefer that the on-disk format be
> >>>   compatible with legacy ext2!)
> >>>
> >>> - One concern is the following scenario:
> >>>
> >>>   - Mount fs with "new" kernel, fallocate() some blocks to a file.
> >>>
> >>>   - Now, mount the fs under "old" kernel (which doesn't understand the
> >>> unwritten blocks file).
> >>>
> >>> - This kernel will be able to read uninitialised data from that
> >>>   fallocated-to file, which is a security concern.
> >>>
> >>>   - Now, the "old" kernel writes some data to a fallocated block.  But
> >>> this kernel doesn't know that it needs to clear that block's flag in
> >>> the unwritten blocks file!
> >>>
> >>>   - Now mount that fs under the "new" kernel and try to read that file.
> >>>  The flag for the block is set, so this kernel will still zero out the
> >>> data on a read, thus corrupting the user's data
> >>>
> >>>   So how to fix this?  Perhaps with a per-inode flag indicating "this
> >>>   inode has unwritten blocks".  But to fix this problem, we'd require that
> >>>   the "old" kernel clear out that flag.
> >>>
> >>>   Can anyone propose a solution to this?
> >>>
> >>>   Ah, I can!  Use the compatibility flags in such a way as to prevent the
> >>>   "old" kernel from mounting this filesystem at all.  To mount this fs
> >>>   under an "old" kernel the user will need to run some tool which will
> >>>
> >>>   - read the unwritten blocks file
> >>>
> >>>   - for each set-bit in the unwritten blocks file, zero out the
> >>> corresponding block
> >>>
> >>>   - zero out the unwritten blocks file
> >>>
> >>>   - rewrite the superblock to indicate that this fs may now be mounted
> >>> by an "old" kernel.
> >>>
> >>>   Sound sane?
> >>>
> >>> - I'm assuming that there are more reserved inodes available, and that
> >>>   the changes to tune2fs and mke2fs will be basically a copy-n-paste job
&g

Re: [PATCH] ext2 statfs speed up

2007-07-05 Thread Badari Pulavarty

On Thu, 2007-07-05 at 15:06 -0600, Andreas Dilger wrote:
> On Jul 05, 2007  11:11 -0700, Badari Pulavarty wrote:
> > @@ -1131,17 +1134,22 @@ static int ext2_statfs (struct dentry * 
> > buf->f_bfree = ext2_count_free_blocks(sb);
> > +   es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
> > buf->f_ffree = ext2_count_free_inodes(sb);
> > +   es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
> 
> Hmm, this is still sub-optimal.  For ext3 and ext4 it just uses
> percpu_counter_sum() instead of the slow ext*_count_free_blocks(), which
> walks all of the groups.  Not that this is a reason to hold this patch,
> because at least we are removing 1/2 of the overhead for ext2.

Andrea,

I am wondering why we are not currently using percpu_counter_sum()
for ext2 ? I see that ext2 already has all the stuff it needs. 
Can't I just do following ?

Thanks,
Badari

 fs/ext2/super.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc7/fs/ext2/super.c
===
--- linux-2.6.22-rc7.orig/fs/ext2/super.c   2007-07-05 12:35:15.0 
-0700
+++ linux-2.6.22-rc7/fs/ext2/super.c2007-07-05 20:37:32.0 -0700
@@ -1142,13 +1142,13 @@ static int ext2_statfs (struct dentry * 
buf->f_type = EXT2_SUPER_MAGIC;
buf->f_bsize = sb->s_blocksize;
buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;
-   buf->f_bfree = ext2_count_free_blocks(sb);
+   buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
buf->f_bavail = 0;
buf->f_files = le32_to_cpu(es->s_inodes_count);
-   buf->f_ffree = ext2_count_free_inodes(sb);
+   buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
buf->f_namelen = EXT2_NAME_LEN;
fsid = le64_to_cpup((void *)es->s_uuid) ^


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ext4 statfs speed up

2007-07-05 Thread Badari Pulavarty

On Fri, 2007-07-06 at 02:44 +0800, coly li wrote:
> Badari,
> 
> How about the performance improvement, is there any benchmark result ?
> 
> Coly

The original patch came from Andrea Dilger - where they use this patch
in their luster code. Andrea, do you have perf. numbers ?

BTW, I am doing perf. analysis too. I will share the results.

Thanks,
Badari

> 
> 
> 在 2007-07-05四的 11:11 -0700，Badari Pulavarty写道：
> > This is a patch that speeds up statfs.  It is very simple - 
> > the "overhead" calculation, which takes a huge amount of time 
> > for large filesystems, never changes unless the size of the 
> > filesystem itself changes.  That means we can store it in memory 
> > and only recalculate if the filesystem has been resized (almost 
> > never).
> > 
> > It also fixes a minor problem that we never update the on-disk 
> > superblock free blocks/inodes counts until the filesystem is 
> > unmounted.  While not fatal, we may as well update that on disk 
> > when we have the information, and it makes things like debugfs 
> > and dumpe2fs report a bit more accurate info.
> > 
> > Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
> > Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> > 
> >  fs/ext4/super.c|   25 +++--
> >  include/linux/ext4_fs_sb.h |2 ++
> >  2 files changed, 17 insertions(+), 10 deletions(-)
> > 
> > Index: linux-2.6.22-rc7/include/linux/ext4_fs_sb.h
> > ===
> > --- linux-2.6.22-rc7.orig/include/linux/ext4_fs_sb.h2007-07-01 
> > 12:54:24.0 -0700
> > +++ linux-2.6.22-rc7/include/linux/ext4_fs_sb.h 2007-07-04 
> > 22:09:59.0 -0700
> > @@ -39,6 +39,8 @@ struct ext4_sb_info {
> > unsigned long s_gdb_count;  /* Number of group descriptor blocks */
> > unsigned long s_desc_per_block; /* Number of group descriptors per 
> > block */
> > unsigned long s_groups_count;   /* Number of groups in the fs */
> > +   unsigned long s_overhead_last;  /* Last calculated overhead */
> > +   unsigned long s_blocks_last;/* Last seen block count */
> > struct buffer_head * s_sbh; /* Buffer containing the super block */
> > struct ext4_super_block * s_es; /* Pointer to the super block in the 
> > buffer */
> > struct buffer_head ** s_group_desc;
> > Index: linux-2.6.22-rc7/fs/ext4/super.c
> > ===
> > --- linux-2.6.22-rc7.orig/fs/ext4/super.c   2007-07-01 12:54:24.0 
> > -0700
> > +++ linux-2.6.22-rc7/fs/ext4/super.c2007-07-04 22:12:27.0 
> > -0700
> > @@ -2481,19 +2481,19 @@ static int ext4_statfs (struct dentry * 
> > struct super_block *sb = dentry->d_sb;
> > struct ext4_sb_info *sbi = EXT4_SB(sb);
> > struct ext4_super_block *es = sbi->s_es;
> > -   ext4_fsblk_t overhead;
> > -   int i;
> > u64 fsid;
> >  
> > -   if (test_opt (sb, MINIX_DF))
> > -   overhead = 0;
> > -   else {
> > -   unsigned long ngroups;
> > -   ngroups = EXT4_SB(sb)->s_groups_count;
> > +   if (test_opt(sb, MINIX_DF)) {
> > +   sbi->s_overhead_last = 0;
> > +   } else if (sbi->s_blocks_last != le32_to_cpu(es->s_blocks_count)) {
> > +   unsigned long ngroups = sbi->s_groups_count, i;
> > +   ext4_fsblk_t overhead = 0;
> > smp_rmb();
> >  
> > /*
> > -* Compute the overhead (FS structures)
> > +* Compute the overhead (FS structures).  This is constant
> > +* for a given filesystem unless the number of block groups
> > +* changes so we cache the previous value until it does.
> >  */
> >  
> > /*
> > @@ -2517,18 +2517,23 @@ static int ext4_statfs (struct dentry * 
> >  * Every block group has an inode bitmap, a block
> >  * bitmap, and an inode table.
> >  */
> > -   overhead += (ngroups * (2 + EXT4_SB(sb)->s_itb_per_group));
> > +   overhead += ngroups * (2 + sbi->s_itb_per_group);
> > +   sbi->s_overhead_last = overhead;
> > +   smp_wmb();
> > +   sbi->s_blocks_last = le32_to_cpu(es->s_blocks_count);
> > }
> >  
> > buf->f_type = EXT4_SUPER_MAGIC;
> > buf->f_bsize = sb->s_blocksize;
> > -   buf->f_blocks = ext4_blocks_count(es) - overhead;
> >

[PATCH] ext3 statfs speed up

2007-07-05 Thread Badari Pulavarty

This is a patch that speeds up statfs.  It is very simple - 
the "overhead" calculation, which takes a huge amount of time 
for large filesystems, never changes unless the size of the 
filesystem itself changes.  That means we can store it in memory 
and only recalculate if the filesystem has been resized (almost 
never).

It also fixes a minor problem that we never update the on-disk 
superblock free blocks/inodes counts until the filesystem is 
unmounted.  While not fatal, we may as well update that on disk 
when we have the information, and it makes things like debugfs 
and dumpe2fs report a bit more accurate info.

Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>

 fs/ext3/super.c|   25 +++--
 include/linux/ext3_fs_sb.h |2 ++
 2 files changed, 17 insertions(+), 10 deletions(-)

Index: linux-2.6.22-rc7/include/linux/ext3_fs_sb.h
===
--- linux-2.6.22-rc7.orig/include/linux/ext3_fs_sb.h2007-07-01 
12:54:24.0 -0700
+++ linux-2.6.22-rc7/include/linux/ext3_fs_sb.h 2007-07-04 21:59:28.0 
-0700
@@ -38,6 +38,8 @@ struct ext3_sb_info {
unsigned long s_gdb_count;  /* Number of group descriptor blocks */
unsigned long s_desc_per_block; /* Number of group descriptors per 
block */
unsigned long s_groups_count;   /* Number of groups in the fs */
+   unsigned long s_overhead_last;  /* Last calculated overhead */
+   unsigned long s_blocks_last;/* Last seen block count */
struct buffer_head * s_sbh; /* Buffer containing the super block */
struct ext3_super_block * s_es; /* Pointer to the super block in the 
buffer */
struct buffer_head ** s_group_desc;
Index: linux-2.6.22-rc7/fs/ext3/super.c
===
--- linux-2.6.22-rc7.orig/fs/ext3/super.c   2007-07-01 12:54:24.0 
-0700
+++ linux-2.6.22-rc7/fs/ext3/super.c2007-07-04 21:59:28.0 -0700
@@ -2406,19 +2406,19 @@ static int ext3_statfs (struct dentry * 
struct super_block *sb = dentry->d_sb;
struct ext3_sb_info *sbi = EXT3_SB(sb);
struct ext3_super_block *es = sbi->s_es;
-   ext3_fsblk_t overhead;
-   int i;
u64 fsid;
 
-   if (test_opt (sb, MINIX_DF))
-   overhead = 0;
-   else {
-   unsigned long ngroups;
-   ngroups = EXT3_SB(sb)->s_groups_count;
+   if (test_opt(sb, MINIX_DF)) {
+   sbi->s_overhead_last = 0;
+   } else if (sbi->s_blocks_last != le32_to_cpu(es->s_blocks_count)) {
+   unsigned long ngroups = sbi->s_groups_count, i;
+   ext3_fsblk_t overhead = 0;
smp_rmb();
 
/*
-* Compute the overhead (FS structures)
+* Compute the overhead (FS structures).  This is constant
+* for a given filesystem unless the number of block groups
+* changes so we cache the previous value until it does.
 */
 
/*
@@ -2442,18 +2442,23 @@ static int ext3_statfs (struct dentry * 
 * Every block group has an inode bitmap, a block
 * bitmap, and an inode table.
 */
-   overhead += (ngroups * (2 + EXT3_SB(sb)->s_itb_per_group));
+   overhead += ngroups * (2 + sbi->s_itb_per_group);
+   sbi->s_overhead_last = overhead;
+   smp_wmb();
+   sbi->s_blocks_last = le32_to_cpu(es->s_blocks_count);
}
 
buf->f_type = EXT3_SUPER_MAGIC;
buf->f_bsize = sb->s_blocksize;
-   buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead;
+   buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;
buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
+   es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
buf->f_bavail = 0;
buf->f_files = le32_to_cpu(es->s_inodes_count);
buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
+   es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
buf->f_namelen = EXT3_NAME_LEN;
fsid = le64_to_cpup((void *)es->s_uuid) ^
   le64_to_cpup((void *)es->s_uuid + sizeof(u64));


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] ext2 statfs speed up

2007-07-05 Thread Badari Pulavarty

This is a patch that speeds up statfs.  It is very simple - 
the "overhead" calculation, which takes a huge amount of time 
for large filesystems, never changes unless the size of the 
filesystem itself changes.  That means we can store it in memory 
and only recalculate if the filesystem has been resized (almost 
never).

It also fixes a minor problem that we never update the on-disk 
superblock free blocks/inodes counts until the filesystem is 
unmounted.  While not fatal, we may as well update that on disk 
when we have the information, and it makes things like debugfs 
and dumpe2fs report a bit more accurate info.

Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>

 fs/ext2/super.c|   20 ++--
 include/linux/ext2_fs_sb.h |2 ++
 2 files changed, 16 insertions(+), 6 deletions(-)

Index: linux-2.6.22-rc7/include/linux/ext2_fs_sb.h
===
--- linux-2.6.22-rc7.orig/include/linux/ext2_fs_sb.h2007-07-01 
12:54:24.0 -0700
+++ linux-2.6.22-rc7/include/linux/ext2_fs_sb.h 2007-07-04 22:02:01.0 
-0700
@@ -33,6 +33,8 @@ struct ext2_sb_info {
unsigned long s_gdb_count;  /* Number of group descriptor blocks */
unsigned long s_desc_per_block; /* Number of group descriptors per 
block */
unsigned long s_groups_count;   /* Number of groups in the fs */
+   unsigned long s_overhead_last;  /* Last calculated overhead */
+   unsigned long s_blocks_last;/* Last seen block count */
struct buffer_head * s_sbh; /* Buffer containing the super block */
struct ext2_super_block * s_es; /* Pointer to the super block in the 
buffer */
struct buffer_head ** s_group_desc;
Index: linux-2.6.22-rc7/fs/ext2/super.c
===
--- linux-2.6.22-rc7.orig/fs/ext2/super.c   2007-07-01 12:54:24.0 
-0700
+++ linux-2.6.22-rc7/fs/ext2/super.c2007-07-05 12:29:03.0 -0700
@@ -1099,15 +1099,18 @@ static int ext2_statfs (struct dentry * 
struct super_block *sb = dentry->d_sb;
struct ext2_sb_info *sbi = EXT2_SB(sb);
struct ext2_super_block *es = sbi->s_es;
-   unsigned long overhead;
-   int i;
u64 fsid;
 
if (test_opt (sb, MINIX_DF))
-   overhead = 0;
-   else {
+   sbi->s_overhead_last = 0;
+   else if (sbi->s_blocks_last != le32_to_cpu(es->s_blocks_count)) {
+   unsigned long i, overhead = 0;
+   smp_rmb();
+
/*
-* Compute the overhead (FS structures)
+* Compute the overhead (FS structures). This is constant
+* for a given filesystem unless the number of block groups
+* changes so we cache the previous value until it does.
 */
 
/*
@@ -1131,17 +1134,22 @@ static int ext2_statfs (struct dentry * 
 */
overhead += (sbi->s_groups_count *
 (2 + sbi->s_itb_per_group));
+   sbi->s_overhead_last = overhead;
+   smp_wmb();
+   sbi->s_blocks_last = le32_to_cpu(es->s_blocks_count);
}
 
buf->f_type = EXT2_SUPER_MAGIC;
buf->f_bsize = sb->s_blocksize;
-   buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead;
+   buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;
buf->f_bfree = ext2_count_free_blocks(sb);
+   es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
buf->f_bavail = 0;
buf->f_files = le32_to_cpu(es->s_inodes_count);
buf->f_ffree = ext2_count_free_inodes(sb);
+   es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
buf->f_namelen = EXT2_NAME_LEN;
fsid = le64_to_cpup((void *)es->s_uuid) ^
   le64_to_cpup((void *)es->s_uuid + sizeof(u64));


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] ext4 statfs speed up

2007-07-05 Thread Badari Pulavarty

This is a patch that speeds up statfs.  It is very simple - 
the "overhead" calculation, which takes a huge amount of time 
for large filesystems, never changes unless the size of the 
filesystem itself changes.  That means we can store it in memory 
and only recalculate if the filesystem has been resized (almost 
never).

It also fixes a minor problem that we never update the on-disk 
superblock free blocks/inodes counts until the filesystem is 
unmounted.  While not fatal, we may as well update that on disk 
when we have the information, and it makes things like debugfs 
and dumpe2fs report a bit more accurate info.

Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>

 fs/ext4/super.c|   25 +++--
 include/linux/ext4_fs_sb.h |2 ++
 2 files changed, 17 insertions(+), 10 deletions(-)

Index: linux-2.6.22-rc7/include/linux/ext4_fs_sb.h
===
--- linux-2.6.22-rc7.orig/include/linux/ext4_fs_sb.h2007-07-01 
12:54:24.0 -0700
+++ linux-2.6.22-rc7/include/linux/ext4_fs_sb.h 2007-07-04 22:09:59.0 
-0700
@@ -39,6 +39,8 @@ struct ext4_sb_info {
unsigned long s_gdb_count;  /* Number of group descriptor blocks */
unsigned long s_desc_per_block; /* Number of group descriptors per 
block */
unsigned long s_groups_count;   /* Number of groups in the fs */
+   unsigned long s_overhead_last;  /* Last calculated overhead */
+   unsigned long s_blocks_last;/* Last seen block count */
struct buffer_head * s_sbh; /* Buffer containing the super block */
struct ext4_super_block * s_es; /* Pointer to the super block in the 
buffer */
struct buffer_head ** s_group_desc;
Index: linux-2.6.22-rc7/fs/ext4/super.c
===
--- linux-2.6.22-rc7.orig/fs/ext4/super.c   2007-07-01 12:54:24.0 
-0700
+++ linux-2.6.22-rc7/fs/ext4/super.c2007-07-04 22:12:27.0 -0700
@@ -2481,19 +2481,19 @@ static int ext4_statfs (struct dentry * 
struct super_block *sb = dentry->d_sb;
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_super_block *es = sbi->s_es;
-   ext4_fsblk_t overhead;
-   int i;
u64 fsid;
 
-   if (test_opt (sb, MINIX_DF))
-   overhead = 0;
-   else {
-   unsigned long ngroups;
-   ngroups = EXT4_SB(sb)->s_groups_count;
+   if (test_opt(sb, MINIX_DF)) {
+   sbi->s_overhead_last = 0;
+   } else if (sbi->s_blocks_last != le32_to_cpu(es->s_blocks_count)) {
+   unsigned long ngroups = sbi->s_groups_count, i;
+   ext4_fsblk_t overhead = 0;
smp_rmb();
 
/*
-* Compute the overhead (FS structures)
+* Compute the overhead (FS structures).  This is constant
+* for a given filesystem unless the number of block groups
+* changes so we cache the previous value until it does.
 */
 
/*
@@ -2517,18 +2517,23 @@ static int ext4_statfs (struct dentry * 
 * Every block group has an inode bitmap, a block
 * bitmap, and an inode table.
 */
-   overhead += (ngroups * (2 + EXT4_SB(sb)->s_itb_per_group));
+   overhead += ngroups * (2 + sbi->s_itb_per_group);
+   sbi->s_overhead_last = overhead;
+   smp_wmb();
+   sbi->s_blocks_last = le32_to_cpu(es->s_blocks_count);
}
 
buf->f_type = EXT4_SUPER_MAGIC;
buf->f_bsize = sb->s_blocksize;
-   buf->f_blocks = ext4_blocks_count(es) - overhead;
+   buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last;
buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
+   es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
if (buf->f_bfree < ext4_r_blocks_count(es))
buf->f_bavail = 0;
buf->f_files = le32_to_cpu(es->s_inodes_count);
buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
+   es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
buf->f_namelen = EXT4_NAME_LEN;
fsid = le64_to_cpup((void *)es->s_uuid) ^
   le64_to_cpup((void *)es->s_uuid + sizeof(u64));


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] speed up statfs

2007-07-04 Thread Badari Pulavarty

On Wed, 2007-07-04 at 16:23 -0600, Andreas Dilger wrote:
> Hi,
> this is a patch we are currently using that speeds up statfs.
> It is very simple - the "overhead" calculation, which takes a
> huge amount of time for large filesystems, never changes unless
> the size of the filesystem itself changes.  That means we can
> store it in memory and only recalculate if the filesystem has
> been resized (almost never).
> 
> It is based on ext3 but is trivally updated to ext4.  It also
> fixes a minor problem that we never update the on-disk superblock
> free blocks/inodes counts until the filesystem is unmounted.
> While not fatal, we may as well update that on disk when we have
> the information, and it makes things like debugfs and dumpe2fs
> report a bit more accurate info.
> 
> I'd be happy if someone could update this to the latest kernel and
> for ext2 and ext4 also.
> 

I forward ported it to latest + did the same for ext2 and ext4 also.
Compiles fine. I will post them tomorrow after little testing.

Thanks,
Badaru

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: fallocate support for bitmap-based files

2007-07-02 Thread Badari Pulavarty

On Sat, 2007-06-30 at 10:13 -0400, Mingming Cao wrote:
> On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
> > Guys, Mike and Sreenivasa at google are looking into implementing
> > fallocate() on ext2.  Of course, any such implementation could and should
> > also be portable to ext3 and ext4 bitmapped files.
> > 
> > I believe that Sreenivasa will mainly be doing the implementation work.
> > 
> > 
> > The basic plan is as follows:
> > 
> > - Create (with tune2fs and mke2fs) a hidden file using one of the
> >   reserved inode numbers.  That file will be sized to have one bit for each
> >   block in the partition.  Let's call this the "unwritten block file".
> > 
> >   The unwritten block file will be initialised with all-zeroes
> > 
> > - at fallocate()-time, allocate the blocks to the user's file (in some
> >   yet-to-be-determined fashion) and, for each one which is uninitialised,
> >   set its bit in the unwritten block file.  The set bit means "this block
> >   is uninitialised and needs to be zeroed out on read".
> > 
> > - truncate() would need to clear out set-bits in the unwritten blocks file.
> > 
> > - When the fs comes to read a block from disk, it will need to consult
> >   the unwritten blocks file to see if that block should be zeroed by the
> >   CPU.
> > 
> > - When the unwritten-block is written to, its bit in the unwritten blocks
> >   file gets zeroed.
> > 
> > - An obvious efficiency concern: if a user file has no unwritten blocks
> >   in it, we don't need to consult the unwritten blocks file.
> > 
> >   Need to work out how to do this.  An obvious solution would be to have
> >   a number-of-unwritten-blocks counter in the inode.  But do we have space
> >   for that?
> > 
> >   (I expect google and others would prefer that the on-disk format be
> >   compatible with legacy ext2!)
> > 
> > - One concern is the following scenario:
> > 
> >   - Mount fs with "new" kernel, fallocate() some blocks to a file.
> > 
> >   - Now, mount the fs under "old" kernel (which doesn't understand the
> > unwritten blocks file).
> > 
> > - This kernel will be able to read uninitialised data from that
> >   fallocated-to file, which is a security concern.
> > 
> >   - Now, the "old" kernel writes some data to a fallocated block.  But
> > this kernel doesn't know that it needs to clear that block's flag in
> > the unwritten blocks file!
> > 
> >   - Now mount that fs under the "new" kernel and try to read that file.
> >  The flag for the block is set, so this kernel will still zero out the
> > data on a read, thus corrupting the user's data
> > 
> >   So how to fix this?  Perhaps with a per-inode flag indicating "this
> >   inode has unwritten blocks".  But to fix this problem, we'd require that
> >   the "old" kernel clear out that flag.
> > 
> >   Can anyone propose a solution to this?
> > 
> >   Ah, I can!  Use the compatibility flags in such a way as to prevent the
> >   "old" kernel from mounting this filesystem at all.  To mount this fs
> >   under an "old" kernel the user will need to run some tool which will
> > 
> >   - read the unwritten blocks file
> > 
> >   - for each set-bit in the unwritten blocks file, zero out the
> > corresponding block
> > 
> >   - zero out the unwritten blocks file
> > 
> >   - rewrite the superblock to indicate that this fs may now be mounted
> > by an "old" kernel.
> > 
> >   Sound sane?
> > 
> > - I'm assuming that there are more reserved inodes available, and that
> >   the changes to tune2fs and mke2fs will be basically a copy-n-paste job
> >   from the `tune2fs -j' code.  Correct?
> > 
> > - I haven't thought about what fsck changes would be needed.
> > 
> >   Presumably quite a few.  For example, fsck should check that set-bits
> >   in the unwriten blobks file do not correspond to freed blocks.  If they
> >   do, that should be fixed up.
> > 
> >   And fsck can check each inodes number-of-unwritten-blocks counters
> >   against the unwritten blocks file (if we implement the per-inode
> >   number-of-unwritten-blocks counter)
> > 
> >   What else should fsck do?
> > 
> > - I haven't thought about the implications of porting this into ext3/4. 
> >   Probably the commit to the unwritten blocks file will need to be atomic
> >   with the commit to the user's file's metadata, so the unwritten-blocks
> >   file will effectively need to be in journalled-data mode.
> > 
> >   Or, more likely, we access the unwritten blocks file via the blockdev
> >   pagecache (ie: use bmap, like the journal file) and then we're just
> >   talking direct to the disk's blocks and it becomes just more fs metadata.
> > 
> > - I guess resize2fs will need to be taught about the unwritten blocks
> >   file: to shrink and grow it appropriately.
> > 
> > 
> > That's all I can think of for now - I probably missed something. 
> > 
> > Suggestions and thought are sought, please.
> > 
> > 
> 
> Another approach we have been thinking  is using a backing
> inode

Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree

2007-06-13 Thread Badari Pulavarty

On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
..
>  
> > 5) ext3_write_end:
> > Before  write_begin/write_end patch set we have folowing locking
> > order:
> > stop_journal(handle);
> > unlock_page(page);
> > But now order is oposite:
> > unlock_page(page);
> > stop_journal(handle);
> > Can we got any race condition now? I'm not sure is it actual problem,
> > may be somebody cant describe this.
> 
> Can we just change it to the original order? That would seem to be
> safest unless one of the ext3 devs explicitly acks it.

It would be nice to go back to original order, but its not that
simple with current structure of the code. With Nick's patches
unlock_page() happens in generic_write_end(). journal_stop() 
needs to happen after generic_write_end(). :(

Mingming, can you take a look at the current & proposed order ?
I ran into bunch of races when I tried to change the order for
->writepages() support earlier :(

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Fix section conflict of ext4_ext_{find_goal,invalidate_cache}

2007-05-15 Thread Badari Pulavarty

On Tue, 2007-05-15 at 19:15 +0200, Martin Michlmayr wrote:
> Building with GCC 4.2, I get the following error:
> 
>   CC [M]  fs/ext4/extents.o
> fs/ext4/extents.c:2166: error: __ksymtab_ext4_ext_find_goal causes a section 
> type conflict
> fs/ext4/extents.c:2163: error: __ksymtab_ext4_ext_invalidate_cache causes a 
> section type conflict
> 
> This is because ext4_ext_find_goal and ext4_ext_invalidate_cache are
> declared static but also exported.

Hmm.. Why are these exported ?

Looking at the code


EXPORT_SYMBOL(ext4_ext_invalidate_cache);
EXPORT_SYMBOL(ext4_ext_insert_extent);
EXPORT_SYMBOL(ext4_ext_walk_space);
EXPORT_SYMBOL(ext4_ext_find_goal);
EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert);

Mingming ? Why are we exporting these ?

Thanks,
Badari


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: which kernel version should I submit patch ?

2007-05-02 Thread Badari Pulavarty

On Thu, 2007-05-03 at 02:13 +0800, coly wrote:
> Hi, list.
> 
> I find these days 2.6.21 released. If I will submit patches for ext4dev,
> should I generate the patch against 2.6.21 ?


If you find a bug in ext4 on 2.6.21 - yes, generate against 2.6.21.

Make sure its not already fixed in -mm or Ted's ext4 tree.

http://www2.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Badari Pulavarty

On Fri, 2007-03-02 at 09:16 -0600, Eric Sandeen wrote:
> Badari Pulavarty wrote:
> > 
> > Amit K. Arora wrote:
> > 
> >> This is to give a heads up on few patches that we will be soon coming up
> >> with. These patches implement a new system call sys_fallocate() and a
> >> new inode operation "fallocate", for persistent preallocation. The new
> >> system call, as Andrew suggested, will look like:
> >>
> >>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> >>
> > I am wondering about return values from this syscall ? Is it supposed to 
> > return the
> > number of bytes allocated ? What about partial allocations ? 
> 
> If you don't have enough blocks to cover the request, you should 
> probably just return -ENOSPC, not a partial allocation.

That could be challenging, when multiple writers are working in
parallel. You may not be able to return -ENOSPC, till you fail the
allocation (for filesystems which alllocates a block at a time).

> 
> > What about 
> > if the
> > blocks already exists ? What would be return values in those cases ?
> 
> 0 on success, other normal errors oetherwise..
> 
> If asked for a range that includes already-allocated blocks, you just 
> allocate any non-allocated blocks in the range, I think.

Yes. What I was trying to figure out is, if there is a requirement that
interface need to return exact number of bytes it *really* allocated
(like write() or read()). I can't think of any, but just wanted to
through it out..

BTW, what is the interface for finding out what is the size of the
pre-allocated file ? 

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Heads up on sys_fallocate()

2007-03-01 Thread Badari Pulavarty



Amit K. Arora wrote:


This is to give a heads up on few patches that we will be soon coming up
with. These patches implement a new system call sys_fallocate() and a
new inode operation "fallocate", for persistent preallocation. The new
system call, as Andrew suggested, will look like:

 asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

I am wondering about return values from this syscall ? Is it supposed to 
return the
number of bytes allocated ? What about partial allocations ? What about 
if the

blocks already exists ? What would be return values in those cases ?

Just curious .. What does posix_fallocate() return ?

Thanks,
Badari







-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 15/23] ext3 -nobh option causes oops

2006-11-16 Thread Badari Pulavarty


On Thu, 2006-11-16 at 23:51 +0100, Adrian Bunk wrote:
> It seems this patch that went into 2.6.17.8 should also be included in 
> the 2.6.16.x branch, or do I miss anything?
> 

Yes. This is needed for 2.6.16.x branch also.

Thanks,
Badari


> On Thu, Aug 03, 2006 at 10:40:03PM -0700, Greg KH wrote:
> > -stable review patch.  If anyone has any objections, please let us know.
> > 
> > ------
> > From: Badari Pulavarty <[EMAIL PROTECTED]>
> > 
> > For files other than IFREG, nobh option doesn't make sense.  Modifications
> > to them are journalled and needs buffer heads to do that.  Without this
> > patch, we get kernel oops in page_buffers().
> > 
> > Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
> > Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> > Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]>
> > 
> > ---
> >  fs/ext3/inode.c |6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > --- linux-2.6.17.7.orig/fs/ext3/inode.c
> > +++ linux-2.6.17.7/fs/ext3/inode.c
> > @@ -1159,7 +1159,7 @@ retry:
> > ret = PTR_ERR(handle);
> > goto out;
> > }
> > -   if (test_opt(inode->i_sb, NOBH))
> > +   if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
> > ret = nobh_prepare_write(page, from, to, ext3_get_block);
> > else
> > ret = block_prepare_write(page, from, to, ext3_get_block);
> > @@ -1245,7 +1245,7 @@ static int ext3_writeback_commit_write(s
> > if (new_i_size > EXT3_I(inode)->i_disksize)
> > EXT3_I(inode)->i_disksize = new_i_size;
> >  
> > -   if (test_opt(inode->i_sb, NOBH))
> > +   if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
> > ret = nobh_commit_write(file, page, from, to);
> > else
> > ret = generic_commit_write(file, page, from, to);
> > @@ -1495,7 +1495,7 @@ static int ext3_writeback_writepage(stru
> > goto out_fail;
> > }
> >  
> > -   if (test_opt(inode->i_sb, NOBH))
> > +   if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
> > ret = nobh_writepage(page, ext3_get_block, wbc);
> > else
> > ret = block_write_full_page(page, ext3_get_block, wbc);
> > 

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: pagefault in generic_file_buffered_write() causing deadlock

2006-11-15 Thread Badari Pulavarty


Andrew Morton wrote:

On Wed, 15 Nov 2006 07:57:45 -0800
Badari Pulavarty <[EMAIL PROTECTED]> wrote:

  

We are looking at a customer situation (on 2.6.16-based distro) - where
system becomes almost useless while running some java & stress tests.

Root cause seems to be taking a pagefault in generic_file_buffered_write
() after calling prepare_write. I am wondering 


1) Why & How this can happen - since we made sure to fault the user
buffer before prepare write.



When using writev() we only fault in the first segment of the iovec.  If
the second or succesive segment isn't mapped into pagetables we're
vulnerable to the deadlock.
  


Yes. I remember this change. Thank you.
  

2) If this is already fixed in current mainline (I can't see how).



It was fixed in 2.6.17.

You'll need 6527c2bdf1f833cc18e8f42bd97973d583e4aa83 and
81b0c8713385ce1b1b9058e916edcf9561ad76d6
  

I will try to get this change into customer :(

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: pagefault in generic_file_buffered_write() causing deadlock

2006-11-15 Thread Badari Pulavarty


Andrew Morton wrote:

On Wed, 15 Nov 2006 07:57:45 -0800
Badari Pulavarty <[EMAIL PROTECTED]> wrote:

  

We are looking at a customer situation (on 2.6.16-based distro) - where
system becomes almost useless while running some java & stress tests.

Root cause seems to be taking a pagefault in generic_file_buffered_write
() after calling prepare_write. I am wondering 


1) Why & How this can happen - since we made sure to fault the user
buffer before prepare write.



When using writev() we only fault in the first segment of the iovec.  If
the second or succesive segment isn't mapped into pagetables we're
vulnerable to the deadlock.

  

Hmm.. Not it :(
Its coming from write() not writev().

[C0002ABBF290] [C039D58C] .do_page_fault+0x2e4/0x75c
[C0002ABBF460] [C0004860] .handle_page_fault+0x20/0x54
--- Exception: 301 at .__copy_tofrom_user+0x11c/0x580
   LR = .generic_file_buffered_write+0x39c/0x7c8
[C0002ABBF750] [C0095A94]
.generic_file_buffered_write+0x2c0/0x7c8 (
unreliable)
[C0002ABBF8F0] [C00962EC]
.__generic_file_aio_write_nolock+0x350/0x3
e0
[C0002ABBFA20] [C0096908] .generic_file_aio_write+0x78/0x104
[C0002ABBFAE0] [C01649F0] .ext3_file_write+0x2c/0xd4
[C0002ABBFB70] [C00C5168] .do_sync_write+0xd4/0x130
[C0002ABBFCF0] [C00C5ED4] .vfs_write+0x128/0x20c
[C0002ABBFD90] [C00C664C] .sys_write+0x4c/0x8c
[C0002ABBFE30] [C000871C] syscall_exit+0x0/0x40

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

pagefault in generic_file_buffered_write() causing deadlock

2006-11-15 Thread Badari Pulavarty

Hi Andrew & MM experts,

We are looking at a customer situation (on 2.6.16-based distro) - where
system becomes almost useless while running some java & stress tests.

Root cause seems to be taking a pagefault in generic_file_buffered_write
() after calling prepare_write. I am wondering 

1) Why & How this can happen - since we made sure to fault the user
buffer before prepare write.

2) If this is already fixed in current mainline (I can't see how).

Ideas on what I can do to fix it ?

Thanks,
Badari

Here is the analysis & stacks:
===

Java thread doing mmap() holding for mmap_sem and waiting for
transaction to be unlocked:

java  D 0fed3ff4  7104  2447   2391  2448  2446
(NOTLB)
Call Trace:
[C0002AC8F410] [C1315AC0] 0xc1315ac0 (unreliable)
[C0002AC8F5E0] [C000F0B4] .__switch_to+0x12c/0x150
[C0002AC8F670] [C039980C] .schedule+0xcec/0xe4c
[C0002AC8F780] [C017BC24] .start_this_handle+0x3b4/0x4ac
[C0002AC8F8A0] [C017BE08] .journal_start+0xec/0x140
[C0002AC8F940] [C0171374] .ext3_journal_start_sb+0x58/0x78
[C0002AC8F9C0] [C016AB90] .ext3_dirty_inode+0x38/0xb0
[C0002AC8FA50] [C00F6820] .__mark_inode_dirty+0x60/0x1d4
[C0002AC8FAF0] [C00E9F60] .touch_atime+0xc8/0xe0
[C0002AC8FB80] [C0093834] .generic_file_mmap+0x54/0x80
[C0002AC8FC00] [C00AC450] .do_mmap_pgoff+0x558/0x870
[C0002AC8FD10] [C000A9C0] .sys_mmap+0xdc/0x160
[C0002AC8FDC0] [C0014258] .compat_sys_mmap2+0x14/0x28
[C0002AC8FE30] [C000871C] syscall_exit+0x0/0x40


kjournald locked the transaction and waiting for journal stop
(t_updates to go to zero):

kjournald D   8704  2167  1  2203  2028
(L-TLB)
Call Trace:
[C0003514F980] [C05257D8] amd74xx_pci_tbl+0x8/0x200 (unreliable)
[C0003514FB50] [C000F0B4] .__switch_to+0x12c/0x150
[C0003514FBE0] [C039980C] .schedule+0xcec/0xe4c
[C0003514FCF0] [C017DA58] .journal_commit_transaction+0x190/0x1448
[C0003514FE50] [C0182F44] .kjournald+0xf0/0x27c
[C0003514FF90] [C0025630] .kernel_thread+0x4c/0x68

Another java thread, did journal_start() in prepare_write() and
took a pagefault while copying. Now this is waiting for mmap_sem
to finish the fault :(

java  D 0ffd76f0  6384  2452   2391  2453  2451
(NOTLB)
Call Trace:
[C0002ABBEE50] [C0002ABBEEE0] 0xc0002abbeee0 (unreliable)
[C0002ABBF020] [C000F0B4] .__switch_to+0x12c/0x150
[C0002ABBF0B0] [C039980C] .schedule+0xcec/0xe4c
[C0002ABBF1C0] [C039B688] .rwsem_down_read_failed
+0x284/0x2d0
[C0002ABBF290] [C039D58C] .do_page_fault+0x2e4/0x75c
[C0002ABBF460] [C0004860] .handle_page_fault+0x20/0x54
--- Exception: 301 at .__copy_tofrom_user+0x11c/0x580
LR = .generic_file_buffered_write+0x39c/0x7c8
[C0002ABBF750] [C0095A94]
.generic_file_buffered_write+0x2c0/0x7c8 (
unreliable)
[C0002ABBF8F0] [C00962EC]
.__generic_file_aio_write_nolock+0x350/0x3
e0
[C0002ABBFA20] [C0096908] .generic_file_aio_write+0x78/0x104
[C0002ABBFAE0] [C01649F0] .ext3_file_write+0x2c/0xd4
[C0002ABBFB70] [C00C5168] .do_sync_write+0xd4/0x130
[C0002ABBFCF0] [C00C5ED4] .vfs_write+0x128/0x20c
[C0002ABBFD90] [C00C664C] .sys_write+0x4c/0x8c
[C0002ABBFE30] [C000871C] syscall_exit+0x0/0x40


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.19-rc2-mm1 warning in invalidate_inode_pages2_range()

2006-10-19 Thread Badari Pulavarty

On Thu, 2006-10-19 at 16:02 -0700, Andrew Morton wrote:
> On Thu, 19 Oct 2006 15:39:06 -0700
> Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> 
> > Hi Zach,
> > 
> > While running IO tests I get following messages on 2.6.19-rc2-mm1
> > 
> > BUG: warning at mm/truncate.c:400/invalidate_inode_pages2_range()
> > 
> > Call Trace:
> >  [] show_trace+0x41/0x70
> >  [] dump_stack+0x12/0x20
> >  [] invalidate_inode_pages2_range+0x297/0x2e0
> >  [] generic_file_direct_IO+0xf5/0x130
> >  [] generic_file_direct_write+0x74/0x140
> >  [] __generic_file_aio_write_nolock+0x36c/0x4b0
> >  [] generic_file_aio_write+0x67/0xd0
> >  [] :ext4dev:ext4_file_write+0x23/0xc0
> > DWARF2 unwinder stuck at ext4_file_write+0x23/0xc0 [ext4dev]
> > Leftover inexact backtrace:
> >  [] do_sync_write+0xcf/0x120
> >  [] cp_new_stat+0xe7/0x100
> >  [] autoremove_wake_function+0x0/0x30
> >  [] __mutex_lock_slowpath+0x1df/0x1f0
> >  [] vfs_write+0xbd/0x180
> >  [] sys_write+0x53/0x90
> >  [] system_call+0x7e/0x83
> > 
> 
> that's warning that we weren't able to invalidate some pagecache.  That's
> not really suprising.  Perhaps we should be more careful in deciding when
> to fail the call (ie: leaving behind a clean page is ok, leaving behind a
> dirty page is bad).
> 
> Probably it's leaving behind dirty pagecache and you're about to lose your
> data.  Which I bet is your own darn fault for doing silly things.
> 
> What workload was in use?

Running fsx and fsstress.

- Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

2.6.19-rc2-mm1 warning in invalidate_inode_pages2_range()

2006-10-19 Thread Badari Pulavarty

Hi Zach,

While running IO tests I get following messages on 2.6.19-rc2-mm1

BUG: warning at mm/truncate.c:400/invalidate_inode_pages2_range()

Call Trace:
 [] show_trace+0x41/0x70
 [] dump_stack+0x12/0x20
 [] invalidate_inode_pages2_range+0x297/0x2e0
 [] generic_file_direct_IO+0xf5/0x130
 [] generic_file_direct_write+0x74/0x140
 [] __generic_file_aio_write_nolock+0x36c/0x4b0
 [] generic_file_aio_write+0x67/0xd0
 [] :ext4dev:ext4_file_write+0x23/0xc0
DWARF2 unwinder stuck at ext4_file_write+0x23/0xc0 [ext4dev]
Leftover inexact backtrace:
 [] do_sync_write+0xcf/0x120
 [] cp_new_stat+0xe7/0x100
 [] autoremove_wake_function+0x0/0x30
 [] __mutex_lock_slowpath+0x1df/0x1f0
 [] vfs_write+0xbd/0x180
 [] sys_write+0x53/0x90
 [] system_call+0x7e/0x83


FYI.

Thaks
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.19-rc1-mm1 (ext4 problem ?)

2006-10-11 Thread Badari Pulavarty

On Wed, 2006-10-11 at 09:56 -0700, Andrew Morton wrote:
> On Wed, 11 Oct 2006 08:02:14 -0700
> Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, 2006-10-10 at 21:01 -0700, Andrew Morton wrote: 
> > > 
> > > > Andrew Morton wrote:
> > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc1/2.6.19-rc1-mm1/
> > > > >
> > > > >
> > > > > - Added the ext4 filesystem.  Quick usage instructions:
> > > > >
> > > > >   - Grab updated e2fsprogs from
> > > > > 
> > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
> > > > >   
> > 
> > ext4 did not survive my simple stress tests. 
> > 
> > I ran 4 copies of fsx on ext4dev filesystem (without extents) +
> > 4 copies of fsx on ext4dev (with extents). 
> > 
> > Machine hung after running for few hours. There are 4 fsx sigsegv
> > messages on the console and the last message on the console is
> > 
> > do_IRQ: 0.62 No irq handler for vector
> > 
> 
> Quite a few people are hitting that - it's related to Eric's recent
> IRQ/APIC-routing changes.
> 
> I don't know why that would cause fsx to get a sigsegv though.

I don't think they are related. Hopefully once we figure out IRQ
problem, we can try to track down fsx problems.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ext3 fsx failures on 2.6.19-rc1

2006-10-10 Thread Badari Pulavarty

On Tue, 2006-10-10 at 14:30 +0200, Jan Kara wrote:
>   Hi,
> 
> > I am having fsx failures on 2.6.19-rc1.
>   :(
> 
> > I don't have any useful information at this time to track it down.
> > I am running 4 copies of fsx (+ fsstress) on a 1k filesystem and
> > one copy of fsx dies.
>   How long does it take? 

Random. It happens any where from 1 hours to 6 hours.

> 
> > fsx-linux[20667]: segfault at  rip 2af0fe031690 rsp
> > 7fffacc03b88 error 4
> > 
> > READ BAD DATA: offset = 0xa352, size = 0x5fef
> > OFFSET  GOODBAD RANGE
> > 0x df90 0x48e4  0x  0x   70
> > operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
>   Hmm, so fsx-linux wrote something and read back zeros. Strange. Do you
> know what that 'segfault' message means? I cannot find it in my copy of
> fsx-linux...
> 
>   
fsx got a segmentation fault and died. These are the messages in the
dmesg.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

ext3 fsx failures on 2.6.19-rc1

2006-10-09 Thread Badari Pulavarty

Hi Andrew,

I am having fsx failures on 2.6.19-rc1.

I don't have any useful information at this time to track it down.
I am running 4 copies of fsx (+ fsstress) on a 1k filesystem and
one copy of fsx dies.

FYI.

Thanks,
Badari


fsx-linux[20667]: segfault at  rip 2af0fe031690 rsp
7fffacc03b88 error 4

READ BAD DATA: offset = 0xa352, size = 0x5fef
OFFSET  GOODBAD RANGE
0x df90 0x48e4  0x  0x   70
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
LOG DUMP (71764 total operations):
71765(85 mod 256): TRUNCATE DOWNfrom 0x2464c to 0x362a
**
71766(86 mod 256): READ 0x1b16 thru 0x3629  (0x1b14 bytes)
71767(87 mod 256): MAPWRITE 0x3ca1b thru 0x3  (0x35e5 bytes)
71768(88 mod 256): MAPREAD  0x20f08 thru 0x2d401  (0xc4fa bytes)
71769(89 mod 256): MAPREAD  0x84d5 thru 0x16a99   (0xe5c5 bytes)
******
71770(90 mod 256): WRITE0x22dfb thru 0x26da3  (0x3fa9 bytes)
71771(91 mod 256): WRITE0x11326 thru 0x19589  (0x8264 bytes)
71772(92 mod 256): READ 0x8db3 thru 0xc813  (0x3a61 bytes)
71773(93 mod 256): WRITE0x33aff thru 0x3a725  (0x6c27 bytes)
71774(94 mod 256): TRUNCATE DOWNfrom 0x4 to 0x25e02
71775(95 mod 256): MAPWRITE 0x7f4 thru 0xffc0   (0xf7cd bytes)
**
71776(96 mod 256): WRITE0xcfa8 thru 0xdba9(0xc02 bytes)
71777(97 mod 256): READ 0x233e9 thru 0x25e01(0x2a19 bytes)
71778(98 mod 256): TRUNCATE DOWNfrom 0x25e02 to 0x1ea56
71779(99 mod 256): MAPWRITE 0x2b905 thru 0x3aaba  (0xf1b6 bytes)
71780(100 mod 256): READ0x32e45 thru 0x3aaba  (0x7c76 bytes)
71781(101 mod 256): READ0x229ae thru 0x31eaf  (0xf502 bytes)
71782(102 mod 256): TRUNCATE UP from 0x3aabb to 0x3b640
71783(103 mod 256): MAPREAD 0x2d2f4 thru 0x32a6d  (0x577a bytes)
71784(104 mod 256): MAPWRITE 0x38d96 thru 0x3c980 (0x3beb bytes)
71785(105 mod 256): WRITE   0x964d thru 0x9a7e(0x432 bytes)
71786(106 mod 256): TRUNCATE DOWN   from 0x3c981 to 0x2de7
**
71787(107 mod 256): WRITE   0x5134 thru 0x81e4(0x30b1 bytes)
HOLE
71788(108 mod 256): MAPWRITE 0xb02f thru 0x15d39  (0xad0b bytes)
**
71789(109 mod 256): MAPWRITE 0x22da2 thru 0x2c461 (0x96c0 bytes)
71790(110 mod 256): MAPREAD 0xe52e thru 0x10d99   (0x286c bytes)
71791(111 mod 256): MAPREAD 0x801e thru 0x14492   (0xc475 bytes)
******
71792(112 mod 256): MAPWRITE 0x222e5 thru 0x2760f (0x532b bytes)
71793(113 mod 256): MAPREAD 0xc9e3 thru 0xe947(0x1f65 bytes)
******
71794(114 mod 256): READ0x15618 thru 0x20a3e  (0xb427 bytes)
71795(115 mod 256): READ0x21fcf thru 0x27c2b  (0x5c5d bytes)
71796(116 mod 256): MAPREAD 0x4322 thru 0x13c95   (0xf974 bytes)
******
71797(117 mod 256): MAPREAD 0x28ce6 thru 0x2c461  (0x377c bytes)
71798(118 mod 256): MAPWRITE 0x6467 thru 0xeef2 (0x8a8c bytes)
**
71799(119 mod 256): READ0x140e4 thru 0x1a3f1  (0x630e bytes)
71800(120 mod 256): MAPWRITE 0x15fdf thru 0x2186b (0xb88d bytes)
71801(121 mod 256): WRITE   0x22d1b thru 0x23bfb  (0xee1 bytes)
71802(122 mod 256): WRITE   0x3ada8 thru 0x3  (0x5258 bytes)
HOLE
71803(123 mod 256): TRUNCATE DOWN   from 0x4 to 0x2276c
71804(124 mod 256): READ0x1d99c thru 0x2276b  (0x4dd0 bytes)
71805(125 mod 256): READ0x8b7 thru 0x1189 (0x8d3 bytes)
71806(126 mod 256): MAPWRITE 0xe5ed thru 0x117b0  (0x31c4 bytes)
71807(127 mod 256): MAPWRITE 0x17121 thru 0x1d177 (0x6057 bytes)
71808(128 mod 256): READ0x6033 thru 0x11909   (0xb8d7 bytes)
******
71809(129 mod 256): WRITE   0x3196d thru 0x3e669  (0xccfd bytes)
HOLE
71810(130 mod 256): READ0x217f thru 0x25ab(0x42d bytes)
71811(131 mod 256): MAPWRITE 0x39b78 thru 0x3a677 (0xb00 bytes)
71812(132 mod 256): TRUNCATE DOWN   from 0x3e66a to 0x330dc
71813(133 mod 256): TRUNCATE DOWN   from 0x330dc to 0x2ca2d
71814(134 mod 256): TRUNCATE DOWN   from 0x2ca2d to 0x184d1
71815(135 mod 256): TRUNCATE UP from 0x184d1 to 0x3f897
71816(136 mod 256): MAPREAD 0x1dd98 thru 0x2aaa8  (0xcd11 bytes)
71817(137 mod 256): MAPREAD 0x3ad15 thru 0x3f39f  (0x468b bytes)
71818(138 mod 256): WRITE   0x323a6 thru 0x3b549  (0x91a4 bytes)
71819(139 mod 256): TRUNCATE DOWN   from 0x3f897 to 0x200a7
71820(140 mod 256): TRUNCATE DOWN   from 0x200a7 to 0x169f7
71821(141 mod 256): MAPWRITE 0x23ac9 thru 0x26308 (0x2840 bytes)
71822(142 mod 256): WRITE   0xc67d thru 0xcee6(0x86a bytes)
71823(143 mod 256): READ0x1e85f thru 0x26012  (0x77b4 bytes)
71824(144 mod 256): WRITE   0x12f5a thru 0x1c19a  (0x9241 bytes)
71825(145 mod 256): MAPWRITE 0x2ee03 thru 0x30fdb (0x21d9 bytes)
71826(146 mod 256): READ0x2a481 thru 0x2f630  (0x51b0 bytes)
71827(147 mod 256): READ0x13e64 thru 0x1d7e6  (0x9983 bytes)
71828(148 mod 256): TRUNCATE DOWN   from 0x30fdc to 0x7c66
**
71829(149 mod 256): WRI

Re: [patch 003/152] jbd: fix commit of ordered data buffers

2006-09-29 Thread Badari Pulavarty

On Fri, 2006-09-29 at 11:02 +0200, Jan Kara wrote:
...
> > >+  }
> > >+  /* Someone already cleaned up the buffer? */
> > >+  if (!buffer_jbd(bh)
> > >+  || jh->b_transaction != commit_transaction
> > >+  || jh->b_jlist != BJ_SyncData) {
> > >+  jbd_unlock_bh_state(bh);
> > >+  if (locked)
> > >+  unlock_buffer(bh);
> > >+  BUFFER_TRACE(bh, "already cleaned up");
> > >+  put_bh(bh);
> > >+  continue;
>---> Here the buffer was refiled by someone else

I am little concerned about this particular code. We know that
some one else will do the unfile/remove - but we will keep
spinning on it till that happens. Isn't it ? Why don't we
skip it and move to next one ?

We are seeing few following message while running tests and
wondering if your patch is causing it ..

BUG: spinlock lockup on CPU#1, scp/30189, cfb503d8 (Not tainted)
Call Trace:
[C00018FDB320] [C00102E0] .show_stack+0x68/0x1b0 (unreliable)
[C00018FDB3C0] [C01734F4] ._raw_spin_lock+0x138/0x184
[C00018FDB460] [C025AD24] ._spin_lock+0x10/0x24
[C00018FDB4E0] [D0172E14] .journal_dirty_data+0xa4/0x2c0 [jbd]
[C00018FDB580] [D0205BAC] .ext3_journal_dirty_data+0x28/0x70 [ext3]
[C00018FDB610] [D02048BC] .walk_page_buffers+0xb0/0x134 [ext3]
[C00018FDB6D0] [D0208280] .ext3_ordered_commit_write+0x74/0x114 
[ext
3]
[C00018FDB780] [C007E710] .generic_file_buffered_write+0x51c/0x794
[C00018FDB930] 
[C007ECB8] .__generic_file_aio_write_nolock+0x330/0x3
bc
[C00018FDBA20] [C007EDBC] .generic_file_aio_write+0x78/0x104
[C00018FDBAE0] [D0202EFC] .ext3_file_write+0x2c/0xd4 [ext3]
[C00018FDBB70] [C00A8090] .do_sync_write+0xd4/0x130
[C00018FDBCF0] [C00A8E14] .vfs_write+0x118/0x200
[C00018FDBD90] [C00A9584] .sys_write+0x4c/0x8c
[C00018FDBE30] [C0008434] syscall_exit+0x0/0x40

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Fwd: Re: [patch 003/152] jbd: fix commit of ordered data buffers]

2006-09-28 Thread Badari Pulavarty


Forgot to CC: ext4 & lkml earlier..

Thanks,
Badari
--- Begin Message ---

[EMAIL PROTECTED] wrote:

From: Jan Kara <[EMAIL PROTECTED]>


[suitable for 2.6.18.x around the 2.6.19-rc2 timeframe]

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>
Cc: Badari Pulavarty <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 fs/jbd/commit.c |  182 --
 1 file changed, 113 insertions(+), 69 deletions(-)

diff -puN fs/jbd/commit.c~jbd-fix-commit-of-ordered-data-buffers fs/jbd/commit.c
--- a/fs/jbd/commit.c~jbd-fix-commit-of-ordered-data-buffers
+++ a/fs/jbd/commit.c
@@ -160,6 +160,117 @@ static int journal_write_commit_record(j
return (ret == -EIO);
 }
 
+static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)

+{
+   int i;
+
+   for (i = 0; i < bufs; i++) {
+   wbuf[i]->b_end_io = end_buffer_write_sync;
+   /* We use-up our safety reference in submit_bh() */
+   submit_bh(WRITE, wbuf[i]);
+   }
+}
+
+/*
+ *  Submit all the data buffers to disk
+ */
+static void journal_submit_data_buffers(journal_t *journal,
+   transaction_t *commit_transaction)
+{
+   struct journal_head *jh;
+   struct buffer_head *bh;
+   int locked;
+   int bufs = 0;
+   struct buffer_head **wbuf = journal->j_wbuf;
+
+   /*
+* Whenever we unlock the journal and sleep, things can get added
+* onto ->t_sync_datalist, so we have to keep looping back to
+* write_out_data until we *know* that the list is empty.
+*
+* Cleanup any flushed data buffers from the data list.  Even in
+* abort mode, we want to flush this out as soon as possible.
+*/
+write_out_data:
+   cond_resched();
+   spin_lock(&journal->j_list_lock);
+
+   while (commit_transaction->t_sync_datalist) {
+   jh = commit_transaction->t_sync_datalist;
  
Don't we need  


commit_transaction->t_sync_datalist = jh->b_tnext;

here to skip to the next "jh" for next time ?

Thanks,
Badari


+   bh = jh2bh(jh);
+   locked = 0;
+
+   /* Get reference just to make sure buffer does not disappear
+* when we are forced to drop various locks */
+   get_bh(bh);
+   /* If the buffer is dirty, we need to submit IO and hence
+* we need the buffer lock. We try to lock the buffer without
+* blocking. If we fail, we need to drop j_list_lock and do
+* blocking lock_buffer().
+*/
+   if (buffer_dirty(bh)) {
+   if (test_set_buffer_locked(bh)) {
+   BUFFER_TRACE(bh, "needs blocking lock");
+   spin_unlock(&journal->j_list_lock);
+   /* Write out all data to prevent deadlocks */
+   journal_do_submit_data(wbuf, bufs);
+   bufs = 0;
+   lock_buffer(bh);
+   spin_lock(&journal->j_list_lock);
+   }
+   locked = 1;
+   }
+   /* We have to get bh_state lock. Again out of order, sigh. */
+   if (!inverted_lock(journal, bh)) {
+   jbd_lock_bh_state(bh);
+   spin_lock(&journal->j_list_lock);
+   }
+   /* Someone already cleaned up the buffer? */
+   if (!buffer_jbd(bh)
+   || jh->b_transaction != commit_transaction
+   || jh->b_jlist != BJ_SyncData) {
+   jbd_unlock_bh_state(bh);
+   if (locked)
+   unlock_buffer(bh);
+   BUFFER_TRACE(bh, "already cleaned up");
+   put_bh(bh);
+   continue;
+   }
+   if (locked && test_clear_buffer_dirty(bh)) {
+   BUFFER_TRACE(bh, "needs writeout, adding to array");
+   wbuf[bufs++] = bh;
+   __journal_file_buffer(jh, commit_transaction,
+   BJ_Locked);
+   jbd_unlock_bh_state(bh);
+   if (bufs == journal->j_wbufsize) {
+   spin_unlock(&journal->j_list_lock);
+   journal_do_submit_data(wbuf, bufs);
+   bufs = 0;
+   goto write_out_data;
+   }
+   }
+   else {
+   BUFFER_TRACE(bh, "write

[PATCH] ext3 sequential read regression fix

2006-09-15 Thread Badari Pulavarty

ext3-get-blocks support caused ~20% degrade in Sequential read
performance (tiobench). Problem is with marking the buffer boundary
so IO can be submitted right away. Here is the patch to fix it.

2.6.18-rc6:
---
# ./iotest
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 75.2726 seconds, 57.1 MB/s

real1m15.285s
user0m0.276s
sys 0m3.884s


2.6.18-rc6 + fix:
-
[EMAIL PROTECTED] ~]# ./iotest
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 62.9356 seconds, 68.2 MB/s


The boundary block check in ext3_get_blocks_handle needs to be adjusted
against the count of blocks mapped in this call, now that it can map
more than one block.

Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]>
Tested-by: Badari Pulavarty <[EMAIL PROTECTED]>

 linux-2.6.18-rc5-suparna/fs/ext3/inode.c |2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff -puN fs/ext3/inode.c~ext3-multiblock-boundary-fix fs/ext3/inode.c
--- linux-2.6.18-rc5/fs/ext3/inode.c~ext3-multiblock-boundary-fix   
2006-09-15 10:53:12.0 +0530
+++ linux-2.6.18-rc5-suparna/fs/ext3/inode.c2006-09-15 10:54:30.0 
+0530
@@ -925,7 +925,7 @@ int ext3_get_blocks_handle(handle_t *han
set_buffer_new(bh_result);
 got_it:
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
-   if (blocks_to_boundary == 0)
+   if (count > blocks_to_boundary)
set_buffer_boundary(bh_result);
err = count;
/* Clean up and exit */



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ext3 sequential read performance (~20%) degrade

2006-09-15 Thread Badari Pulavarty

On Fri, 2006-09-15 at 11:20 +0530, Suparna Bhattacharya wrote:
> On Thu, Sep 14, 2006 at 05:03:08PM -0700, Andrew Morton wrote:
> > On Thu, 14 Sep 2006 16:36:12 -0700
> > Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> > 
> > > Hi Andrew,
> > > 
> > > I have been working on tracking down ~20% performance degrade for
> > > sequential read performance for ext3. 
> > 
> > oop.  I'd kinda prefer that we discover things like this before the patch
> > gets into mainline.
> > 
> > > Finally narrowed it down to get_blocks() support. If I force
> > > ext3_get_blocks_handle() to always return 1 block - I get better
> > > IO rate. I did all the usual stuff, tracked down requests, traced
> > > blocksizes, looked at readahead code, looked at mpage_readpages()
> > > etc.. I still can't figure out how to explain the degrade..
> > > 
> > > Any suggestions on how to track it down.
> > 
> > Learn to driver Jens's blktrace stuff, find out why the IO scheduling went
> > bad.
> > 
> > Number one suspicion: the buffer_boundary() stuff isn't working.
> 
> I think you are right about that  - perhaps something along
> the lines of the following patch (untested) would help ?

Yep. It works :)

Thanks,
Badari



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

ext3 sequential read performance (~20%) degrade

2006-09-14 Thread Badari Pulavarty

Hi Andrew,

I have been working on tracking down ~20% performance degrade for
sequential read performance for ext3. 

Finally narrowed it down to get_blocks() support. If I force
ext3_get_blocks_handle() to always return 1 block - I get better
IO rate. I did all the usual stuff, tracked down requests, traced
blocksizes, looked at readahead code, looked at mpage_readpages()
etc.. I still can't figure out how to explain the degrade..

Any suggestions on how to track it down.

Thanks,
Badari

# cat iotest
mount /dev/sdb2 /mnt/tmp
time dd if=/mnt/tmp/testfile of=/dev/null bs=4k count=1048576
umount /mnt/tmp

2.6.18-rc6: (multiblock):

# ./iotest
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 75.2654 seconds, 57.1 MB/s

real1m15.282s
user0m0.248s
sys 0m4.292s

2.6.18-rc6 (force single block in ext3_get_blocks_handle():

# ./iotest
./iotest
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 62.9472 seconds, 68.2 MB/s

real1m2.976s
user0m0.268s
sys 0m4.280s


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-11 Thread Badari Pulavarty

On Mon, 2006-09-11 at 11:46 +0200, Jan Kara wrote:
...
> > 
> > I don't have any performance tests handy. We have some automated tests I 
> > can schedule to run to verify the stability aspects.
>   OK. I've run IOZONE rewrite throughput test on my computer with
> iozone -t 10 -i 0 -s 10M -e
>   2.6.18-rc6 and the same kernel + my patch seem to give almost the same
> results. The strange thing was that both in vanilla and patched kernel there
> were several runs where a write througput (when iozone was creating the file)
> was suddenly 10% of the usual value (18MB/s vs. 2MB/s). The rewrite numbers
> were always fine. Maybe that has something to do with block allocation
> code. Anyway, it is not a regression of my patch so unless your test
> finds some problem I think the patch should be ready for inclusion into
> -mm...

Your patch seems to be working fine. I haven't found any major
regression yet. 

I spent lot of time trying to reproduce the problem with buffer-debug
Andrew sent out - I really wanted to get to bottom of whats really
happening here (since your patch made it go away).

Yes. Your theory is correct. journal_dirty_data() is moving the
buffer-head from commited transaction to current one and
journal_unmap_buffer() is discarding and cleaning up the buffer-head.
Later set_page_dirty() dirties the buffer-head there by causing
BUG() in submit_bh().

Here is the buffer-trace-debug output to confirm it. I can sleep better
now :) Now we can figure out, if your fix is the right one or not ..

Thanks,
Badari


buffer trace for buffer at 0x8101d8540678 (I am CPU 3)
 journal_dirty_data():[fs/jbd/transaction.c:939] entry
 b_state:0x402d b_jlist:BJ_SyncData cpu:3 b_count:15 b_blocknr:45037
 b_jbd:1 b_frozen_data: b_committed_data:
 b_transaction:1 b_next_transaction:0 b_cp_transaction:0 
b_trans_is_running:0
 b_trans_is_comitting:1 b_jcount:1 pg_dirty:1
 journal_dirty_data():[fs/jbd/transaction.c:971] has transaction
 b_state:0x10402d b_jlist:BJ_SyncData cpu:3 b_count:15 b_blocknr:45037
 b_jbd:1 b_frozen_data: b_committed_data:
 b_transaction:1 b_next_transaction:0 b_cp_transaction:0 
b_trans_is_running:0
 b_trans_is_comitting:1 b_jcount:1 pg_dirty:1
 journal_dirty_data():[fs/jbd/transaction.c:973] belongs to older transaction
 b_state:0x10402d b_jlist:BJ_SyncData cpu:3 b_count:15 b_blocknr:45037
 b_jbd:1 b_frozen_data: b_committed_data:
 b_transaction:1 b_next_transaction:0 b_cp_transaction:0 
b_trans_is_running:0
 b_trans_is_comitting:1 b_jcount:1 pg_dirty:1
 journal_dirty_data():[fs/jbd/transaction.c:1037] unfile from commit
 b_state:0x10402d b_jlist:BJ_SyncData cpu:3 b_count:15 b_blocknr:45037
 b_jbd:1 b_frozen_data: b_committed_data:
 b_transaction:1 b_next_transaction:0 b_cp_transaction:0 
b_trans_is_running:0
 b_trans_is_comitting:1 b_jcount:1 pg_dirty:1
 __journal_temp_unlink_buffer():[fs/jbd/transaction.c:1501] entry
 b_state:0x10402d b_jlist:BJ_SyncData cpu:3 b_count:15 b_blocknr:45037
 b_jbd:1 b_frozen_data: b_committed_data:
 b_transaction:1 b_next_transaction:0 b_cp_transaction:0 
b_trans_is_running:0
 b_trans_is_comitting:1 b_jcount:1 pg_dirty:1
 journal_dirty_data():[fs/jbd/transaction.c:1055] not on correct data list: 
unfile
 b_state:0x10402d b_jlist:BJ_None cpu:3 b_count:15 b_blocknr:45037
 b_jbd:1 b_frozen_data: b_committed_data:
 b_transaction:1 b_next_transaction:0 b_cp_transaction:0 
b_trans_is_running:1
 b_trans_is_comitting:0 b_jcount:1 pg_dirty:1
 __journal_temp_unlink_buffer():[fs/jbd/transaction.c:1501] entry
 b_state:0x10402d b_jlist:BJ_None cpu:3 b_count:15 b_blocknr:45037
 b_jbd:1 b_frozen_data: b_committed_data:
 b_transaction:1 b_next_transaction:0 b_cp_transaction:0 
b_trans_is_running:1
 b_trans_is_comitting:0 b_jcount:1 pg_dirty:1
 journal_dirty_data():[fs/jbd/transaction.c:1059] file as data
 b_state:0x10402d b_jlist:BJ_None cpu:3 b_count:15 b_blocknr:45037
 b_jbd:1 b_frozen_data: b_committed_data:
 b_transaction:1 b_next_transaction:0 b_cp_transaction:0 
b_trans_is_running:1
 b_trans_is_comitting:0 b_jcount:1 pg_dirty:1
 __journal_file_buffer():[fs/jbd/transaction.c:1948] entry
 b_state:0x10402d b_jlist:BJ_None cpu:3 b_count:15 b_blocknr:45037
 b_jbd:1 b_frozen_data: b_committed_data:
 b_transaction:1 b_next_transaction:0 b_cp_transaction:0 
b_trans_is_running:1
 b_trans_is_comitting:0 b_jcount:1 pg_dirty:1
 __journal_temp_unlink_buffer():[fs/jbd/transaction.c:1501] entry
 b_state:0x10402d b_jlist:BJ_None cpu:3 b_count:15 b_blocknr:45037
 b_jbd:1 b_frozen_data: b_committed_data:
 b_transaction:1 b_n

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-08 Thread Badari Pulavarty


Jan Kara wrote:

  Hi,

  

Jan Kara wrote:


 I've been looking more at the code and I have revived my patch fixing
this part of the code. I've mildly tested the patch. Could you also give
it a try? Thanks.

Honza
 



Original commit code assumes, that when a buffer on BJ_SyncData list is 
locked,
it is being written to disk. But this is not true and hence it can lead to 
a

potential data loss on crash. Also the code didn't count with the fact that
journal_dirty_data() can steal buffers from committing transaction and 
hence

could write buffers that no longer belong to the committing transaction.
Finally it could possibly happen that we tried writing out one buffer 
several

times.

The patch below tries to solve these problems by a complete rewrite of the 
data

commit code. We go through buffers on t_sync_datalist, lock buffers needing
write out and store them in an array. Buffers are also immediately refiled 
to
BJ_Locked list or unfiled (if the write out is completed). When the array 
is
full or we have to block on buffer lock, we submit all accumulated buffers 
for

IO.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>

 
  
I have been running 4+ hours with this patch and seems to work fine. I 
haven't hit any

assert yet :)

I will let it run till tomorrow. I will let you know, how it goes.


  Great, thanks. BTW: Do you have any performance tests handy? The
changes are big enough to cause some unexpected performance regressions,
livelocks... If you don't have anything ready, I can setup and run
something myself.  Just that I don't like this testing too much ;).
  

Tests are still running fine.

I don't have any performance tests handy. We have some automated tests I 
can schedule

to run to verify the stability aspects.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-07 Thread Badari Pulavarty


Jan Kara wrote:

  Ugh! Are you sure? For this path the buffer must be attached (only) to
the running transaction. But then how the commit code comes to it?
Somebody would have to even manage to refile the buffer from the
committing transaction to the running one while the buffer is in wbuf[].
Could you check whether someone does __journal_refile_buffer() on your
marked buffers, please? Or whether we move buffer to BJ_Locked list in
the write_out_data: loop? Thanks.


  

I added more debug in __journal_refile_buffer() to see if the marked
buffers are getting refiled. I am able to reproduce the problem,
but I don't see any debug including my original prints. (It looks as 
if none of my debug code exists) - its really confusing. 


I will keep looking and get back to you.


  I've been looking more at the code and I have revived my patch fixing
this part of the code. I've mildly tested the patch. Could you also give
it a try? Thanks.

Honza
  



Original commit code assumes, that when a buffer on BJ_SyncData list is locked,
it is being written to disk. But this is not true and hence it can lead to a
potential data loss on crash. Also the code didn't count with the fact that
journal_dirty_data() can steal buffers from committing transaction and hence
could write buffers that no longer belong to the committing transaction.
Finally it could possibly happen that we tried writing out one buffer several
times.

The patch below tries to solve these problems by a complete rewrite of the data
commit code. We go through buffers on t_sync_datalist, lock buffers needing
write out and store them in an array. Buffers are also immediately refiled to
BJ_Locked list or unfiled (if the write out is completed). When the array is
full or we have to block on buffer lock, we submit all accumulated buffers for
IO.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>

  
I have been running 4+ hours with this patch and seems to work fine. I 
haven't hit any

assert yet :)

I will let it run till tomorrow. I will let you know, how it goes.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ext3_getblk should handle HOLE correctly

2006-09-07 Thread Badari Pulavarty

On Thu, 2006-09-07 at 11:45 -0700, Andrew Morton wrote:
> On Wed, 06 Sep 2006 10:39:06 -0700
> Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> 
> > Hi Andrew,
> > 
> > Its been reported that ext3_getblk() is not doing the right thing
> > and triggering following WARN():
> > 
> > BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
> >   ext3_getblk+0x98/0x2a6   md_wakeup_thread
> > +0x26/0x2a
> >   ext3_bread+0x1f/0x88   ext3_quota_read+0x136/0x1ae
> >   v1_read_dqblk+0x61/0xac   dquot_acquire+0xf6/0x107
> >   ext3_acquire_dquot+0x46/0x68   dqget+0x155/0x1e7
> >   dquot_transfer+0x3e0/0x3e9   dput+0x23/0x13e
> >   ext3_setattr+0xc3/0x240   current_fs_time
> > +0x52/0x6a
> >   notify_change+0x2bd/0x30d   chown_common+0x9c/0xc5
> >   strncpy_from_user+0x3b/0x68   do_path_lookup
> > +0xdf/0x266
> >   __user_walk_fd+0x44/0x5a   sys_chown+0x4a/0x55
> >   vfs_write+0xe7/0x13c   sys_mkdir+0x1f/0x23
> >   syscall_call+0x7/0xb 
> > 
> > Looking at the code, it looks like its not handle HOLE correctly.
> > It ends up returning -EIO.
> 
> Strange.  The fs should be spewing these warnings all over the place.  For
> some reason this code is hard to trigger.  Why??

I guess - ext3_getblk() mostly used by ext3_bread() and most callers 
to it would be reading already allocated block.

> 
> > -   if (err == 1) {
> > +   /*
> > +* ext3_get_blocks_handle() returns number of blocks
> > +* mapped. 0 in case of a HOLE.
> > +*/
> > +   if (err > 0) {
> > err = 0;
> > -   } else if (err >= 0) {
> > -   WARN_ON(1);
> > -   err = -EIO;
> > }
> 
> That removes the warning if ext3_get_blocks_handle() returned a positive
> number greater than one.  And it looks like we still need debugging support
> in this area.

I am not sure why we need it ? All we care about is one block. If
ext3_get_blocks_handle() returns more than one (which it shouldn't) -
it still be okay. Whats wrong with that ? Just curious ..

May be we should add a WARN() in ext3_get_blocks_handle() when it
returns more than asked for.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-07 Thread Badari Pulavarty


>   Ugh! Are you sure? For this path the buffer must be attached (only) to
> the running transaction. But then how the commit code comes to it?
> Somebody would have to even manage to refile the buffer from the
> committing transaction to the running one while the buffer is in wbuf[].
> Could you check whether someone does __journal_refile_buffer() on your
> marked buffers, please? Or whether we move buffer to BJ_Locked list in
> the write_out_data: loop? Thanks.
> 
>   

I added more debug in __journal_refile_buffer() to see if the marked
buffers are getting refiled. I am able to reproduce the problem,
but I don't see any debug including my original prints. (It looks as 
if none of my debug code exists) - its really confusing. 

I will keep looking and get back to you.

I may try Andrew's buffer debug patch - if I get desperate.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-06 Thread Badari Pulavarty


Andrew Morton wrote:

On Wed, 6 Sep 2006 19:27:33 +0200
Jan Kara <[EMAIL PROTECTED]> wrote:

  

  Ugh! Are you sure? For this path the buffer must be attached (only) to
the running transaction. But then how the commit code comes to it?
Somebody would have to even manage to refile the buffer from the
committing transaction to the running one while the buffer is in wbuf[].
Could you check whether someone does __journal_refile_buffer() on your
marked buffers, please? Or whether we move buffer to BJ_Locked list in
the write_out_data: loop? Thanks.



The ext3-debug patch will help here.  It records, within the bh, the inputs
from the last 32 BUFFER_TRACE()s which were run against this bh.  If a
J_ASSERT fails then you get a nice trace of the last 32 "things" which
happened to this bh, including the bh's state at that transition.  It
basically tells you everything you need to know to find the bug.

It's worth spending the time to become familiar with it - I used it a lot in
early ext3 development.
  


I will try the patch. Unfortunately, adding more debug is causing the 
problem reproduction

difficult.

I've been unable to reproduce this crash, btw.  Is there some magic
incantation apat from running `fsx-linux'?
  
All I do is on a single 1k filesystem, run 4 copies of fsx (on 4 
different files, ofcourse).

I hit the assert anywhere between 10min-2hours.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] ext3_getblk should handle HOLE correctly

2006-09-06 Thread Badari Pulavarty

Hi Andrew,

Its been reported that ext3_getblk() is not doing the right thing
and triggering following WARN():

BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
  ext3_getblk+0x98/0x2a6   md_wakeup_thread
+0x26/0x2a
  ext3_bread+0x1f/0x88   ext3_quota_read+0x136/0x1ae
  v1_read_dqblk+0x61/0xac   dquot_acquire+0xf6/0x107
  ext3_acquire_dquot+0x46/0x68   dqget+0x155/0x1e7
  dquot_transfer+0x3e0/0x3e9   dput+0x23/0x13e
  ext3_setattr+0xc3/0x240   current_fs_time
+0x52/0x6a
  notify_change+0x2bd/0x30d   chown_common+0x9c/0xc5
  strncpy_from_user+0x3b/0x68   do_path_lookup
+0xdf/0x266
  __user_walk_fd+0x44/0x5a   sys_chown+0x4a/0x55
  vfs_write+0xe7/0x13c   sys_mkdir+0x1f/0x23
  syscall_call+0x7/0xb 

Looking at the code, it looks like its not handle HOLE correctly.
It ends up returning -EIO. Here is the patch to fix it.

If we really want to be paranoid, we can allow return values
0 (HOLE), 1 (we asked for one block) and return -EIO for
more than 1 block. But I really don't see a reason for
doing it - all we need is the block# here. (doesn't matter
how many blocks are mapped).

Thanks,
Badari

ext3_get_blocks_handle() returns number of blocks it mapped.
It returns 0 in case of HOLE. ext3_getblk() should handle
HOLE properly (currently its dumping warning stack and
returning -EIO).

Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
Acked-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext3/inode.c |9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

Index: linux-2.6.18-rc5/fs/ext3/inode.c
===
--- linux-2.6.18-rc5.orig/fs/ext3/inode.c   2006-08-27 20:41:48.0 
-0700
+++ linux-2.6.18-rc5/fs/ext3/inode.c2006-09-05 15:32:57.0 -0700
@@ -1009,11 +1009,12 @@ struct buffer_head *ext3_getblk(handle_t
buffer_trace_init(&dummy.b_history);
err = ext3_get_blocks_handle(handle, inode, block, 1,
&dummy, create, 1);
-   if (err == 1) {
+   /*
+* ext3_get_blocks_handle() returns number of blocks
+* mapped. 0 in case of a HOLE.
+*/
+   if (err > 0) {
err = 0;
-   } else if (err >= 0) {
-   WARN_ON(1);
-   err = -EIO;
}
*errp = err;
if (!err && buffer_mapped(&dummy)) {


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-06 Thread Badari Pulavarty

On Wed, 2006-09-06 at 18:27 +0200, Jan Kara wrote:
> > On Wed, 2006-09-06 at 17:34 +0200, Jan Kara wrote:
> > > > On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote:
> > > > 
> > > > > > Andrew, what should we do ? Do you suggest handling this in jbd
> > > > > > itself (like this patch) ?
> > > > >   Actually that part of commit code needs rewrite anyway (and after 
> > > > > that
> > > > > rewrite you get rid of ll_rw_block()) because of other problems - the
> > > > > code assumes that whenever buffer is locked, it is being written to 
> > > > > disk
> > > > > which is not true... I have some preliminary patches for that but they
> > > > > are not very nice and so far I didn't have enough time to find a nice
> > > > > solution.
> > > > 
> > > > Are you okay with current not-so-elegant fix ? 
> > >   Actually I don't quite understand how it can happen what you describe
> > > (so probably I missed something). How it can happen that some buffers
> > > are unmapped while we are committing them?  journal_unmap_buffers()
> > > checks whether we are not committing truncated buffers and if so, it
> > > does not do anything to such buffers...
> > >   Bye
> > >   Honza
> > 
> > Yep. I spent lot of time trying to understand - why they are not
> > getting skipped :(
> > 
> > But my debug clearly shows that we are clearing the buffer, while
> > we haven't actually submitted to ll_rw_block() code. (I added "track"
> > flag to bh and set it in journal_commit_transaction() when we add
> > them to wbuf[] and clear it in ll_rw_block() after submit. I checked
> > for this flag in journal_unmap_buffer() while clearing the buffer).
> > Here is what my debug shows:
> > 
> > buffer is tracked bh 8101686ea850 size 1024 
> > 
> > Call Trace:
> >  [] show_trace+0xb5/0x370
> >  [] dump_stack+0x15/0x20
> >  [] journal_invalidatepage+0x314/0x3b0
>   I see just journal_invalidatepage() here. That is fine. It calls
> journal_unmap_buffer() which should do nothing return 0. If it does
> not it would be IMO bug.. If the buffer is really unmapped here, in what
> state it is (i.e. which list is it on?).

Okay.. here is the path its taking according to my debug ..
Remember, the issue is: after the buffer is cleaned - they are
still (left) attached to the page (since a page can have 4
buffer heads and we partially truncated the page). After
we clean up the buffers any subsequent call to set_page_dirty()
would end up marking *all* the buffers dirty. If ll_rw_block()
happens after this, we will run into the assert. If no 
set_page_dirty() happens before ll_rw_block() happens, things
would be fine - as the buffer won't be dirty and be skipped.


journal_unmap_buffer() 
{
.
} else {
/* Good, the buffer belongs to the running transaction.
 * We are writing our own transaction's data, not any
 * previous one's, so it is safe to throw it away
 * (remember that we expect the filesystem to have set
 * i_size already for this truncate so recovery will not
 * expose the disk blocks we are discarding here.) */
J_ASSERT_JH(jh, transaction == journal-
>j_running_transaction);
may_free = __dispose_buffer(jh, transaction);
}
zap_buffer:
journal_put_journal_head(jh);
zap_buffer_no_jh:
spin_unlock(&journal->j_list_lock);
jbd_unlock_bh_state(bh);
spin_unlock(&journal->j_state_lock);
zap_buffer_unlocked:
clear_buffer_dirty(bh);
J_ASSERT_BH(bh, !buffer_jbddirty(bh));
clear_buffer_mapped(bh);
clear_buffer_req(bh);
clear_buffer_new(bh);
bh->b_bdev = NULL;
return may_free;
}



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-06 Thread Badari Pulavarty

On Wed, 2006-09-06 at 18:27 +0200, Jan Kara wrote:
> > On Wed, 2006-09-06 at 17:34 +0200, Jan Kara wrote:
> > > > On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote:
> > > > 
> > > > > > Andrew, what should we do ? Do you suggest handling this in jbd
> > > > > > itself (like this patch) ?
> > > > >   Actually that part of commit code needs rewrite anyway (and after 
> > > > > that
> > > > > rewrite you get rid of ll_rw_block()) because of other problems - the
> > > > > code assumes that whenever buffer is locked, it is being written to 
> > > > > disk
> > > > > which is not true... I have some preliminary patches for that but they
> > > > > are not very nice and so far I didn't have enough time to find a nice
> > > > > solution.
> > > > 
> > > > Are you okay with current not-so-elegant fix ? 
> > >   Actually I don't quite understand how it can happen what you describe
> > > (so probably I missed something). How it can happen that some buffers
> > > are unmapped while we are committing them?  journal_unmap_buffers()
> > > checks whether we are not committing truncated buffers and if so, it
> > > does not do anything to such buffers...
> > >   Bye
> > >   Honza
> > 
> > Yep. I spent lot of time trying to understand - why they are not
> > getting skipped :(
> > 
> > But my debug clearly shows that we are clearing the buffer, while
> > we haven't actually submitted to ll_rw_block() code. (I added "track"
> > flag to bh and set it in journal_commit_transaction() when we add
> > them to wbuf[] and clear it in ll_rw_block() after submit. I checked
> > for this flag in journal_unmap_buffer() while clearing the buffer).
> > Here is what my debug shows:
> > 
> > buffer is tracked bh 8101686ea850 size 1024 
> > 
> > Call Trace:
> >  [] show_trace+0xb5/0x370
> >  [] dump_stack+0x15/0x20
> >  [] journal_invalidatepage+0x314/0x3b0
>   I see just journal_invalidatepage() here. That is fine. It calls
> journal_unmap_buffer() which should do nothing return 0. If it does
> not it would be IMO bug.. If the buffer is really unmapped here, in what
> state it is (i.e. which list is it on?).
> 
Acutally, I added dump_stack() in journal_unmap_buffer() when it
does clear_buffer_mapped(). gcc must of pulled in the function ..
I will add more debug to track the list bh came from.

- Badari


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-06 Thread Badari Pulavarty

On Wed, 2006-09-06 at 17:34 +0200, Jan Kara wrote:
> > On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote:
> > 
> > > > Andrew, what should we do ? Do you suggest handling this in jbd
> > > > itself (like this patch) ?
> > >   Actually that part of commit code needs rewrite anyway (and after that
> > > rewrite you get rid of ll_rw_block()) because of other problems - the
> > > code assumes that whenever buffer is locked, it is being written to disk
> > > which is not true... I have some preliminary patches for that but they
> > > are not very nice and so far I didn't have enough time to find a nice
> > > solution.
> > 
> > Are you okay with current not-so-elegant fix ? 
>   Actually I don't quite understand how it can happen what you describe
> (so probably I missed something). How it can happen that some buffers
> are unmapped while we are committing them?  journal_unmap_buffers()
> checks whether we are not committing truncated buffers and if so, it
> does not do anything to such buffers...
>   Bye
>   Honza

Yep. I spent lot of time trying to understand - why they are not
getting skipped :(

But my debug clearly shows that we are clearing the buffer, while
we haven't actually submitted to ll_rw_block() code. (I added "track"
flag to bh and set it in journal_commit_transaction() when we add
them to wbuf[] and clear it in ll_rw_block() after submit. I checked
for this flag in journal_unmap_buffer() while clearing the buffer).
Here is what my debug shows:

buffer is tracked bh 8101686ea850 size 1024 

Call Trace:
 [] show_trace+0xb5/0x370
 [] dump_stack+0x15/0x20
 [] journal_invalidatepage+0x314/0x3b0
 [] ext3_invalidatepage+0x38/0x40
 [] do_invalidatepage+0x20/0x30
 [] truncate_complete_page+0x23/0x50
 [] truncate_inode_pages_range+0xcd/0x300
 [] truncate_inode_pages+0x10/0x20
 [] vmtruncate+0x5f/0x100
 [] inode_setattr+0x30/0x140
 [] ext3_setattr+0x1bb/0x230
 [] notify_change+0x15e/0x320
 [] do_truncate+0x53/0x80
 [] sys_ftruncate+0xf8/0x130
 [] system_call+0x7e/0x83

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-06 Thread Badari Pulavarty

On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote:
>   Hi,
> 
> > On Fri, 2006-09-01 at 10:18 -0700, Andrew Morton wrote:
> > > > > > Kernel BUG at fs/buffer.c:2791
> > > > > > invalid opcode:  [1] SMP
> > > > > > 
> > > > > > Its complaining about BUG_ON(!buffer_mapped(bh)).
> > 
> > Here is the change that seems to cause the problem. Jana Kara
> > introduced a new mode "SWRITE" for ll_rw_block() - where it
> > waits for buffer to be unlocked (WRITE will skip locked
> > buffers) + jbd journal_commit_transaction() has been changed
> > to make use of SWRITE.
> > 
> > http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112109788702895&w=2
> > 
> > Theoritically same race (between journal_commit_transaction() and
> > journal_unmap_buffer()+set_page_dirty()) could exist before his changes
> > - which should trigger bug in submit_bh(). But I can't seem to hit
> > it without his changes. My guess is ll_rw_block() is always skipping
> > those cleaned up buffers, before page gets dirtied again ..
>   I think that the change to ll_rw_block() just widens the window much
> more...
> 

Yes.

> > Andrew, what should we do ? Do you suggest handling this in jbd
> > itself (like this patch) ?
>   Actually that part of commit code needs rewrite anyway (and after that
> rewrite you get rid of ll_rw_block()) because of other problems - the
> code assumes that whenever buffer is locked, it is being written to disk
> which is not true... I have some preliminary patches for that but they
> are not very nice and so far I didn't have enough time to find a nice
> solution.

Are you okay with current not-so-elegant fix ? 

Thanks,
Badari

Patch to fix: Kernel BUG at fs/buffer.c:2791
on 1k (2k) filesystems while running fsx.

journal_commit_transaction collects lots of dirty buffer from
and does a single ll_rw_block() to write them out. ll_rw_block()
locks the buffer and checks to see if they are dirty and submits
them for IO.

In the mean while, journal_unmap_buffers() as part of
truncate can unmap the buffer and throw it away. Since its
a 1k (2k) filesystem - each page (4k) will have more than
one buffer_head attached to the page and and we can't free 
up buffer_heads attached to the page (if we are not
invalidating the whole page).

Now, any call to set_page_dirty() (like msync_interval)
could end up setting all the buffer heads attached to
this page again dirty, including the ones those got
cleaned up :(

So, -not-so-elegant fix would be to have make jbd skip all 
the buffers that are not mapped.

Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
---
 fs/jbd/commit.c |   32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

Index: linux-2.6.18-rc5/fs/jbd/commit.c
===
--- linux-2.6.18-rc5.orig/fs/jbd/commit.c   2006-08-27 20:41:48.0 
-0700
+++ linux-2.6.18-rc5/fs/jbd/commit.c2006-09-01 10:43:54.0 -0700
@@ -160,6 +160,34 @@ static int journal_write_commit_record(j
return (ret == -EIO);
 }
 
+static void jbd_write_buffers(int nr, struct buffer_head *bhs[])
+{
+   int i;
+
+   for (i = 0; i < nr; i++) {
+   struct buffer_head *bh = bhs[i];
+
+   lock_buffer(bh);
+
+   /*
+* case 1: Buffer could have been flushed by now,
+* if so nothing to do for us.
+* case 2: Buffer could have been unmapped up by
+* journal_unmap_buffer - but still attached to the
+* page. Any calls to set_page_dirty() would dirty
+* the buffer even though its not mapped.  If so,
+* we need to skip them.
+*/
+   if (buffer_mapped(bh) && test_clear_buffer_dirty(bh)) {
+   bh->b_end_io = end_buffer_write_sync;
+   get_bh(bh);
+   submit_bh(WRITE, bh);
+   continue;
+   }
+   unlock_buffer(bh);
+   }
+}
+
 /*
  * journal_commit_transaction
  *
@@ -356,7 +384,7 @@ write_out_data:
jbd_debug(2, "submit %d writes\n",
bufs);
spin_unlock(&journal->j_list_lock);
-   ll_rw_block(SWRITE, bufs, wbuf);
+   jbd_write_buffers(bufs, wbuf);
journal_brelse_array(wbuf, bufs);
bufs = 0;
goto write_out_data;
@@ -379,7 +407,7 @@ write_out_data:
 
if (bufs) {

Re: BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()

2006-09-05 Thread Badari Pulavarty

On Tue, 2006-09-05 at 17:47 -0400, Will Simoneau wrote:
> On 14:06 Tue 05 Sep , Badari Pulavarty wrote:
> > Will Simoneau wrote:
> > >Has anyone seen this before? These three traces occured at different times
> > >today when three new user accounts (and associated quotas) were created. 
> > >This
> > >machine is an NFS server which uses quotas on an ext3 fs (dir_index is on).
> > >Kernel is 2.6.17.11 on an x86 smp w/64G highmem; 4G ram is installed. The
> > >affected filesystem is on a software raid1 of two hardware raid0 volumes 
> > >from a
> > >megaraid card.
> > >
> > >BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
> > >  ext3_getblk+0x98/0x2a6   md_wakeup_thread+0x26/0x2a
> > >  ext3_bread+0x1f/0x88   ext3_quota_read+0x136/0x1ae
> > >  v1_read_dqblk+0x61/0xac   dquot_acquire+0xf6/0x107
> > >  ext3_acquire_dquot+0x46/0x68   dqget+0x155/0x1e7
> > >  dquot_transfer+0x3e0/0x3e9   dput+0x23/0x13e

I think, we found your problem.

ext3_getblk() is not handling HOLE correctly. Does this patch help ?
Mingming, what do you think ?

Thanks,
Badari

ext3_get_blocks_handle() returns number of blocks it mapped.
It returns 0 in case of HOLE. ext3_getblk() should handle
HOLE properly (currently its dumping warning stack and
returning -EIO).

Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
---
 fs/ext3/inode.c |9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

Index: linux-2.6.18-rc5/fs/ext3/inode.c
===
--- linux-2.6.18-rc5.orig/fs/ext3/inode.c   2006-08-27 20:41:48.0 
-0700
+++ linux-2.6.18-rc5/fs/ext3/inode.c2006-09-05 15:32:57.0 -0700
@@ -1009,11 +1009,12 @@ struct buffer_head *ext3_getblk(handle_t
buffer_trace_init(&dummy.b_history);
err = ext3_get_blocks_handle(handle, inode, block, 1,
&dummy, create, 1);
-   if (err == 1) {
+   /*
+* ext3_get_blocks_handle() returns number of blocks
+* mapped. 0 in case of a HOLE.
+*/
+   if (err > 0) {
err = 0;
-   } else if (err >= 0) {
-   WARN_ON(1);
-   err = -EIO;
}
*errp = err;
if (!err && buffer_mapped(&dummy)) {


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()

2006-09-05 Thread Badari Pulavarty


Will Simoneau wrote:

Has anyone seen this before? These three traces occured at different times
today when three new user accounts (and associated quotas) were created. This
machine is an NFS server which uses quotas on an ext3 fs (dir_index is on).
Kernel is 2.6.17.11 on an x86 smp w/64G highmem; 4G ram is installed. The
affected filesystem is on a software raid1 of two hardware raid0 volumes from a
megaraid card.

BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
  ext3_getblk+0x98/0x2a6   md_wakeup_thread+0x26/0x2a
  ext3_bread+0x1f/0x88   ext3_quota_read+0x136/0x1ae
  v1_read_dqblk+0x61/0xac   dquot_acquire+0xf6/0x107
  ext3_acquire_dquot+0x46/0x68   dqget+0x155/0x1e7
  dquot_transfer+0x3e0/0x3e9   dput+0x23/0x13e
 
Made me curious and looking around on what the warning is coming ? Few 
basic questions ..

Do you have CONFIG_LBD ?

I see the ext3_getblk() used "long" for "block" & 
ext3_get_blocks_handle() expects "sector_t"
for "block". Wondering if you are running into 64-bit -to- 32-bit 
conversion issues .. ?


Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()

2006-09-05 Thread Badari Pulavarty


Dave Kleikamp wrote:

On Tue, 2006-09-05 at 11:09 -0700, Badari Pulavarty wrote:
  

On Tue, 2006-09-05 at 13:10 -0400, Will Simoneau wrote:


Has anyone seen this before? These three traces occured at different times
today when three new user accounts (and associated quotas) were created. This
machine is an NFS server which uses quotas on an ext3 fs (dir_index is on).
Kernel is 2.6.17.11 on an x86 smp w/64G highmem; 4G ram is installed. The
affected filesystem is on a software raid1 of two hardware raid0 volumes from a
megaraid card.

BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
  ext3_getblk+0x98/0x2a6   md_wakeup_thread+0x26/0x2a
  ext3_bread+0x1f/0x88   ext3_quota_read+0x136/0x1ae
  v1_read_dqblk+0x61/0xac   dquot_acquire+0xf6/0x107
  ext3_acquire_dquot+0x46/0x68   dqget+0x155/0x1e7
  dquot_transfer+0x3e0/0x3e9   dput+0x23/0x13e
  ext3_setattr+0xc3/0x240   current_fs_time+0x52/0x6a
  notify_change+0x2bd/0x30d   chown_common+0x9c/0xc5
  strncpy_from_user+0x3b/0x68   do_path_lookup+0xdf/0x266
  __user_walk_fd+0x44/0x5a   sys_chown+0x4a/0x55
  vfs_write+0xe7/0x13c   sys_mkdir+0x1f/0x23
  syscall_call+0x7/0xb 
  
I think its a bogus warning. 


ext3_getblk() is calling ext3_get_blocks_handle() to map "1" block for
read. But for *some* reason ext3_get_blocks_handle() more than 1 block.
ext3_get_blocks_handle() return "positive #of blocks" is a valid case.
So needs to be fixed.



I'm having a hard time figuring out exactly what ext3_get_blocks_handle
is trying to return, but it looks to me like if it is allocating one
data block, and needs to allocate an indirect block as well, then it
will return 2 rather than 1.  Is this expected, or am I just confused?

  


It would return "1" in that case.. (data block)

> 0 means get_block() suceeded and indicates the number of blocks mapped
= 0 lookup failed
< 0 mean error case


I did search for callers of ext3_get_blocks_handle() and found that
ext3_readdir() seems to do wrong thing all the time with error check :(
Need to take a closer look..

err = ext3_get_blocks_handle(NULL, inode, blk, 1,
&map_bh, 0, 0);
if (err > 0) {  <<<< BAD
  page_cache_readahead(sb->s_bdev->bd_inode->i_mapping,
&filp->f_ra,
filp,
map_bh.b_blocknr >>
(PAGE_CACHE_SHIFT - inode->i_blkbits),
1);
bh = ext3_bread(NULL, inode, blk, 0, &err);
   }



Bad to do this what it's doing, or bad to call name the variable "err"?
I think if it looked like this:

count = ext3_get_blocks_handle(NULL, inode, blk, 1,
&map_bh, 0, 0);
if (count > 0) { 


it would be a lot less confusing.
  

I am sorry !! it is doing the right thing :(


Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()

2006-09-05 Thread Badari Pulavarty

On Tue, 2006-09-05 at 13:10 -0400, Will Simoneau wrote:
> Has anyone seen this before? These three traces occured at different times
> today when three new user accounts (and associated quotas) were created. This
> machine is an NFS server which uses quotas on an ext3 fs (dir_index is on).
> Kernel is 2.6.17.11 on an x86 smp w/64G highmem; 4G ram is installed. The
> affected filesystem is on a software raid1 of two hardware raid0 volumes from 
> a
> megaraid card.
> 
> BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
>   ext3_getblk+0x98/0x2a6   md_wakeup_thread+0x26/0x2a
>   ext3_bread+0x1f/0x88   ext3_quota_read+0x136/0x1ae
>   v1_read_dqblk+0x61/0xac   dquot_acquire+0xf6/0x107
>   ext3_acquire_dquot+0x46/0x68   dqget+0x155/0x1e7
>   dquot_transfer+0x3e0/0x3e9   dput+0x23/0x13e
>   ext3_setattr+0xc3/0x240   current_fs_time+0x52/0x6a
>   notify_change+0x2bd/0x30d   chown_common+0x9c/0xc5
>   strncpy_from_user+0x3b/0x68   do_path_lookup+0xdf/0x266
>   __user_walk_fd+0x44/0x5a   sys_chown+0x4a/0x55
>   vfs_write+0xe7/0x13c   sys_mkdir+0x1f/0x23
>   syscall_call+0x7/0xb 

I think its a bogus warning. 

ext3_getblk() is calling ext3_get_blocks_handle() to map "1" block for
read. But for *some* reason ext3_get_blocks_handle() more than 1 block.
ext3_get_blocks_handle() return "positive #of blocks" is a valid case.
So needs to be fixed.

I did search for callers of ext3_get_blocks_handle() and found that
ext3_readdir() seems to do wrong thing all the time with error check :(
Need to take a closer look..

err = ext3_get_blocks_handle(NULL, inode, blk, 1,
&map_bh, 0, 0);
if (err > 0) {   BAD
  page_cache_readahead(sb->s_bdev->bd_inode->i_mapping,
&filp->f_ra,
filp,
map_bh.b_blocknr >>
(PAGE_CACHE_SHIFT - inode->i_blkbits),
1);
bh = ext3_bread(NULL, inode, blk, 0, &err);
   }



Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-05 Thread Badari Pulavarty

On Fri, 2006-09-01 at 10:18 -0700, Andrew Morton wrote:
> On Fri, 01 Sep 2006 09:32:22 -0700
> Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> 
> > > > Kernel BUG at fs/buffer.c:2791
> > > > invalid opcode:  [1] SMP
> > > > 
> > > > Its complaining about BUG_ON(!buffer_mapped(bh)).
> 
> I need to have a little think about this, remember what _should_ be
> happening in this situation.
> 
> We (mainly I) used to do a huge amount of fsx-linux testing on 1k blocksize
> filesystems.  We've done something to make this start happening.  Part of
> resolving this bug will be working out what that was.
> 

It took a while to track this down. 2.6.13 is the last kernel
which runs my fsx tests fine (72+ hours).

Here is the change that seems to cause the problem. Jana Kara
introduced a new mode "SWRITE" for ll_rw_block() - where it
waits for buffer to be unlocked (WRITE will skip locked
buffers) + jbd journal_commit_transaction() has been changed
to make use of SWRITE.

http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112109788702895&w=2

Theoritically same race (between journal_commit_transaction() and
journal_unmap_buffer()+set_page_dirty()) could exist before his changes
- which should trigger bug in submit_bh(). But I can't seem to hit
it without his changes. My guess is ll_rw_block() is always skipping
those cleaned up buffers, before page gets dirtied again ..

Andrew, what should we do ? Do you suggest handling this in jbd
itself (like this patch) ?

Thanks,
Badari

Patch to fix: Kernel BUG at fs/buffer.c:2791
on 1k (2k) filesystems while running fsx.

journal_commit_transaction collects lots of dirty buffer from
and does a single ll_rw_block() to write them out. ll_rw_block()
locks the buffer and checks to see if they are dirty and submits
them for IO.

In the mean while, journal_unmap_buffers() as part of
truncate can unmap the buffer and throw it away. Since its
a 1k (2k) filesystem - each page (4k) will have more than
one buffer_head attached to the page and and we can't free 
up buffer_heads attached to the page (if we are not
invalidating the whole page).

Now, any call to set_page_dirty() (like msync_interval)
could end up setting all the buffer heads attached to
this page again dirty, including the ones those got
cleaned up :(

So, -not-so-elegant fix would be to have make jbd skip all 
the buffers that are not mapped.

Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
---
 fs/jbd/commit.c |   32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

Index: linux-2.6.18-rc5/fs/jbd/commit.c
===
--- linux-2.6.18-rc5.orig/fs/jbd/commit.c   2006-08-27 20:41:48.0 
-0700
+++ linux-2.6.18-rc5/fs/jbd/commit.c2006-09-01 10:43:54.0 -0700
@@ -160,6 +160,34 @@ static int journal_write_commit_record(j
return (ret == -EIO);
 }
 
+static void jbd_write_buffers(int nr, struct buffer_head *bhs[])
+{
+   int i;
+
+   for (i = 0; i < nr; i++) {
+   struct buffer_head *bh = bhs[i];
+
+   lock_buffer(bh);
+
+   /*
+* case 1: Buffer could have been flushed by now,
+* if so nothing to do for us.
+* case 2: Buffer could have been unmapped up by
+* journal_unmap_buffer - but still attached to the
+* page. Any calls to set_page_dirty() would dirty
+* the buffer even though its not mapped.  If so,
+* we need to skip them.
+*/
+   if (buffer_mapped(bh) && test_clear_buffer_dirty(bh)) {
+   bh->b_end_io = end_buffer_write_sync;
+   get_bh(bh);
+   submit_bh(WRITE, bh);
+   continue;
+   }
+   unlock_buffer(bh);
+   }
+}
+
 /*
  * journal_commit_transaction
  *
@@ -356,7 +384,7 @@ write_out_data:
jbd_debug(2, "submit %d writes\n",
bufs);
spin_unlock(&journal->j_list_lock);
-   ll_rw_block(SWRITE, bufs, wbuf);
+   jbd_write_buffers(bufs, wbuf);
journal_brelse_array(wbuf, bufs);
bufs = 0;
goto write_out_data;
@@ -379,7 +407,7 @@ write_out_data:
 
if (bufs) {
spin_unlock(&journal->j_list_lock);
-   ll_rw_block(SWRITE, bufs, wbuf);
+   jbd_write_buffers(bufs, wbuf);
journal_brelse_array(wbuf, bufs);
spin_lock(&journal->j_list_lock);
}


-
T

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-01 Thread Badari Pulavarty


Andrew Morton wrote:

On Fri, 01 Sep 2006 09:32:22 -0700
Badari Pulavarty <[EMAIL PROTECTED]> wrote:

  

Kernel BUG at fs/buffer.c:2791
invalid opcode:  [1] SMP

Its complaining about BUG_ON(!buffer_mapped(bh)).



I need to have a little think about this, remember what _should_ be
happening in this situation.

We (mainly I) used to do a huge amount of fsx-linux testing on 1k blocksize
filesystems.  We've done something to make this start happening.  Part of
resolving this bug will be working out what that was.
  


Here is the progress in tracking this down so far.

I am able to reproduce the problem on following kernel versions.

2.6.18-rc5
2.6.18-rc4
2.6.17.11
2.6.16.28
2.6.15.7
2.6.14.7

I am yet to find a latest kernel version - where this works :(
I am going to try older versions of the kernel.

Thanks,
Badari


--
VGER BF report: H 3.60822e-15
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-01 Thread Badari Pulavarty

On Fri, 2006-09-01 at 10:18 -0700, Andrew Morton wrote:
> On Fri, 01 Sep 2006 09:32:22 -0700
> Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> 
> > > > Kernel BUG at fs/buffer.c:2791
> > > > invalid opcode:  [1] SMP
> > > > 
> > > > Its complaining about BUG_ON(!buffer_mapped(bh)).
> 
> I need to have a little think about this, remember what _should_ be
> happening in this situation.

Agreed. I used to run fsx on regular basis - never saw the problem
before. I will try to go back few kernel versions and verify.

> 
> We (mainly I) used to do a huge amount of fsx-linux testing on 1k blocksize
> filesystems.  We've done something to make this start happening.  Part of
> resolving this bug will be working out what that was.
> 

Here is the other fix, I did to avoid the problem (basically, have
jbd manage its own submit function and skip unmapped buffers) - not
elegant. Thats why I tried to address at the root cause..

Thanks,
Badari

Patch to fix: Kernel BUG at fs/buffer.c:2791
on 1k (2k) filesystems while running fsx.

journal_commit_transaction collects lots of dirty buffer from
and does a single ll_rw_block() to write them out. ll_rw_block()
locks the buffer and checks to see if they are dirty and submits
them for IO.

In the mean while, journal_unmap_buffers() as part of
truncate can unmap the buffer and throw it away. Since its
a 1k (2k) filesystem - each page (4k) will have more than
one buffer_head attached to the page and and we can't free 
up buffer_heads attached to the page (if we are not
invalidating the whole page).

Now, any call to set_page_dirty() (like msync_interval)
could end up setting all the buffer heads attached to
this page again dirty, including the ones those got
cleaned up :(

So, -not-so-elegant fix would be to have make jbd skip all 
the buffers that are not mapped.

Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
---
 fs/jbd/commit.c |   32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

Index: linux-2.6.18-rc5/fs/jbd/commit.c
===
--- linux-2.6.18-rc5.orig/fs/jbd/commit.c   2006-08-27 20:41:48.0 
-0700
+++ linux-2.6.18-rc5/fs/jbd/commit.c2006-09-01 10:36:16.0 -0700
@@ -160,6 +160,34 @@ static int journal_write_commit_record(j
return (ret == -EIO);
 }
 
+static void jbd_write_buffers(int nr, struct buffer_head *bhs[])
+{
+   int i;
+
+   for (i = 0; i < nr; i++) {
+   struct buffer_head *bh = bhs[i];
+
+   lock_buffer(bh);
+
+   /*
+* case 1: Buffer could have been flushed by now,
+* if so nothing to do for us.
+* case 2: Buffer could have been unmapped up by 
+* journal_unmap_buffer - but still attached to the
+* page. Any calls to set_page_dirty() would dirty
+* the buffer even though its not mapped.  If so,
+* we need to skip them.
+*/
+   if (buffer_mapped(bh) && test_clear_buffer_dirty(bh)) {
+   bh->b_end_io = end_buffer_write_sync;
+   get_bh(bh);
+   submit_bh(WRITE, bh);
+   continue;
+   }
+   unlock_buffer(bh);
+   }
+}
+
 /*
  * journal_commit_transaction
  *
@@ -356,7 +384,7 @@ write_out_data:
jbd_debug(2, "submit %d writes\n",
bufs);
spin_unlock(&journal->j_list_lock);
-   ll_rw_block(SWRITE, bufs, wbuf);
+   jbd_write_buffer(bufs, wbuf);
journal_brelse_array(wbuf, bufs);
bufs = 0;
goto write_out_data;
@@ -379,7 +407,7 @@ write_out_data:
 
if (bufs) {
spin_unlock(&journal->j_list_lock);
-   ll_rw_block(SWRITE, bufs, wbuf);
+   jbd_write_buffers(bufs, wbuf);
journal_brelse_array(wbuf, bufs);
spin_lock(&journal->j_list_lock);
}


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-01 Thread Badari Pulavarty

On Fri, 2006-09-01 at 17:12 +0100, Anton Altaparmakov wrote:
> Hi,
> 
> On Fri, 1 Sep 2006, Badari Pulavarty wrote:
> > 
> > I have been running into following bug while running fsx
> > tests on 1k (ext3) filesystem all the time. 
> > 
> > --- [cut here ] - [please bite here ] -
> > Kernel BUG at fs/buffer.c:2791
> > invalid opcode:  [1] SMP
> > 
> > Its complaining about BUG_ON(!buffer_mapped(bh)).
> > 
> > It was hard to track it down, needed lots of debug - but here 
> > is the problem & fix.  Since the fix is in __set_page_buffer_dirty()
> > code - I am wondering how it would effect others :(
> 
> This will breaks NTFS and probably a lot of other file systems I would 
> think.

Well, it can happen only with fileystems with blocksize < pagesize.
So, that should limit the scope :)

> 
> For example all buffer based file systems in their writepage 
> implementations will create buffers if none are present and those will not 
> be mapped.  If for whatever reason writepage now breaks out before mapping 
> the buffers (e.g. because the buffers do not need to be written or due to 
> an error) you are left with a page containing unmapped buffers.
> 
> Then a page dirty comes in caused by a mmapped write for example.
> 
> __set_page_dirty_bufferes() runs by default and with your patch does not 
> set the unmapped buffers dirty thus they never get written out and you 
> have data corruption.

I was wondering, is it *okay* to have a dirty buffer but not mapped to
disk ? I guess, so ..

> 
> It is the caller of submit_bh() that is doing the stupidity of submitting 
> unmapped buffers for i/o or even the caller of the caller, etc...  
> Somewhere up in that chain buffers should have been mapped before being 
> submitted for i/o otherwise it is a BUG() (as correctly identified by 
> submit_bh()).
> 
> Perhaps the real fix is not to have ext3 use ll_rw_block() and instead 
> make it use submit_bh() directly and only submit buffers inside the file 
> size and/or make it map buffers before calling ll_rw_block() and if they 
> are outside the file size just clean them without submitting them...

Yeah. I considered doing it in ll_rw_block() - but then I thought
fixing it in set_page_buffer_dirty() may be a valid fix :(

Let me try other options, then.

Thanks,
Badari



-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-01 Thread Badari Pulavarty

Hi Andrew,

I have been running into following bug while running fsx
tests on 1k (ext3) filesystem all the time. 

--- [cut here ] - [please bite here ] -
Kernel BUG at fs/buffer.c:2791
invalid opcode:  [1] SMP

Its complaining about BUG_ON(!buffer_mapped(bh)).

It was hard to track it down, needed lots of debug - but here 
is the problem & fix.  Since the fix is in __set_page_buffer_dirty()
code - I am wondering how it would effect others :(

With this fix fsx tests ran for more than 16 hours (and still
running).

Please let me know, what you think.

Thanks,
Badari 

Patch to fix: Kernel BUG at fs/buffer.c:2791
on 1k (2k) filesystems while running fsx.

journal_commit_transaction collects lots of dirty buffer from
and does a single ll_rw_block() to write them out. ll_rw_block()
locks the buffer and checks to see if they are dirty and submits
them for IO.

In the mean while, journal_unmap_buffers() as part of
truncate can unmap the buffer and throw it away. Since its
a 1k (2k) filesystem - each page (4k) will have more than
one buffer_head attached to the page and and we can't free 
up buffer_heads attached to the page (if we are not
invalidating the whole page).

Now, any call to set_page_dirty() (like msync_interval)
could end up setting all the buffer heads attached to
this page again dirty, including the ones those got
cleaned up :(

If ll_rw_block() runs now and sees the dirty bit it does
submit_bh() on those buffer_heads and triggers the assert.

Fix is to check if the buffer is mapped before setting its
dirty bit in __set_page_dirty_buffers().

Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>
---
 fs/buffer.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6.18-rc5/fs/buffer.c
===
--- linux-2.6.18-rc5.orig/fs/buffer.c   2006-09-01 08:20:51.0 -0700
+++ linux-2.6.18-rc5/fs/buffer.c2006-09-01 08:41:01.0 -0700
@@ -846,7 +846,13 @@ int __set_page_dirty_buffers(struct page
struct buffer_head *bh = head;
 
do {
-   set_buffer_dirty(bh);
+   /*
+* Its possible that, not all buffers attached to
+* this page are mapped (cleaned up by truncate).
+* If so, skip them.
+*/
+   if (buffer_mapped(bh))
+   set_buffer_dirty(bh);
bh = bh->b_this_page;
} while (bh != head);
}


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

66 matches

Mail list logo