Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA
On Tue, 2007-11-06 at 00:15 +0800, Andreas Dilger wrote: > On Nov 05, 2007 08:04 -0800, Badari Pulavarty wrote: > > On Sat, 2007-11-03 at 09:36 +0800, Andreas Dilger wrote: > > > But... this implies that every user of bh->b_data needs to kmap, and I > > > don't see that in the code anywhere else. That makes me think something > > > else is going wrong here. > > > > Most cases, this is handled in ll_rw_block() code - when we submit the > > buffer head for IO. If the page is in highmem, we will end up creating > > a bounce bufer for it. > > > > In our case, JBD code is trying to look at the data to do checksum > > on it. Thats why we have to kmap() the page before looking. > > My point is that there is a LOT of code in ext[234] that dereferences > bh->b_data without kmap() (e.g. group descriptors, bitmaps, superblock, > inode tables, etc). Does that imply that something is forcing those > bh pages into lowmem, or is the journal bh page in question being > allocated in some different way that allows it to be in highmem? Yes. You are right. Its been a while since I had to deal with HIGHMEM. All the meta-data should be in LOWMEM. I asked Mingming to verify what the buffer-head is pointing to when it has HIGHMEM page. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA
On Sat, 2007-11-03 at 09:36 +0800, Andreas Dilger wrote: > On Nov 02, 2007 08:31 -0800, Badari Pulavarty wrote: > > On Fri, 2007-11-02 at 13:20 +0800, Andreas Dilger wrote: > > > On Nov 01, 2007 17:40 -0700, Mingming Cao wrote: > > > > Current journal checksumming patch failed fsstress test on NUMA. The > > > > bh->b_data passed to the crc32_be () function could be NULL pointer, > > > > which caused kernel oops immediately when running fsstress with -o > > > > journal_checksum. It is because the page is part of highmem on NUMA box. > > > > We need to kmap the page before access the bh->b_data to calculate > > > > the checksums. > > > > > > I have no objection to the patch, per-se, but I'm surprised that there > > > would ever be a buffer head pointing at a page in high memory? That > > > seems contrary to what I would expect... > > > > I was surprised to see that too while helping Mingming/Avantika track > > this issue. I was under impression that we are checksumming only > > metadata and it should be lowmem. But only "buffer_head"s are in lowmem. > > Pages that point to can be in Highmem. > > But... this implies that every user of bh->b_data needs to kmap, and I > don't see that in the code anywhere else. That makes me think something > else is going wrong here. Most cases, this is handled in ll_rw_block() code - when we submit the buffer head for IO. If the page is in highmem, we will end up creating a bounce bufer for it. In our case, JBD code is trying to look at the data to do checksum on it. Thats why we have to kmap() the page before looking. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA
On Fri, 2007-11-02 at 13:20 +0800, Andreas Dilger wrote: > On Nov 01, 2007 17:40 -0700, Mingming Cao wrote: > > Current journal checksumming patch failed fsstress test on NUMA. The > > bh->b_data passed to the crc32_be () function could be NULL pointer, > > which caused kernel oops immediately when running fsstress with -o > > journal_checksum. It is because the page is part of highmem on NUMA box. > > We need to kmap the page before access the bh->b_data to calculate > > the checksums. > > I have no objection to the patch, per-se, but I'm surprised that there > would ever be a buffer head pointing at a page in high memory? That > seems contrary to what I would expect... I was surprised to see that too while helping Mingming/Avantika track this issue. I was under impression that we are checksumming only metadata and it should be lowmem. But only "buffer_head"s are in lowmem. Pages that point to can be in Highmem. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] add init_ext4_proc() stub for when CONFIG_PROC_FS is not set
On Wed, 2007-10-10 at 15:00 -0400, Theodore Tso wrote: > On Wed, Oct 10, 2007 at 08:30:10AM -0700, Badari Pulavarty wrote: > > Its a good start. I think there are lots of proc handling routines that > > can be move into #ifdef CONFIG_PROC_FS also. > > > > All the code around ext4_mb_read_prealloc_table(), > > ext4_mb_write_prealloc_table(), MB_PROC_VALUE_READ(stats), > > MB_PROC_VALUE_WRITE(stats), .. can be ifdefed out. > > There's no need to ifdef them out; in include/proc_fs.h there are the > following convenience #define's if CONFIG_PROC_FS is not defined: Sorry. If I wasn't clear.. What I meant to say was the above routines are NOT needed if we don't define CONFIG_PROC_FS. They are supporting read/write to /proc files. We can #ifdef them out to reduce the text size. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] add init_ext4_proc() stub for when CONFIG_PROC_FS is not set
On Tue, 2007-10-09 at 20:22 -0400, Theodore Tso wrote: > On Tue, Oct 09, 2007 at 01:40:12PM -0400, Theodore Tso wrote: > > On Tue, Oct 09, 2007 at 10:03:06AM -0700, Mingming Cao wrote: > > > I guess our testing did not catch this up because we have CONFIG_PROC_FS > > > enabled all the time. mballoc needs procfs for exporting some stats info > > > and tunable paramenters to optimize/customize multiple allocation. > > > > > > We could select CONFIG_PROC_FS at kconfig when ext4dev is enabled. > > > > We definitely should be able to compile without CONFIG_PROC_FS; it's a > > major flaw in the mballoc-core.patch that it doesn't work without it. > > I will fold the following into mballoc-core.patch. It's sufficient to > allow us to compile w/o CONFIG_PROC_FS defined. > > - Ted > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index 4409c0c..2305af4 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -2854,9 +2854,11 @@ int __init init_ext4_proc(void) > if (ext4_pspace_cachep == NULL) > return -ENOMEM; > > +#ifdef CONFIG_PROC_FS > proc_root_ext4 = proc_mkdir(EXT4_ROOT, proc_root_fs); > if (proc_root_ext4 == NULL) > printk(KERN_ERR "EXT4-fs: Unable to create %s\n", EXT4_ROOT); > +#endif > > return 0; > } > @@ -2865,7 +2867,9 @@ void exit_ext4_proc(void) > { > /* XXX: synchronize_rcu(); */ > kmem_cache_destroy(ext4_pspace_cachep); > +#ifdef CONFIG_PROC_FS > remove_proc_entry(EXT4_ROOT, proc_root_fs); > +#endif > } > Its a good start. I think there are lots of proc handling routines that can be move into #ifdef CONFIG_PROC_FS also. All the code around ext4_mb_read_prealloc_table(), ext4_mb_write_prealloc_table(), MB_PROC_VALUE_READ(stats), MB_PROC_VALUE_WRITE(stats), .. can be ifdefed out. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] add init_ext4_proc() stub for when CONFIG_PROC_FS is not set
On Tue, 2007-10-09 at 10:03 -0700, Mingming Cao wrote: > On Tue, 2007-10-09 at 09:31 -0700, Badari Pulavarty wrote: > > On Tue, 2007-10-09 at 15:50 +1000, [EMAIL PROTECTED] wrote: > > > plain text document attachment (ext4-add-init_ext4_proc-stub.patch) > > > init_ext4_fs() calls init_ext4_proc() so we need a stub init_ext4_proc() > > > for the case that CONFIG_PROC_FS is not set. > > > Without the stub we get a build error: > > > > > > fs/ext4/mballoc.c: In function 'init_ext4_proc': > > > fs/ext4/mballoc.c:2837: error: 'proc_root_fs' undeclared (first use in > > > this function) > > > fs/ext4/mballoc.c:2837: error: (Each undeclared identifier is reported > > > only once > > > fs/ext4/mballoc.c:2837: error: for each function it appears in.) > > > > > > Add a stub init_ext4_proc() function that does nothing but return 0 > > > > > > Signed-off-by: Mark Nelson <[EMAIL PROTECTED]> > > > --- > > > fs/ext4/mballoc.c |7 +++ > > > 1 file changed, 7 insertions(+) > > > > > > Index: ext4/fs/ext4/mballoc.c > > > === > > > --- ext4.orig/fs/ext4/mballoc.c > > > +++ ext4/fs/ext4/mballoc.c > > > @@ -2825,6 +2825,7 @@ static int ext4_mb_destroy_per_dev_proc( > > > return 0; > > > } > > > > > > +#ifdef CONFIG_PROC_FS > > > int __init init_ext4_proc(void) > > > { > > > ext4_pspace_cachep = > > > @@ -2840,6 +2841,12 @@ int __init init_ext4_proc(void) > > > > > > return 0; > > > } > > > +#else > > > +int __init init_ext4_proc(void) > > > +{ > > > + return 0; > > > +} > > > +#endif > > > > > > void exit_ext4_proc(void) > > > { > > > > Nope. I don't think we can do this :( > > > > For example, we need to create ext4_pspace_cachep kmem cache > > for the pre-allocation to work. We can't ifdef it out. > > > > Mingming/Amit, can you take a look at this ? It looks like > > we NEED procfs support to make mballoc work. If so, we need > > to add it to the dependency. > > > > > > I guess our testing did not catch this up because we have CONFIG_PROC_FS > enabled all the time. mballoc needs procfs for exporting some stats info > and tunable paramenters to optimize/customize multiple allocation. > > We could select CONFIG_PROC_FS at kconfig when ext4dev is enabled. For testing it may be okay, but in the long run we need to make sure that ext4 doesn't depend on PROC_FS support for operation :( It would be nice to make it work without PROC_FS dependency, if you configure procfs we should get more info/tunables.. ext4_pspace_cachep creation has to be moved to somewhere else ? Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] add init_ext4_proc() stub for when CONFIG_PROC_FS is not set
On Tue, 2007-10-09 at 15:50 +1000, [EMAIL PROTECTED] wrote: > plain text document attachment (ext4-add-init_ext4_proc-stub.patch) > init_ext4_fs() calls init_ext4_proc() so we need a stub init_ext4_proc() > for the case that CONFIG_PROC_FS is not set. > Without the stub we get a build error: > > fs/ext4/mballoc.c: In function 'init_ext4_proc': > fs/ext4/mballoc.c:2837: error: 'proc_root_fs' undeclared (first use in this > function) > fs/ext4/mballoc.c:2837: error: (Each undeclared identifier is reported only > once > fs/ext4/mballoc.c:2837: error: for each function it appears in.) > > Add a stub init_ext4_proc() function that does nothing but return 0 > > Signed-off-by: Mark Nelson <[EMAIL PROTECTED]> > --- > fs/ext4/mballoc.c |7 +++ > 1 file changed, 7 insertions(+) > > Index: ext4/fs/ext4/mballoc.c > === > --- ext4.orig/fs/ext4/mballoc.c > +++ ext4/fs/ext4/mballoc.c > @@ -2825,6 +2825,7 @@ static int ext4_mb_destroy_per_dev_proc( > return 0; > } > > +#ifdef CONFIG_PROC_FS > int __init init_ext4_proc(void) > { > ext4_pspace_cachep = > @@ -2840,6 +2841,12 @@ int __init init_ext4_proc(void) > > return 0; > } > +#else > +int __init init_ext4_proc(void) > +{ > + return 0; > +} > +#endif > > void exit_ext4_proc(void) > { Nope. I don't think we can do this :( For example, we need to create ext4_pspace_cachep kmem cache for the pre-allocation to work. We can't ifdef it out. Mingming/Amit, can you take a look at this ? It looks like we NEED procfs support to make mballoc work. If so, we need to add it to the dependency. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/2] move init_ext4_proc() last and add cleanup call exit_ext4_proc()
On Tue, 2007-10-09 at 15:50 +1000, [EMAIL PROTECTED] wrote: > plain text document attachment (ext4-move-init_ext4_proc-add- > cleanup.patch) > The first problem that is addressed is that the addition of init_ext4_proc() > means that the code doesn't clean up after itself on a failure of > init_ext4_xattr(), and the second is that usually init_*_proc() should be > the last thing called, so we move it to the end and add the cleanup call to > exit_ext4_proc(). > > Signed-off-by: Mark Nelson <[EMAIL PROTECTED]> > --- > fs/ext4/super.c |9 + > 1 file changed, 5 insertions(+), 4 deletions(-) > > Index: ext4/fs/ext4/super.c > === > --- ext4.orig/fs/ext4/super.c > +++ ext4/fs/ext4/super.c > @@ -2999,10 +2999,6 @@ static int __init init_ext4_fs(void) > { > int err; > > - err = init_ext4_proc(); > - if (err) > - return err; > - > err = init_ext4_xattr(); > if (err) > return err; > @@ -3012,7 +3008,12 @@ static int __init init_ext4_fs(void) > err = register_filesystem(&ext4dev_fs_type); > if (err) > goto out; > + err = init_ext4_proc(); > + if (err) > + goto out_proc; > return 0; > +out_proc: > + exit_ext4_proc(); > out: > destroy_inodecache(); > out1: Nope. You can not call exit_ext4_proc() if there is a failure in init_ext4_proc(). If the kmem_cache_create() for ext4_pspace_cachep fails, you would end up calling kmem_cache_destory(NULL). The best way is to make init_ext4_proc() cleanup itself, in case of an error. Hmm.. we are not handling proc_mkdir(EXT4_ROOT) failures. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc9: Oops in cache_alloc_refill() mm/slab.c
On Fri, 2007-10-05 at 15:41 +0200, Valerie Clement wrote: > Badari Pulavarty wrote: > > On Thu, 2007-10-04 at 18:13 +0200, Valerie Clement wrote: > >> While running ffsb tests on my ext4 filesystem, I got an Oops in > >> cache_alloc_refill(). > >> I turned on SLAB debugging and here is the message I got: > >> > >> slab: Internal list corruption detected in cache 'buffer_head'(30), > >> slabp 81007e100100(1515870810). Hexdump: > > > > slabp->inuse = 1515870810 looks bogus. Is this easily reproducible ? > > Hi Badari, > Thanks for your answer. > I didn't reproduce it without the latest ext4 patches. So I suspect a > bug in one of them. > But how debugging this? > Which other debug traces can I turn on? Let me understand. You applied latest ext4 patchsets ? If so, Mingming has some slab-cleanup changes in the patchset. You can try backing them out and see. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel Oops in ext3 code
On Fri, 2007-09-28 at 06:54 +0200, Norbert Preining wrote: > Hi Mingming, > > On Do, 27 Sep 2007, Mingming Cao wrote: > > Could you please sent the objdump of the ext4_discard_reservation > > function? It doesn't match what I see here. > > I assume you meant ext3_ I made > objdump -x -D -s super.o > (the only place where I found this function in the source code). If you > want something else, let me know, but a bit more specific. Can I do the > objdump directly from the kernel image file? > objdump -DlS balloc.o would give us ext3_discard_reservation() Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc6: hanging ext3 dbench tests
On Mon, 2007-09-24 at 13:04 -0700, Linus Torvalds wrote: > > On Mon, 24 Sep 2007, Badari Pulavarty wrote: > > > > Whats happening on my machine is .. > > > > dbench forks of 4 children and sends them a signal to start the work. > > 3 out of 4 children gets the signal and does the work. One of the child > > never gets the signal so, it waits forever in pause(). So, parent waits > > for a longtime to kill it. > > Since this *seems* to have nothing to do with the filesystem, and since it > *seems* to have been introduced between -rc3 and -rc4, I did > > gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/ > > to see what has changed. One of the commits was signal-related, and that > one doesn't look like it could possibly matter. > > The rest were scheduler-related, which doesn't surprise me. In fact, even > before I looked, my reaction to your bug report was "That sounds like an > application race condition". > > Applications shouldn't use "pause()" for waiting for a signal. It's a > fundamentally racy interface - the signal could have happened just > *before* calling pause. So it's almost always a bug to use pause(), and > any users should be fixed to use "sigsuspend()" instead, which can > atomically (and correctly) pause for a signal while the process has masked > it outside of the system call. > > Now, I took a look at the dbench sources, and I have to say that the race > looks *very* unlikely (there's quite a small window in which it does > > children[i].status = getpid(); > ** race window here ** > pause(); > > and it would require *just* the right timing so that the parent doesn't > end up doing the "sleep(1)" (which would make the window even less likely > to be hit), but there does seem to be a race condition there. And it > *could* be that you just happen to hit it on your hw setup. > > So before you do anything else, does this patch (TOTALLY UNTESTED! DONE > ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU > BAD NAMES!) make any difference? > > (patch against unmodified dbench-2.0) > > Linus > > --- > diff --git a/dbench.c b/dbench.c > index ccf5624..4be5712 100644 > --- a/dbench.c > +++ b/dbench.c > @@ -91,10 +91,15 @@ static double create_procs(int nprocs, void (*fn)(struct > child_struct * )) > > for (i=0;i if (fork() == 0) { > + sigset_t old, blocked; > + > + sigemptyset(&blocked); > + sigaddset(&blocked, SIGCONT); > + sigprocmask(SIG_BLOCK, &blocked, &old); > setbuffer(stdout, NULL, 0); > nb_setup(&children[i]); > children[i].status = getpid(); > - pause(); > + sigsuspend(&old); > fn(&children[i]); > _exit(0); > } With the modified dbench, I couldn't reproduce the problem so far. I will let it run through the night (just to be sure). For now, we can treat it as a tool/App issue :) Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc6: hanging ext3 dbench tests
On Mon, 2007-09-24 at 13:04 -0700, Linus Torvalds wrote: > > On Mon, 24 Sep 2007, Badari Pulavarty wrote: > > > > Whats happening on my machine is .. > > > > dbench forks of 4 children and sends them a signal to start the work. > > 3 out of 4 children gets the signal and does the work. One of the child > > never gets the signal so, it waits forever in pause(). So, parent waits > > for a longtime to kill it. > > Since this *seems* to have nothing to do with the filesystem, and since it > *seems* to have been introduced between -rc3 and -rc4, I did > > gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/ I was wrong. I managed to reproduce on 2.6.23-rc3, but it took a long time. But I never reproduced it on 2.6.22. Ran test for a day. > > to see what has changed. One of the commits was signal-related, and that > one doesn't look like it could possibly matter. > > The rest were scheduler-related, which doesn't surprise me. In fact, even > before I looked, my reaction to your bug report was "That sounds like an > application race condition". > > Applications shouldn't use "pause()" for waiting for a signal. It's a > fundamentally racy interface - the signal could have happened just > *before* calling pause. So it's almost always a bug to use pause(), and > any users should be fixed to use "sigsuspend()" instead, which can > atomically (and correctly) pause for a signal while the process has masked > it outside of the system call. > > Now, I took a look at the dbench sources, and I have to say that the race > looks *very* unlikely (there's quite a small window in which it does > > children[i].status = getpid(); > ** race window here ** > pause(); > > and it would require *just* the right timing so that the parent doesn't > end up doing the "sleep(1)" (which would make the window even less likely > to be hit), but there does seem to be a race condition there. And it > *could* be that you just happen to hit it on your hw setup. > > So before you do anything else, does this patch (TOTALLY UNTESTED! DONE > ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU > BAD NAMES!) make any difference? > > (patch against unmodified dbench-2.0) I am testing the updated version of dbench now. Normally, it takes 30min-1hour to reproduce the problem (when I do infinite "dbench 4"). I will post the results soon. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc6: hanging ext3 dbench tests
Hi Andy, I managed to reproduce the dbench problem. (not sure if its the same thing or not - but symptoms are same). My problem has nothing to do with ext3. I can produce it on ext2, jfs also. Whats happening on my machine is .. dbench forks of 4 children and sends them a signal to start the work. 3 out of 4 children gets the signal and does the work. One of the child never gets the signal so, it waits forever in pause(). So, parent waits for a longtime to kill it. BTW, I was trying to find out when this problem started showing up. So far, I managed to track it to 2.6.23-rc4. (2.6.23-rc3 doesn't seem to have this problem). I am going to do bi-sect and find out which patch caused this. I am using dbench-2.0 which consistently reproduces the problem on my x86-64 box. Did you find anything new with your setup ? Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: FLEX_BG Kernel support v2.
On Fri, 2007-09-21 at 09:06 -0500, Jose R. Santos wrote: > From: Jose R. Santos <[EMAIL PROTECTED]> > > ext4: FLEX_BG Kernel support v2. > > @@ -702,13 +702,15 @@ static inline int ext4_valid_inum(struct super_block > *sb, unsigned long ino) > #define EXT4_FEATURE_INCOMPAT_META_BG0x0010 > #define EXT4_FEATURE_INCOMPAT_EXTENTS0x0040 /* extents > support */ > #define EXT4_FEATURE_INCOMPAT_64BIT 0x0080 > +#define EXT4_FEATURE_INCOMPAT_FLEX_BG0x0200 Any reason why 0x100 is skipped ? Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] JBD slab cleanups
On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote: > On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote: > > jbd/jbd2: Replace slab allocations with page cache allocations > > > > From: Christoph Lameter <[EMAIL PROTECTED]> > > > > JBD should not pass slab pages down to the block layer. > > Use page allocator pages instead. This will also prepare > > JBD for the large blocksize patchset. > > > > Currently memory allocation for committed_data(and frozen_buffer) for > bufferhead is done through jbd slab management, as Christoph Hellwig > pointed out that this is broken as jbd should not pass slab pages down > to IO layer. and suggested to use get_free_pages() directly. > > The problem with this patch, as Andreas Dilger pointed today in ext4 > interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste > 1/3-1/2 page space. > > What was the originally intention to set up slabs for committed_data(and > frozen_buffer) in JBD? Why not using kmalloc? > > Mingming Looks good. Small suggestion is to get rid of all kmalloc() usages and consistently use jbd_kmalloc() or jbd2_kmalloc(). Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext2 statfs improvement for block and inode free count
On Wed, 2007-07-18 at 20:18 -0700, Andrew Morton wrote: > On Fri, 13 Jul 2007 18:36:54 -0700 Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > More statfs() improvements for ext2. ext2 already maintains > > percpu counters for free blocks and inodes. Derive free > > block count and inode count by summing up percpu counters, > > instead of counting up all the groups in the filesystem > > each time. > > > > hm, another speedup patch with no measurements which demonstrate its > benefit. In my setups (4 & 8-way), I didn't measure any significant performance improvements (in any reasonable workload). I see some decent improvements on cooked-up (1 million stats) tests :( > > > > > Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> > > Acked-by: Andreas Dilger <[EMAIL PROTECTED]> > > > > fs/ext2/super.c |4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > Index: linux-2.6.22/fs/ext2/super.c > > === > > --- linux-2.6.22.orig/fs/ext2/super.c 2007-07-13 20:06:38.0 > > -0700 > > +++ linux-2.6.22/fs/ext2/super.c2007-07-13 20:06:51.0 -0700 > > @@ -1136,12 +1136,12 @@ static int ext2_statfs (struct dentry * > > buf->f_type = EXT2_SUPER_MAGIC; > > buf->f_bsize = sb->s_blocksize; > > buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead; > > - buf->f_bfree = ext2_count_free_blocks(sb); > > + buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter); > > buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count); > > if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count)) > > buf->f_bavail = 0; > > buf->f_files = le32_to_cpu(es->s_inodes_count); > > - buf->f_ffree = ext2_count_free_inodes(sb); > > + buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter); > > buf->f_namelen = EXT2_NAME_LEN; > > fsid = le64_to_cpup((void *)es->s_uuid) ^ > >le64_to_cpup((void *)es->s_uuid + sizeof(u64)); > > > > Well there's a tradeoff here. At large CPU counts, percpu_counter_sum() > becomes quite expensive - it takes a global lock and then goes off fishing > in every CPU's percpu_alloced memory. > > So there is some value of (num_online_cpus / sb->s_groups_count) at which > this change becomes a loss. Where does that value lie? Yes. I debated long time whether I should submit this or not - due to very reason. Old code wasn't holding any locks. I don't have any high count CPU machine (>8way) with me. I will request for time on one. > > Bear in mind that the global lock in percpu_counter_sum() will tilt the > scales quite a bit. Noticed that too. I added WARN_ON() to see if percpu sum doesn't match computed sum. I saw few stacks in a 24 hour run of fsx runs. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext2 statfs improvement for block and inode free count
Andrew, Can you include it in -mm ? BTW, this patch is against mainline, won't apply cleanly to -mm, due to other statfs() improvements. Thanks, Badari More statfs() improvements for ext2. ext2 already maintains percpu counters for free blocks and inodes. Derive free block count and inode count by summing up percpu counters, instead of counting up all the groups in the filesystem each time. Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> Acked-by: Andreas Dilger <[EMAIL PROTECTED]> fs/ext2/super.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.22/fs/ext2/super.c === --- linux-2.6.22.orig/fs/ext2/super.c 2007-07-13 20:06:38.0 -0700 +++ linux-2.6.22/fs/ext2/super.c2007-07-13 20:06:51.0 -0700 @@ -1136,12 +1136,12 @@ static int ext2_statfs (struct dentry * buf->f_type = EXT2_SUPER_MAGIC; buf->f_bsize = sb->s_blocksize; buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead; - buf->f_bfree = ext2_count_free_blocks(sb); + buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter); buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count); if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count)) buf->f_bavail = 0; buf->f_files = le32_to_cpu(es->s_inodes_count); - buf->f_ffree = ext2_count_free_inodes(sb); + buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter); buf->f_namelen = EXT2_NAME_LEN; fsid = le64_to_cpup((void *)es->s_uuid) ^ le64_to_cpup((void *)es->s_uuid + sizeof(u64)); - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATHC] Fix for ext2 reservation (Re: -mm merge plans for 2.6.23)
On Tue, 2007-07-10 at 16:55 -0700, Andrew Morton wrote: > On Tue, 10 Jul 2007 15:15:57 -0700 > Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > Well, I looked at the problem now and here is the fix :) > > whee, thanks. > > > Greg, Please consider this for stable release also. > > No, it is only relevant to -mm's ext2-reservations.patch. Yes. Sorry. I was looking at -mm tree and assumed that ext2-reservation code made into mainline also :( Greg, please ignore the patch for stable. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATHC] Fix for ext2 reservation (Re: -mm merge plans for 2.6.23)
On Tue, 2007-07-10 at 12:39 -0700, Martin Bligh wrote: > Andrew Morton wrote: > > > > Begin forwarded message: > > > > Date: Tue, 10 Jul 2007 21:49:23 +0400 > > From: Alexey Dobriyan <[EMAIL PROTECTED]> > > To: Andrew Morton <[EMAIL PROTECTED]> > > Cc: [EMAIL PROTECTED], linux-ext4@vger.kernel.org > > Subject: ext2 reservations (Re: -mm merge plans for 2.6.23) > > > > > >> ext2-reservations.patch > >> > >> Still needs decent testing. > > > > Was this oops silently fixed? > > http://lkml.org/lkml/2007/3/2/138 > > 2.6.21-rc2-mm1: EIP is at ext2_discard_reservation+0x1c/0x52 > > > > I still have that ext2 partition backed up. > > Now I'm confused - I thought there was a latent issue there, and > then we went back and revisited it, and we decided there wasn't ;-( Well, I looked at the problem now and here is the fix :) Greg, Please consider this for stable release also. Thanks, Badari ext2 reservation fix - Alexey Dobriyan reported ext2 discard reservation panic while ago (http://lkml.org/lkml/2007/3/2/138). If ext2_new_inode() fails for any reason it would end up calling ext2_discard_reservation() (due to last iput). Normally, it does nothing since we don't have a reservation window structure allocated. But the NULL pointer check wouldn't work with slab poisioning, and causes oops. Fix is to initialize i_block_alloc_info to NULL in ext2_alloc_inode() code instead of assuming that it would be NULL. Same fix already exists in ext3 and ext4. Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> fs/ext2/super.c |1 + 1 file changed, 1 insertion(+) Index: linux-2.6.22/fs/ext2/super.c === --- linux-2.6.22.orig/fs/ext2/super.c 2007-07-08 16:32:17.0 -0700 +++ linux-2.6.22/fs/ext2/super.c2007-07-10 16:36:42.0 -0700 @@ -147,6 +147,7 @@ static struct inode *ext2_alloc_inode(st ei->i_acl = EXT2_ACL_NOT_CACHED; ei->i_default_acl = EXT2_ACL_NOT_CACHED; #endif + ei->i_block_alloc_info = NULL; ei->vfs_inode.i_version = 1; return &ei->vfs_inode; } - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: fallocate support for bitmap-based files
On Fri, 2007-07-06 at 14:33 -0700, Mike Waychison wrote: > Badari Pulavarty wrote: > > On Sat, 2007-06-30 at 10:13 -0400, Mingming Cao wrote: > >> On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote: > >>> Guys, Mike and Sreenivasa at google are looking into implementing > >>> fallocate() on ext2. Of course, any such implementation could and should > >>> also be portable to ext3 and ext4 bitmapped files. > >>> > >>> I believe that Sreenivasa will mainly be doing the implementation work. > >>> > >>> > >>> The basic plan is as follows: > >>> > >>> - Create (with tune2fs and mke2fs) a hidden file using one of the > >>> reserved inode numbers. That file will be sized to have one bit for > >>> each > >>> block in the partition. Let's call this the "unwritten block file". > >>> > >>> The unwritten block file will be initialised with all-zeroes > >>> > >>> - at fallocate()-time, allocate the blocks to the user's file (in some > >>> yet-to-be-determined fashion) and, for each one which is uninitialised, > >>> set its bit in the unwritten block file. The set bit means "this block > >>> is uninitialised and needs to be zeroed out on read". > >>> > >>> - truncate() would need to clear out set-bits in the unwritten blocks > >>> file. > >>> > >>> - When the fs comes to read a block from disk, it will need to consult > >>> the unwritten blocks file to see if that block should be zeroed by the > >>> CPU. > >>> > >>> - When the unwritten-block is written to, its bit in the unwritten blocks > >>> file gets zeroed. > >>> > >>> - An obvious efficiency concern: if a user file has no unwritten blocks > >>> in it, we don't need to consult the unwritten blocks file. > >>> > >>> Need to work out how to do this. An obvious solution would be to have > >>> a number-of-unwritten-blocks counter in the inode. But do we have space > >>> for that? > >>> > >>> (I expect google and others would prefer that the on-disk format be > >>> compatible with legacy ext2!) > >>> > >>> - One concern is the following scenario: > >>> > >>> - Mount fs with "new" kernel, fallocate() some blocks to a file. > >>> > >>> - Now, mount the fs under "old" kernel (which doesn't understand the > >>> unwritten blocks file). > >>> > >>> - This kernel will be able to read uninitialised data from that > >>> fallocated-to file, which is a security concern. > >>> > >>> - Now, the "old" kernel writes some data to a fallocated block. But > >>> this kernel doesn't know that it needs to clear that block's flag in > >>> the unwritten blocks file! > >>> > >>> - Now mount that fs under the "new" kernel and try to read that file. > >>> The flag for the block is set, so this kernel will still zero out the > >>> data on a read, thus corrupting the user's data > >>> > >>> So how to fix this? Perhaps with a per-inode flag indicating "this > >>> inode has unwritten blocks". But to fix this problem, we'd require that > >>> the "old" kernel clear out that flag. > >>> > >>> Can anyone propose a solution to this? > >>> > >>> Ah, I can! Use the compatibility flags in such a way as to prevent the > >>> "old" kernel from mounting this filesystem at all. To mount this fs > >>> under an "old" kernel the user will need to run some tool which will > >>> > >>> - read the unwritten blocks file > >>> > >>> - for each set-bit in the unwritten blocks file, zero out the > >>> corresponding block > >>> > >>> - zero out the unwritten blocks file > >>> > >>> - rewrite the superblock to indicate that this fs may now be mounted > >>> by an "old" kernel. > >>> > >>> Sound sane? > >>> > >>> - I'm assuming that there are more reserved inodes available, and that > >>> the changes to tune2fs and mke2fs will be basically a copy-n-paste job &g
Re: [PATCH] ext2 statfs speed up
On Thu, 2007-07-05 at 15:06 -0600, Andreas Dilger wrote: > On Jul 05, 2007 11:11 -0700, Badari Pulavarty wrote: > > @@ -1131,17 +1134,22 @@ static int ext2_statfs (struct dentry * > > buf->f_bfree = ext2_count_free_blocks(sb); > > + es->s_free_blocks_count = cpu_to_le32(buf->f_bfree); > > buf->f_ffree = ext2_count_free_inodes(sb); > > + es->s_free_inodes_count = cpu_to_le32(buf->f_ffree); > > Hmm, this is still sub-optimal. For ext3 and ext4 it just uses > percpu_counter_sum() instead of the slow ext*_count_free_blocks(), which > walks all of the groups. Not that this is a reason to hold this patch, > because at least we are removing 1/2 of the overhead for ext2. Andrea, I am wondering why we are not currently using percpu_counter_sum() for ext2 ? I see that ext2 already has all the stuff it needs. Can't I just do following ? Thanks, Badari fs/ext2/super.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.22-rc7/fs/ext2/super.c === --- linux-2.6.22-rc7.orig/fs/ext2/super.c 2007-07-05 12:35:15.0 -0700 +++ linux-2.6.22-rc7/fs/ext2/super.c2007-07-05 20:37:32.0 -0700 @@ -1142,13 +1142,13 @@ static int ext2_statfs (struct dentry * buf->f_type = EXT2_SUPER_MAGIC; buf->f_bsize = sb->s_blocksize; buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last; - buf->f_bfree = ext2_count_free_blocks(sb); + buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter); es->s_free_blocks_count = cpu_to_le32(buf->f_bfree); buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count); if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count)) buf->f_bavail = 0; buf->f_files = le32_to_cpu(es->s_inodes_count); - buf->f_ffree = ext2_count_free_inodes(sb); + buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter); es->s_free_inodes_count = cpu_to_le32(buf->f_ffree); buf->f_namelen = EXT2_NAME_LEN; fsid = le64_to_cpup((void *)es->s_uuid) ^ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4 statfs speed up
On Fri, 2007-07-06 at 02:44 +0800, coly li wrote: > Badari, > > How about the performance improvement, is there any benchmark result ? > > Coly The original patch came from Andrea Dilger - where they use this patch in their luster code. Andrea, do you have perf. numbers ? BTW, I am doing perf. analysis too. I will share the results. Thanks, Badari > > > 在 2007-07-05四的 11:11 -0700,Badari Pulavarty写道: > > This is a patch that speeds up statfs. It is very simple - > > the "overhead" calculation, which takes a huge amount of time > > for large filesystems, never changes unless the size of the > > filesystem itself changes. That means we can store it in memory > > and only recalculate if the filesystem has been resized (almost > > never). > > > > It also fixes a minor problem that we never update the on-disk > > superblock free blocks/inodes counts until the filesystem is > > unmounted. While not fatal, we may as well update that on disk > > when we have the information, and it makes things like debugfs > > and dumpe2fs report a bit more accurate info. > > > > Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> > > Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]> > > > > fs/ext4/super.c| 25 +++-- > > include/linux/ext4_fs_sb.h |2 ++ > > 2 files changed, 17 insertions(+), 10 deletions(-) > > > > Index: linux-2.6.22-rc7/include/linux/ext4_fs_sb.h > > === > > --- linux-2.6.22-rc7.orig/include/linux/ext4_fs_sb.h2007-07-01 > > 12:54:24.0 -0700 > > +++ linux-2.6.22-rc7/include/linux/ext4_fs_sb.h 2007-07-04 > > 22:09:59.0 -0700 > > @@ -39,6 +39,8 @@ struct ext4_sb_info { > > unsigned long s_gdb_count; /* Number of group descriptor blocks */ > > unsigned long s_desc_per_block; /* Number of group descriptors per > > block */ > > unsigned long s_groups_count; /* Number of groups in the fs */ > > + unsigned long s_overhead_last; /* Last calculated overhead */ > > + unsigned long s_blocks_last;/* Last seen block count */ > > struct buffer_head * s_sbh; /* Buffer containing the super block */ > > struct ext4_super_block * s_es; /* Pointer to the super block in the > > buffer */ > > struct buffer_head ** s_group_desc; > > Index: linux-2.6.22-rc7/fs/ext4/super.c > > === > > --- linux-2.6.22-rc7.orig/fs/ext4/super.c 2007-07-01 12:54:24.0 > > -0700 > > +++ linux-2.6.22-rc7/fs/ext4/super.c2007-07-04 22:12:27.0 > > -0700 > > @@ -2481,19 +2481,19 @@ static int ext4_statfs (struct dentry * > > struct super_block *sb = dentry->d_sb; > > struct ext4_sb_info *sbi = EXT4_SB(sb); > > struct ext4_super_block *es = sbi->s_es; > > - ext4_fsblk_t overhead; > > - int i; > > u64 fsid; > > > > - if (test_opt (sb, MINIX_DF)) > > - overhead = 0; > > - else { > > - unsigned long ngroups; > > - ngroups = EXT4_SB(sb)->s_groups_count; > > + if (test_opt(sb, MINIX_DF)) { > > + sbi->s_overhead_last = 0; > > + } else if (sbi->s_blocks_last != le32_to_cpu(es->s_blocks_count)) { > > + unsigned long ngroups = sbi->s_groups_count, i; > > + ext4_fsblk_t overhead = 0; > > smp_rmb(); > > > > /* > > -* Compute the overhead (FS structures) > > +* Compute the overhead (FS structures). This is constant > > +* for a given filesystem unless the number of block groups > > +* changes so we cache the previous value until it does. > > */ > > > > /* > > @@ -2517,18 +2517,23 @@ static int ext4_statfs (struct dentry * > > * Every block group has an inode bitmap, a block > > * bitmap, and an inode table. > > */ > > - overhead += (ngroups * (2 + EXT4_SB(sb)->s_itb_per_group)); > > + overhead += ngroups * (2 + sbi->s_itb_per_group); > > + sbi->s_overhead_last = overhead; > > + smp_wmb(); > > + sbi->s_blocks_last = le32_to_cpu(es->s_blocks_count); > > } > > > > buf->f_type = EXT4_SUPER_MAGIC; > > buf->f_bsize = sb->s_blocksize; > > - buf->f_blocks = ext4_blocks_count(es) - overhead; > >
[PATCH] ext3 statfs speed up
This is a patch that speeds up statfs. It is very simple - the "overhead" calculation, which takes a huge amount of time for large filesystems, never changes unless the size of the filesystem itself changes. That means we can store it in memory and only recalculate if the filesystem has been resized (almost never). It also fixes a minor problem that we never update the on-disk superblock free blocks/inodes counts until the filesystem is unmounted. While not fatal, we may as well update that on disk when we have the information, and it makes things like debugfs and dumpe2fs report a bit more accurate info. Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]> fs/ext3/super.c| 25 +++-- include/linux/ext3_fs_sb.h |2 ++ 2 files changed, 17 insertions(+), 10 deletions(-) Index: linux-2.6.22-rc7/include/linux/ext3_fs_sb.h === --- linux-2.6.22-rc7.orig/include/linux/ext3_fs_sb.h2007-07-01 12:54:24.0 -0700 +++ linux-2.6.22-rc7/include/linux/ext3_fs_sb.h 2007-07-04 21:59:28.0 -0700 @@ -38,6 +38,8 @@ struct ext3_sb_info { unsigned long s_gdb_count; /* Number of group descriptor blocks */ unsigned long s_desc_per_block; /* Number of group descriptors per block */ unsigned long s_groups_count; /* Number of groups in the fs */ + unsigned long s_overhead_last; /* Last calculated overhead */ + unsigned long s_blocks_last;/* Last seen block count */ struct buffer_head * s_sbh; /* Buffer containing the super block */ struct ext3_super_block * s_es; /* Pointer to the super block in the buffer */ struct buffer_head ** s_group_desc; Index: linux-2.6.22-rc7/fs/ext3/super.c === --- linux-2.6.22-rc7.orig/fs/ext3/super.c 2007-07-01 12:54:24.0 -0700 +++ linux-2.6.22-rc7/fs/ext3/super.c2007-07-04 21:59:28.0 -0700 @@ -2406,19 +2406,19 @@ static int ext3_statfs (struct dentry * struct super_block *sb = dentry->d_sb; struct ext3_sb_info *sbi = EXT3_SB(sb); struct ext3_super_block *es = sbi->s_es; - ext3_fsblk_t overhead; - int i; u64 fsid; - if (test_opt (sb, MINIX_DF)) - overhead = 0; - else { - unsigned long ngroups; - ngroups = EXT3_SB(sb)->s_groups_count; + if (test_opt(sb, MINIX_DF)) { + sbi->s_overhead_last = 0; + } else if (sbi->s_blocks_last != le32_to_cpu(es->s_blocks_count)) { + unsigned long ngroups = sbi->s_groups_count, i; + ext3_fsblk_t overhead = 0; smp_rmb(); /* -* Compute the overhead (FS structures) +* Compute the overhead (FS structures). This is constant +* for a given filesystem unless the number of block groups +* changes so we cache the previous value until it does. */ /* @@ -2442,18 +2442,23 @@ static int ext3_statfs (struct dentry * * Every block group has an inode bitmap, a block * bitmap, and an inode table. */ - overhead += (ngroups * (2 + EXT3_SB(sb)->s_itb_per_group)); + overhead += ngroups * (2 + sbi->s_itb_per_group); + sbi->s_overhead_last = overhead; + smp_wmb(); + sbi->s_blocks_last = le32_to_cpu(es->s_blocks_count); } buf->f_type = EXT3_SUPER_MAGIC; buf->f_bsize = sb->s_blocksize; - buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead; + buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last; buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter); + es->s_free_blocks_count = cpu_to_le32(buf->f_bfree); buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count); if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count)) buf->f_bavail = 0; buf->f_files = le32_to_cpu(es->s_inodes_count); buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter); + es->s_free_inodes_count = cpu_to_le32(buf->f_ffree); buf->f_namelen = EXT3_NAME_LEN; fsid = le64_to_cpup((void *)es->s_uuid) ^ le64_to_cpup((void *)es->s_uuid + sizeof(u64)); - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext2 statfs speed up
This is a patch that speeds up statfs. It is very simple - the "overhead" calculation, which takes a huge amount of time for large filesystems, never changes unless the size of the filesystem itself changes. That means we can store it in memory and only recalculate if the filesystem has been resized (almost never). It also fixes a minor problem that we never update the on-disk superblock free blocks/inodes counts until the filesystem is unmounted. While not fatal, we may as well update that on disk when we have the information, and it makes things like debugfs and dumpe2fs report a bit more accurate info. Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]> fs/ext2/super.c| 20 ++-- include/linux/ext2_fs_sb.h |2 ++ 2 files changed, 16 insertions(+), 6 deletions(-) Index: linux-2.6.22-rc7/include/linux/ext2_fs_sb.h === --- linux-2.6.22-rc7.orig/include/linux/ext2_fs_sb.h2007-07-01 12:54:24.0 -0700 +++ linux-2.6.22-rc7/include/linux/ext2_fs_sb.h 2007-07-04 22:02:01.0 -0700 @@ -33,6 +33,8 @@ struct ext2_sb_info { unsigned long s_gdb_count; /* Number of group descriptor blocks */ unsigned long s_desc_per_block; /* Number of group descriptors per block */ unsigned long s_groups_count; /* Number of groups in the fs */ + unsigned long s_overhead_last; /* Last calculated overhead */ + unsigned long s_blocks_last;/* Last seen block count */ struct buffer_head * s_sbh; /* Buffer containing the super block */ struct ext2_super_block * s_es; /* Pointer to the super block in the buffer */ struct buffer_head ** s_group_desc; Index: linux-2.6.22-rc7/fs/ext2/super.c === --- linux-2.6.22-rc7.orig/fs/ext2/super.c 2007-07-01 12:54:24.0 -0700 +++ linux-2.6.22-rc7/fs/ext2/super.c2007-07-05 12:29:03.0 -0700 @@ -1099,15 +1099,18 @@ static int ext2_statfs (struct dentry * struct super_block *sb = dentry->d_sb; struct ext2_sb_info *sbi = EXT2_SB(sb); struct ext2_super_block *es = sbi->s_es; - unsigned long overhead; - int i; u64 fsid; if (test_opt (sb, MINIX_DF)) - overhead = 0; - else { + sbi->s_overhead_last = 0; + else if (sbi->s_blocks_last != le32_to_cpu(es->s_blocks_count)) { + unsigned long i, overhead = 0; + smp_rmb(); + /* -* Compute the overhead (FS structures) +* Compute the overhead (FS structures). This is constant +* for a given filesystem unless the number of block groups +* changes so we cache the previous value until it does. */ /* @@ -1131,17 +1134,22 @@ static int ext2_statfs (struct dentry * */ overhead += (sbi->s_groups_count * (2 + sbi->s_itb_per_group)); + sbi->s_overhead_last = overhead; + smp_wmb(); + sbi->s_blocks_last = le32_to_cpu(es->s_blocks_count); } buf->f_type = EXT2_SUPER_MAGIC; buf->f_bsize = sb->s_blocksize; - buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead; + buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last; buf->f_bfree = ext2_count_free_blocks(sb); + es->s_free_blocks_count = cpu_to_le32(buf->f_bfree); buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count); if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count)) buf->f_bavail = 0; buf->f_files = le32_to_cpu(es->s_inodes_count); buf->f_ffree = ext2_count_free_inodes(sb); + es->s_free_inodes_count = cpu_to_le32(buf->f_ffree); buf->f_namelen = EXT2_NAME_LEN; fsid = le64_to_cpup((void *)es->s_uuid) ^ le64_to_cpup((void *)es->s_uuid + sizeof(u64)); - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4 statfs speed up
This is a patch that speeds up statfs. It is very simple - the "overhead" calculation, which takes a huge amount of time for large filesystems, never changes unless the size of the filesystem itself changes. That means we can store it in memory and only recalculate if the filesystem has been resized (almost never). It also fixes a minor problem that we never update the on-disk superblock free blocks/inodes counts until the filesystem is unmounted. While not fatal, we may as well update that on disk when we have the information, and it makes things like debugfs and dumpe2fs report a bit more accurate info. Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]> fs/ext4/super.c| 25 +++-- include/linux/ext4_fs_sb.h |2 ++ 2 files changed, 17 insertions(+), 10 deletions(-) Index: linux-2.6.22-rc7/include/linux/ext4_fs_sb.h === --- linux-2.6.22-rc7.orig/include/linux/ext4_fs_sb.h2007-07-01 12:54:24.0 -0700 +++ linux-2.6.22-rc7/include/linux/ext4_fs_sb.h 2007-07-04 22:09:59.0 -0700 @@ -39,6 +39,8 @@ struct ext4_sb_info { unsigned long s_gdb_count; /* Number of group descriptor blocks */ unsigned long s_desc_per_block; /* Number of group descriptors per block */ unsigned long s_groups_count; /* Number of groups in the fs */ + unsigned long s_overhead_last; /* Last calculated overhead */ + unsigned long s_blocks_last;/* Last seen block count */ struct buffer_head * s_sbh; /* Buffer containing the super block */ struct ext4_super_block * s_es; /* Pointer to the super block in the buffer */ struct buffer_head ** s_group_desc; Index: linux-2.6.22-rc7/fs/ext4/super.c === --- linux-2.6.22-rc7.orig/fs/ext4/super.c 2007-07-01 12:54:24.0 -0700 +++ linux-2.6.22-rc7/fs/ext4/super.c2007-07-04 22:12:27.0 -0700 @@ -2481,19 +2481,19 @@ static int ext4_statfs (struct dentry * struct super_block *sb = dentry->d_sb; struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_super_block *es = sbi->s_es; - ext4_fsblk_t overhead; - int i; u64 fsid; - if (test_opt (sb, MINIX_DF)) - overhead = 0; - else { - unsigned long ngroups; - ngroups = EXT4_SB(sb)->s_groups_count; + if (test_opt(sb, MINIX_DF)) { + sbi->s_overhead_last = 0; + } else if (sbi->s_blocks_last != le32_to_cpu(es->s_blocks_count)) { + unsigned long ngroups = sbi->s_groups_count, i; + ext4_fsblk_t overhead = 0; smp_rmb(); /* -* Compute the overhead (FS structures) +* Compute the overhead (FS structures). This is constant +* for a given filesystem unless the number of block groups +* changes so we cache the previous value until it does. */ /* @@ -2517,18 +2517,23 @@ static int ext4_statfs (struct dentry * * Every block group has an inode bitmap, a block * bitmap, and an inode table. */ - overhead += (ngroups * (2 + EXT4_SB(sb)->s_itb_per_group)); + overhead += ngroups * (2 + sbi->s_itb_per_group); + sbi->s_overhead_last = overhead; + smp_wmb(); + sbi->s_blocks_last = le32_to_cpu(es->s_blocks_count); } buf->f_type = EXT4_SUPER_MAGIC; buf->f_bsize = sb->s_blocksize; - buf->f_blocks = ext4_blocks_count(es) - overhead; + buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last; buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter); + es->s_free_blocks_count = cpu_to_le32(buf->f_bfree); buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es); if (buf->f_bfree < ext4_r_blocks_count(es)) buf->f_bavail = 0; buf->f_files = le32_to_cpu(es->s_inodes_count); buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter); + es->s_free_inodes_count = cpu_to_le32(buf->f_ffree); buf->f_namelen = EXT4_NAME_LEN; fsid = le64_to_cpup((void *)es->s_uuid) ^ le64_to_cpup((void *)es->s_uuid + sizeof(u64)); - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] speed up statfs
On Wed, 2007-07-04 at 16:23 -0600, Andreas Dilger wrote: > Hi, > this is a patch we are currently using that speeds up statfs. > It is very simple - the "overhead" calculation, which takes a > huge amount of time for large filesystems, never changes unless > the size of the filesystem itself changes. That means we can > store it in memory and only recalculate if the filesystem has > been resized (almost never). > > It is based on ext3 but is trivally updated to ext4. It also > fixes a minor problem that we never update the on-disk superblock > free blocks/inodes counts until the filesystem is unmounted. > While not fatal, we may as well update that on disk when we have > the information, and it makes things like debugfs and dumpe2fs > report a bit more accurate info. > > I'd be happy if someone could update this to the latest kernel and > for ext2 and ext4 also. > I forward ported it to latest + did the same for ext2 and ext4 also. Compiles fine. I will post them tomorrow after little testing. Thanks, Badaru - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: fallocate support for bitmap-based files
On Sat, 2007-06-30 at 10:13 -0400, Mingming Cao wrote: > On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote: > > Guys, Mike and Sreenivasa at google are looking into implementing > > fallocate() on ext2. Of course, any such implementation could and should > > also be portable to ext3 and ext4 bitmapped files. > > > > I believe that Sreenivasa will mainly be doing the implementation work. > > > > > > The basic plan is as follows: > > > > - Create (with tune2fs and mke2fs) a hidden file using one of the > > reserved inode numbers. That file will be sized to have one bit for each > > block in the partition. Let's call this the "unwritten block file". > > > > The unwritten block file will be initialised with all-zeroes > > > > - at fallocate()-time, allocate the blocks to the user's file (in some > > yet-to-be-determined fashion) and, for each one which is uninitialised, > > set its bit in the unwritten block file. The set bit means "this block > > is uninitialised and needs to be zeroed out on read". > > > > - truncate() would need to clear out set-bits in the unwritten blocks file. > > > > - When the fs comes to read a block from disk, it will need to consult > > the unwritten blocks file to see if that block should be zeroed by the > > CPU. > > > > - When the unwritten-block is written to, its bit in the unwritten blocks > > file gets zeroed. > > > > - An obvious efficiency concern: if a user file has no unwritten blocks > > in it, we don't need to consult the unwritten blocks file. > > > > Need to work out how to do this. An obvious solution would be to have > > a number-of-unwritten-blocks counter in the inode. But do we have space > > for that? > > > > (I expect google and others would prefer that the on-disk format be > > compatible with legacy ext2!) > > > > - One concern is the following scenario: > > > > - Mount fs with "new" kernel, fallocate() some blocks to a file. > > > > - Now, mount the fs under "old" kernel (which doesn't understand the > > unwritten blocks file). > > > > - This kernel will be able to read uninitialised data from that > > fallocated-to file, which is a security concern. > > > > - Now, the "old" kernel writes some data to a fallocated block. But > > this kernel doesn't know that it needs to clear that block's flag in > > the unwritten blocks file! > > > > - Now mount that fs under the "new" kernel and try to read that file. > > The flag for the block is set, so this kernel will still zero out the > > data on a read, thus corrupting the user's data > > > > So how to fix this? Perhaps with a per-inode flag indicating "this > > inode has unwritten blocks". But to fix this problem, we'd require that > > the "old" kernel clear out that flag. > > > > Can anyone propose a solution to this? > > > > Ah, I can! Use the compatibility flags in such a way as to prevent the > > "old" kernel from mounting this filesystem at all. To mount this fs > > under an "old" kernel the user will need to run some tool which will > > > > - read the unwritten blocks file > > > > - for each set-bit in the unwritten blocks file, zero out the > > corresponding block > > > > - zero out the unwritten blocks file > > > > - rewrite the superblock to indicate that this fs may now be mounted > > by an "old" kernel. > > > > Sound sane? > > > > - I'm assuming that there are more reserved inodes available, and that > > the changes to tune2fs and mke2fs will be basically a copy-n-paste job > > from the `tune2fs -j' code. Correct? > > > > - I haven't thought about what fsck changes would be needed. > > > > Presumably quite a few. For example, fsck should check that set-bits > > in the unwriten blobks file do not correspond to freed blocks. If they > > do, that should be fixed up. > > > > And fsck can check each inodes number-of-unwritten-blocks counters > > against the unwritten blocks file (if we implement the per-inode > > number-of-unwritten-blocks counter) > > > > What else should fsck do? > > > > - I haven't thought about the implications of porting this into ext3/4. > > Probably the commit to the unwritten blocks file will need to be atomic > > with the commit to the user's file's metadata, so the unwritten-blocks > > file will effectively need to be in journalled-data mode. > > > > Or, more likely, we access the unwritten blocks file via the blockdev > > pagecache (ie: use bmap, like the journal file) and then we're just > > talking direct to the disk's blocks and it becomes just more fs metadata. > > > > - I guess resize2fs will need to be taught about the unwritten blocks > > file: to shrink and grow it appropriately. > > > > > > That's all I can think of for now - I probably missed something. > > > > Suggestions and thought are sought, please. > > > > > > Another approach we have been thinking is using a backing > inode
Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote: .. > > > 5) ext3_write_end: > > Before write_begin/write_end patch set we have folowing locking > > order: > > stop_journal(handle); > > unlock_page(page); > > But now order is oposite: > > unlock_page(page); > > stop_journal(handle); > > Can we got any race condition now? I'm not sure is it actual problem, > > may be somebody cant describe this. > > Can we just change it to the original order? That would seem to be > safest unless one of the ext3 devs explicitly acks it. It would be nice to go back to original order, but its not that simple with current structure of the code. With Nick's patches unlock_page() happens in generic_write_end(). journal_stop() needs to happen after generic_write_end(). :( Mingming, can you take a look at the current & proposed order ? I ran into bunch of races when I tried to change the order for ->writepages() support earlier :( Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fix section conflict of ext4_ext_{find_goal,invalidate_cache}
On Tue, 2007-05-15 at 19:15 +0200, Martin Michlmayr wrote: > Building with GCC 4.2, I get the following error: > > CC [M] fs/ext4/extents.o > fs/ext4/extents.c:2166: error: __ksymtab_ext4_ext_find_goal causes a section > type conflict > fs/ext4/extents.c:2163: error: __ksymtab_ext4_ext_invalidate_cache causes a > section type conflict > > This is because ext4_ext_find_goal and ext4_ext_invalidate_cache are > declared static but also exported. Hmm.. Why are these exported ? Looking at the code EXPORT_SYMBOL(ext4_ext_invalidate_cache); EXPORT_SYMBOL(ext4_ext_insert_extent); EXPORT_SYMBOL(ext4_ext_walk_space); EXPORT_SYMBOL(ext4_ext_find_goal); EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert); Mingming ? Why are we exporting these ? Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: which kernel version should I submit patch ?
On Thu, 2007-05-03 at 02:13 +0800, coly wrote: > Hi, list. > > I find these days 2.6.21 released. If I will submit patches for ext4dev, > should I generate the patch against 2.6.21 ? If you find a bug in ext4 on 2.6.21 - yes, generate against 2.6.21. Make sure its not already fixed in -mm or Ted's ext4 tree. http://www2.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/ Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
On Fri, 2007-03-02 at 09:16 -0600, Eric Sandeen wrote: > Badari Pulavarty wrote: > > > > Amit K. Arora wrote: > > > >> This is to give a heads up on few patches that we will be soon coming up > >> with. These patches implement a new system call sys_fallocate() and a > >> new inode operation "fallocate", for persistent preallocation. The new > >> system call, as Andrew suggested, will look like: > >> > >> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); > >> > > I am wondering about return values from this syscall ? Is it supposed to > > return the > > number of bytes allocated ? What about partial allocations ? > > If you don't have enough blocks to cover the request, you should > probably just return -ENOSPC, not a partial allocation. That could be challenging, when multiple writers are working in parallel. You may not be able to return -ENOSPC, till you fail the allocation (for filesystems which alllocates a block at a time). > > > What about > > if the > > blocks already exists ? What would be return values in those cases ? > > 0 on success, other normal errors oetherwise.. > > If asked for a range that includes already-allocated blocks, you just > allocate any non-allocated blocks in the range, I think. Yes. What I was trying to figure out is, if there is a requirement that interface need to return exact number of bytes it *really* allocated (like write() or read()). I can't think of any, but just wanted to through it out.. BTW, what is the interface for finding out what is the size of the pre-allocated file ? Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Heads up on sys_fallocate()
Amit K. Arora wrote: This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation "fallocate", for persistent preallocation. The new system call, as Andrew suggested, will look like: asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); I am wondering about return values from this syscall ? Is it supposed to return the number of bytes allocated ? What about partial allocations ? What about if the blocks already exists ? What would be return values in those cases ? Just curious .. What does posix_fallocate() return ? Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 15/23] ext3 -nobh option causes oops
On Thu, 2006-11-16 at 23:51 +0100, Adrian Bunk wrote: > It seems this patch that went into 2.6.17.8 should also be included in > the 2.6.16.x branch, or do I miss anything? > Yes. This is needed for 2.6.16.x branch also. Thanks, Badari > On Thu, Aug 03, 2006 at 10:40:03PM -0700, Greg KH wrote: > > -stable review patch. If anyone has any objections, please let us know. > > > > ------ > > From: Badari Pulavarty <[EMAIL PROTECTED]> > > > > For files other than IFREG, nobh option doesn't make sense. Modifications > > to them are journalled and needs buffer heads to do that. Without this > > patch, we get kernel oops in page_buffers(). > > > > Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> > > Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> > > Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]> > > > > --- > > fs/ext3/inode.c |6 +++--- > > 1 file changed, 3 insertions(+), 3 deletions(-) > > > > --- linux-2.6.17.7.orig/fs/ext3/inode.c > > +++ linux-2.6.17.7/fs/ext3/inode.c > > @@ -1159,7 +1159,7 @@ retry: > > ret = PTR_ERR(handle); > > goto out; > > } > > - if (test_opt(inode->i_sb, NOBH)) > > + if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode)) > > ret = nobh_prepare_write(page, from, to, ext3_get_block); > > else > > ret = block_prepare_write(page, from, to, ext3_get_block); > > @@ -1245,7 +1245,7 @@ static int ext3_writeback_commit_write(s > > if (new_i_size > EXT3_I(inode)->i_disksize) > > EXT3_I(inode)->i_disksize = new_i_size; > > > > - if (test_opt(inode->i_sb, NOBH)) > > + if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode)) > > ret = nobh_commit_write(file, page, from, to); > > else > > ret = generic_commit_write(file, page, from, to); > > @@ -1495,7 +1495,7 @@ static int ext3_writeback_writepage(stru > > goto out_fail; > > } > > > > - if (test_opt(inode->i_sb, NOBH)) > > + if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode)) > > ret = nobh_writepage(page, ext3_get_block, wbc); > > else > > ret = block_write_full_page(page, ext3_get_block, wbc); > > - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pagefault in generic_file_buffered_write() causing deadlock
Andrew Morton wrote: On Wed, 15 Nov 2006 07:57:45 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote: We are looking at a customer situation (on 2.6.16-based distro) - where system becomes almost useless while running some java & stress tests. Root cause seems to be taking a pagefault in generic_file_buffered_write () after calling prepare_write. I am wondering 1) Why & How this can happen - since we made sure to fault the user buffer before prepare write. When using writev() we only fault in the first segment of the iovec. If the second or succesive segment isn't mapped into pagetables we're vulnerable to the deadlock. Yes. I remember this change. Thank you. 2) If this is already fixed in current mainline (I can't see how). It was fixed in 2.6.17. You'll need 6527c2bdf1f833cc18e8f42bd97973d583e4aa83 and 81b0c8713385ce1b1b9058e916edcf9561ad76d6 I will try to get this change into customer :( Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pagefault in generic_file_buffered_write() causing deadlock
Andrew Morton wrote: On Wed, 15 Nov 2006 07:57:45 -0800 Badari Pulavarty <[EMAIL PROTECTED]> wrote: We are looking at a customer situation (on 2.6.16-based distro) - where system becomes almost useless while running some java & stress tests. Root cause seems to be taking a pagefault in generic_file_buffered_write () after calling prepare_write. I am wondering 1) Why & How this can happen - since we made sure to fault the user buffer before prepare write. When using writev() we only fault in the first segment of the iovec. If the second or succesive segment isn't mapped into pagetables we're vulnerable to the deadlock. Hmm.. Not it :( Its coming from write() not writev(). [C0002ABBF290] [C039D58C] .do_page_fault+0x2e4/0x75c [C0002ABBF460] [C0004860] .handle_page_fault+0x20/0x54 --- Exception: 301 at .__copy_tofrom_user+0x11c/0x580 LR = .generic_file_buffered_write+0x39c/0x7c8 [C0002ABBF750] [C0095A94] .generic_file_buffered_write+0x2c0/0x7c8 ( unreliable) [C0002ABBF8F0] [C00962EC] .__generic_file_aio_write_nolock+0x350/0x3 e0 [C0002ABBFA20] [C0096908] .generic_file_aio_write+0x78/0x104 [C0002ABBFAE0] [C01649F0] .ext3_file_write+0x2c/0xd4 [C0002ABBFB70] [C00C5168] .do_sync_write+0xd4/0x130 [C0002ABBFCF0] [C00C5ED4] .vfs_write+0x128/0x20c [C0002ABBFD90] [C00C664C] .sys_write+0x4c/0x8c [C0002ABBFE30] [C000871C] syscall_exit+0x0/0x40 Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
pagefault in generic_file_buffered_write() causing deadlock
Hi Andrew & MM experts, We are looking at a customer situation (on 2.6.16-based distro) - where system becomes almost useless while running some java & stress tests. Root cause seems to be taking a pagefault in generic_file_buffered_write () after calling prepare_write. I am wondering 1) Why & How this can happen - since we made sure to fault the user buffer before prepare write. 2) If this is already fixed in current mainline (I can't see how). Ideas on what I can do to fix it ? Thanks, Badari Here is the analysis & stacks: === Java thread doing mmap() holding for mmap_sem and waiting for transaction to be unlocked: java D 0fed3ff4 7104 2447 2391 2448 2446 (NOTLB) Call Trace: [C0002AC8F410] [C1315AC0] 0xc1315ac0 (unreliable) [C0002AC8F5E0] [C000F0B4] .__switch_to+0x12c/0x150 [C0002AC8F670] [C039980C] .schedule+0xcec/0xe4c [C0002AC8F780] [C017BC24] .start_this_handle+0x3b4/0x4ac [C0002AC8F8A0] [C017BE08] .journal_start+0xec/0x140 [C0002AC8F940] [C0171374] .ext3_journal_start_sb+0x58/0x78 [C0002AC8F9C0] [C016AB90] .ext3_dirty_inode+0x38/0xb0 [C0002AC8FA50] [C00F6820] .__mark_inode_dirty+0x60/0x1d4 [C0002AC8FAF0] [C00E9F60] .touch_atime+0xc8/0xe0 [C0002AC8FB80] [C0093834] .generic_file_mmap+0x54/0x80 [C0002AC8FC00] [C00AC450] .do_mmap_pgoff+0x558/0x870 [C0002AC8FD10] [C000A9C0] .sys_mmap+0xdc/0x160 [C0002AC8FDC0] [C0014258] .compat_sys_mmap2+0x14/0x28 [C0002AC8FE30] [C000871C] syscall_exit+0x0/0x40 kjournald locked the transaction and waiting for journal stop (t_updates to go to zero): kjournald D 8704 2167 1 2203 2028 (L-TLB) Call Trace: [C0003514F980] [C05257D8] amd74xx_pci_tbl+0x8/0x200 (unreliable) [C0003514FB50] [C000F0B4] .__switch_to+0x12c/0x150 [C0003514FBE0] [C039980C] .schedule+0xcec/0xe4c [C0003514FCF0] [C017DA58] .journal_commit_transaction+0x190/0x1448 [C0003514FE50] [C0182F44] .kjournald+0xf0/0x27c [C0003514FF90] [C0025630] .kernel_thread+0x4c/0x68 Another java thread, did journal_start() in prepare_write() and took a pagefault while copying. Now this is waiting for mmap_sem to finish the fault :( java D 0ffd76f0 6384 2452 2391 2453 2451 (NOTLB) Call Trace: [C0002ABBEE50] [C0002ABBEEE0] 0xc0002abbeee0 (unreliable) [C0002ABBF020] [C000F0B4] .__switch_to+0x12c/0x150 [C0002ABBF0B0] [C039980C] .schedule+0xcec/0xe4c [C0002ABBF1C0] [C039B688] .rwsem_down_read_failed +0x284/0x2d0 [C0002ABBF290] [C039D58C] .do_page_fault+0x2e4/0x75c [C0002ABBF460] [C0004860] .handle_page_fault+0x20/0x54 --- Exception: 301 at .__copy_tofrom_user+0x11c/0x580 LR = .generic_file_buffered_write+0x39c/0x7c8 [C0002ABBF750] [C0095A94] .generic_file_buffered_write+0x2c0/0x7c8 ( unreliable) [C0002ABBF8F0] [C00962EC] .__generic_file_aio_write_nolock+0x350/0x3 e0 [C0002ABBFA20] [C0096908] .generic_file_aio_write+0x78/0x104 [C0002ABBFAE0] [C01649F0] .ext3_file_write+0x2c/0xd4 [C0002ABBFB70] [C00C5168] .do_sync_write+0xd4/0x130 [C0002ABBFCF0] [C00C5ED4] .vfs_write+0x128/0x20c [C0002ABBFD90] [C00C664C] .sys_write+0x4c/0x8c [C0002ABBFE30] [C000871C] syscall_exit+0x0/0x40 - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.19-rc2-mm1 warning in invalidate_inode_pages2_range()
On Thu, 2006-10-19 at 16:02 -0700, Andrew Morton wrote: > On Thu, 19 Oct 2006 15:39:06 -0700 > Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > Hi Zach, > > > > While running IO tests I get following messages on 2.6.19-rc2-mm1 > > > > BUG: warning at mm/truncate.c:400/invalidate_inode_pages2_range() > > > > Call Trace: > > [] show_trace+0x41/0x70 > > [] dump_stack+0x12/0x20 > > [] invalidate_inode_pages2_range+0x297/0x2e0 > > [] generic_file_direct_IO+0xf5/0x130 > > [] generic_file_direct_write+0x74/0x140 > > [] __generic_file_aio_write_nolock+0x36c/0x4b0 > > [] generic_file_aio_write+0x67/0xd0 > > [] :ext4dev:ext4_file_write+0x23/0xc0 > > DWARF2 unwinder stuck at ext4_file_write+0x23/0xc0 [ext4dev] > > Leftover inexact backtrace: > > [] do_sync_write+0xcf/0x120 > > [] cp_new_stat+0xe7/0x100 > > [] autoremove_wake_function+0x0/0x30 > > [] __mutex_lock_slowpath+0x1df/0x1f0 > > [] vfs_write+0xbd/0x180 > > [] sys_write+0x53/0x90 > > [] system_call+0x7e/0x83 > > > > that's warning that we weren't able to invalidate some pagecache. That's > not really suprising. Perhaps we should be more careful in deciding when > to fail the call (ie: leaving behind a clean page is ok, leaving behind a > dirty page is bad). > > Probably it's leaving behind dirty pagecache and you're about to lose your > data. Which I bet is your own darn fault for doing silly things. > > What workload was in use? Running fsx and fsstress. - Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
2.6.19-rc2-mm1 warning in invalidate_inode_pages2_range()
Hi Zach, While running IO tests I get following messages on 2.6.19-rc2-mm1 BUG: warning at mm/truncate.c:400/invalidate_inode_pages2_range() Call Trace: [] show_trace+0x41/0x70 [] dump_stack+0x12/0x20 [] invalidate_inode_pages2_range+0x297/0x2e0 [] generic_file_direct_IO+0xf5/0x130 [] generic_file_direct_write+0x74/0x140 [] __generic_file_aio_write_nolock+0x36c/0x4b0 [] generic_file_aio_write+0x67/0xd0 [] :ext4dev:ext4_file_write+0x23/0xc0 DWARF2 unwinder stuck at ext4_file_write+0x23/0xc0 [ext4dev] Leftover inexact backtrace: [] do_sync_write+0xcf/0x120 [] cp_new_stat+0xe7/0x100 [] autoremove_wake_function+0x0/0x30 [] __mutex_lock_slowpath+0x1df/0x1f0 [] vfs_write+0xbd/0x180 [] sys_write+0x53/0x90 [] system_call+0x7e/0x83 FYI. Thaks Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.19-rc1-mm1 (ext4 problem ?)
On Wed, 2006-10-11 at 09:56 -0700, Andrew Morton wrote: > On Wed, 11 Oct 2006 08:02:14 -0700 > Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > On Tue, 2006-10-10 at 21:01 -0700, Andrew Morton wrote: > > > > > > > Andrew Morton wrote: > > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc1/2.6.19-rc1-mm1/ > > > > > > > > > > > > > > > - Added the ext4 filesystem. Quick usage instructions: > > > > > > > > > > - Grab updated e2fsprogs from > > > > > > > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/ > > > > > > > > > ext4 did not survive my simple stress tests. > > > > I ran 4 copies of fsx on ext4dev filesystem (without extents) + > > 4 copies of fsx on ext4dev (with extents). > > > > Machine hung after running for few hours. There are 4 fsx sigsegv > > messages on the console and the last message on the console is > > > > do_IRQ: 0.62 No irq handler for vector > > > > Quite a few people are hitting that - it's related to Eric's recent > IRQ/APIC-routing changes. > > I don't know why that would cause fsx to get a sigsegv though. I don't think they are related. Hopefully once we figure out IRQ problem, we can try to track down fsx problems. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3 fsx failures on 2.6.19-rc1
On Tue, 2006-10-10 at 14:30 +0200, Jan Kara wrote: > Hi, > > > I am having fsx failures on 2.6.19-rc1. > :( > > > I don't have any useful information at this time to track it down. > > I am running 4 copies of fsx (+ fsstress) on a 1k filesystem and > > one copy of fsx dies. > How long does it take? Random. It happens any where from 1 hours to 6 hours. > > > fsx-linux[20667]: segfault at rip 2af0fe031690 rsp > > 7fffacc03b88 error 4 > > > > READ BAD DATA: offset = 0xa352, size = 0x5fef > > OFFSET GOODBAD RANGE > > 0x df90 0x48e4 0x 0x 70 > > operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops > Hmm, so fsx-linux wrote something and read back zeros. Strange. Do you > know what that 'segfault' message means? I cannot find it in my copy of > fsx-linux... > > fsx got a segmentation fault and died. These are the messages in the dmesg. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
ext3 fsx failures on 2.6.19-rc1
Hi Andrew, I am having fsx failures on 2.6.19-rc1. I don't have any useful information at this time to track it down. I am running 4 copies of fsx (+ fsstress) on a 1k filesystem and one copy of fsx dies. FYI. Thanks, Badari fsx-linux[20667]: segfault at rip 2af0fe031690 rsp 7fffacc03b88 error 4 READ BAD DATA: offset = 0xa352, size = 0x5fef OFFSET GOODBAD RANGE 0x df90 0x48e4 0x 0x 70 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops LOG DUMP (71764 total operations): 71765(85 mod 256): TRUNCATE DOWNfrom 0x2464c to 0x362a ** 71766(86 mod 256): READ 0x1b16 thru 0x3629 (0x1b14 bytes) 71767(87 mod 256): MAPWRITE 0x3ca1b thru 0x3 (0x35e5 bytes) 71768(88 mod 256): MAPREAD 0x20f08 thru 0x2d401 (0xc4fa bytes) 71769(89 mod 256): MAPREAD 0x84d5 thru 0x16a99 (0xe5c5 bytes) ****** 71770(90 mod 256): WRITE0x22dfb thru 0x26da3 (0x3fa9 bytes) 71771(91 mod 256): WRITE0x11326 thru 0x19589 (0x8264 bytes) 71772(92 mod 256): READ 0x8db3 thru 0xc813 (0x3a61 bytes) 71773(93 mod 256): WRITE0x33aff thru 0x3a725 (0x6c27 bytes) 71774(94 mod 256): TRUNCATE DOWNfrom 0x4 to 0x25e02 71775(95 mod 256): MAPWRITE 0x7f4 thru 0xffc0 (0xf7cd bytes) ** 71776(96 mod 256): WRITE0xcfa8 thru 0xdba9(0xc02 bytes) 71777(97 mod 256): READ 0x233e9 thru 0x25e01(0x2a19 bytes) 71778(98 mod 256): TRUNCATE DOWNfrom 0x25e02 to 0x1ea56 71779(99 mod 256): MAPWRITE 0x2b905 thru 0x3aaba (0xf1b6 bytes) 71780(100 mod 256): READ0x32e45 thru 0x3aaba (0x7c76 bytes) 71781(101 mod 256): READ0x229ae thru 0x31eaf (0xf502 bytes) 71782(102 mod 256): TRUNCATE UP from 0x3aabb to 0x3b640 71783(103 mod 256): MAPREAD 0x2d2f4 thru 0x32a6d (0x577a bytes) 71784(104 mod 256): MAPWRITE 0x38d96 thru 0x3c980 (0x3beb bytes) 71785(105 mod 256): WRITE 0x964d thru 0x9a7e(0x432 bytes) 71786(106 mod 256): TRUNCATE DOWN from 0x3c981 to 0x2de7 ** 71787(107 mod 256): WRITE 0x5134 thru 0x81e4(0x30b1 bytes) HOLE 71788(108 mod 256): MAPWRITE 0xb02f thru 0x15d39 (0xad0b bytes) ** 71789(109 mod 256): MAPWRITE 0x22da2 thru 0x2c461 (0x96c0 bytes) 71790(110 mod 256): MAPREAD 0xe52e thru 0x10d99 (0x286c bytes) 71791(111 mod 256): MAPREAD 0x801e thru 0x14492 (0xc475 bytes) ****** 71792(112 mod 256): MAPWRITE 0x222e5 thru 0x2760f (0x532b bytes) 71793(113 mod 256): MAPREAD 0xc9e3 thru 0xe947(0x1f65 bytes) ****** 71794(114 mod 256): READ0x15618 thru 0x20a3e (0xb427 bytes) 71795(115 mod 256): READ0x21fcf thru 0x27c2b (0x5c5d bytes) 71796(116 mod 256): MAPREAD 0x4322 thru 0x13c95 (0xf974 bytes) ****** 71797(117 mod 256): MAPREAD 0x28ce6 thru 0x2c461 (0x377c bytes) 71798(118 mod 256): MAPWRITE 0x6467 thru 0xeef2 (0x8a8c bytes) ** 71799(119 mod 256): READ0x140e4 thru 0x1a3f1 (0x630e bytes) 71800(120 mod 256): MAPWRITE 0x15fdf thru 0x2186b (0xb88d bytes) 71801(121 mod 256): WRITE 0x22d1b thru 0x23bfb (0xee1 bytes) 71802(122 mod 256): WRITE 0x3ada8 thru 0x3 (0x5258 bytes) HOLE 71803(123 mod 256): TRUNCATE DOWN from 0x4 to 0x2276c 71804(124 mod 256): READ0x1d99c thru 0x2276b (0x4dd0 bytes) 71805(125 mod 256): READ0x8b7 thru 0x1189 (0x8d3 bytes) 71806(126 mod 256): MAPWRITE 0xe5ed thru 0x117b0 (0x31c4 bytes) 71807(127 mod 256): MAPWRITE 0x17121 thru 0x1d177 (0x6057 bytes) 71808(128 mod 256): READ0x6033 thru 0x11909 (0xb8d7 bytes) ****** 71809(129 mod 256): WRITE 0x3196d thru 0x3e669 (0xccfd bytes) HOLE 71810(130 mod 256): READ0x217f thru 0x25ab(0x42d bytes) 71811(131 mod 256): MAPWRITE 0x39b78 thru 0x3a677 (0xb00 bytes) 71812(132 mod 256): TRUNCATE DOWN from 0x3e66a to 0x330dc 71813(133 mod 256): TRUNCATE DOWN from 0x330dc to 0x2ca2d 71814(134 mod 256): TRUNCATE DOWN from 0x2ca2d to 0x184d1 71815(135 mod 256): TRUNCATE UP from 0x184d1 to 0x3f897 71816(136 mod 256): MAPREAD 0x1dd98 thru 0x2aaa8 (0xcd11 bytes) 71817(137 mod 256): MAPREAD 0x3ad15 thru 0x3f39f (0x468b bytes) 71818(138 mod 256): WRITE 0x323a6 thru 0x3b549 (0x91a4 bytes) 71819(139 mod 256): TRUNCATE DOWN from 0x3f897 to 0x200a7 71820(140 mod 256): TRUNCATE DOWN from 0x200a7 to 0x169f7 71821(141 mod 256): MAPWRITE 0x23ac9 thru 0x26308 (0x2840 bytes) 71822(142 mod 256): WRITE 0xc67d thru 0xcee6(0x86a bytes) 71823(143 mod 256): READ0x1e85f thru 0x26012 (0x77b4 bytes) 71824(144 mod 256): WRITE 0x12f5a thru 0x1c19a (0x9241 bytes) 71825(145 mod 256): MAPWRITE 0x2ee03 thru 0x30fdb (0x21d9 bytes) 71826(146 mod 256): READ0x2a481 thru 0x2f630 (0x51b0 bytes) 71827(147 mod 256): READ0x13e64 thru 0x1d7e6 (0x9983 bytes) 71828(148 mod 256): TRUNCATE DOWN from 0x30fdc to 0x7c66 ** 71829(149 mod 256): WRI
Re: [patch 003/152] jbd: fix commit of ordered data buffers
On Fri, 2006-09-29 at 11:02 +0200, Jan Kara wrote: ... > > >+ } > > >+ /* Someone already cleaned up the buffer? */ > > >+ if (!buffer_jbd(bh) > > >+ || jh->b_transaction != commit_transaction > > >+ || jh->b_jlist != BJ_SyncData) { > > >+ jbd_unlock_bh_state(bh); > > >+ if (locked) > > >+ unlock_buffer(bh); > > >+ BUFFER_TRACE(bh, "already cleaned up"); > > >+ put_bh(bh); > > >+ continue; >---> Here the buffer was refiled by someone else I am little concerned about this particular code. We know that some one else will do the unfile/remove - but we will keep spinning on it till that happens. Isn't it ? Why don't we skip it and move to next one ? We are seeing few following message while running tests and wondering if your patch is causing it .. BUG: spinlock lockup on CPU#1, scp/30189, cfb503d8 (Not tainted) Call Trace: [C00018FDB320] [C00102E0] .show_stack+0x68/0x1b0 (unreliable) [C00018FDB3C0] [C01734F4] ._raw_spin_lock+0x138/0x184 [C00018FDB460] [C025AD24] ._spin_lock+0x10/0x24 [C00018FDB4E0] [D0172E14] .journal_dirty_data+0xa4/0x2c0 [jbd] [C00018FDB580] [D0205BAC] .ext3_journal_dirty_data+0x28/0x70 [ext3] [C00018FDB610] [D02048BC] .walk_page_buffers+0xb0/0x134 [ext3] [C00018FDB6D0] [D0208280] .ext3_ordered_commit_write+0x74/0x114 [ext 3] [C00018FDB780] [C007E710] .generic_file_buffered_write+0x51c/0x794 [C00018FDB930] [C007ECB8] .__generic_file_aio_write_nolock+0x330/0x3 bc [C00018FDBA20] [C007EDBC] .generic_file_aio_write+0x78/0x104 [C00018FDBAE0] [D0202EFC] .ext3_file_write+0x2c/0xd4 [ext3] [C00018FDBB70] [C00A8090] .do_sync_write+0xd4/0x130 [C00018FDBCF0] [C00A8E14] .vfs_write+0x118/0x200 [C00018FDBD90] [C00A9584] .sys_write+0x4c/0x8c [C00018FDBE30] [C0008434] syscall_exit+0x0/0x40 Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[Fwd: Re: [patch 003/152] jbd: fix commit of ordered data buffers]
Forgot to CC: ext4 & lkml earlier.. Thanks, Badari --- Begin Message --- [EMAIL PROTECTED] wrote: From: Jan Kara <[EMAIL PROTECTED]> [suitable for 2.6.18.x around the 2.6.19-rc2 timeframe] Signed-off-by: Jan Kara <[EMAIL PROTECTED]> Cc: Badari Pulavarty <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> --- fs/jbd/commit.c | 182 -- 1 file changed, 113 insertions(+), 69 deletions(-) diff -puN fs/jbd/commit.c~jbd-fix-commit-of-ordered-data-buffers fs/jbd/commit.c --- a/fs/jbd/commit.c~jbd-fix-commit-of-ordered-data-buffers +++ a/fs/jbd/commit.c @@ -160,6 +160,117 @@ static int journal_write_commit_record(j return (ret == -EIO); } +static void journal_do_submit_data(struct buffer_head **wbuf, int bufs) +{ + int i; + + for (i = 0; i < bufs; i++) { + wbuf[i]->b_end_io = end_buffer_write_sync; + /* We use-up our safety reference in submit_bh() */ + submit_bh(WRITE, wbuf[i]); + } +} + +/* + * Submit all the data buffers to disk + */ +static void journal_submit_data_buffers(journal_t *journal, + transaction_t *commit_transaction) +{ + struct journal_head *jh; + struct buffer_head *bh; + int locked; + int bufs = 0; + struct buffer_head **wbuf = journal->j_wbuf; + + /* +* Whenever we unlock the journal and sleep, things can get added +* onto ->t_sync_datalist, so we have to keep looping back to +* write_out_data until we *know* that the list is empty. +* +* Cleanup any flushed data buffers from the data list. Even in +* abort mode, we want to flush this out as soon as possible. +*/ +write_out_data: + cond_resched(); + spin_lock(&journal->j_list_lock); + + while (commit_transaction->t_sync_datalist) { + jh = commit_transaction->t_sync_datalist; Don't we need commit_transaction->t_sync_datalist = jh->b_tnext; here to skip to the next "jh" for next time ? Thanks, Badari + bh = jh2bh(jh); + locked = 0; + + /* Get reference just to make sure buffer does not disappear +* when we are forced to drop various locks */ + get_bh(bh); + /* If the buffer is dirty, we need to submit IO and hence +* we need the buffer lock. We try to lock the buffer without +* blocking. If we fail, we need to drop j_list_lock and do +* blocking lock_buffer(). +*/ + if (buffer_dirty(bh)) { + if (test_set_buffer_locked(bh)) { + BUFFER_TRACE(bh, "needs blocking lock"); + spin_unlock(&journal->j_list_lock); + /* Write out all data to prevent deadlocks */ + journal_do_submit_data(wbuf, bufs); + bufs = 0; + lock_buffer(bh); + spin_lock(&journal->j_list_lock); + } + locked = 1; + } + /* We have to get bh_state lock. Again out of order, sigh. */ + if (!inverted_lock(journal, bh)) { + jbd_lock_bh_state(bh); + spin_lock(&journal->j_list_lock); + } + /* Someone already cleaned up the buffer? */ + if (!buffer_jbd(bh) + || jh->b_transaction != commit_transaction + || jh->b_jlist != BJ_SyncData) { + jbd_unlock_bh_state(bh); + if (locked) + unlock_buffer(bh); + BUFFER_TRACE(bh, "already cleaned up"); + put_bh(bh); + continue; + } + if (locked && test_clear_buffer_dirty(bh)) { + BUFFER_TRACE(bh, "needs writeout, adding to array"); + wbuf[bufs++] = bh; + __journal_file_buffer(jh, commit_transaction, + BJ_Locked); + jbd_unlock_bh_state(bh); + if (bufs == journal->j_wbufsize) { + spin_unlock(&journal->j_list_lock); + journal_do_submit_data(wbuf, bufs); + bufs = 0; + goto write_out_data; + } + } + else { + BUFFER_TRACE(bh, "write
[PATCH] ext3 sequential read regression fix
ext3-get-blocks support caused ~20% degrade in Sequential read performance (tiobench). Problem is with marking the buffer boundary so IO can be submitted right away. Here is the patch to fix it. 2.6.18-rc6: --- # ./iotest 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 75.2726 seconds, 57.1 MB/s real1m15.285s user0m0.276s sys 0m3.884s 2.6.18-rc6 + fix: - [EMAIL PROTECTED] ~]# ./iotest 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 62.9356 seconds, 68.2 MB/s The boundary block check in ext3_get_blocks_handle needs to be adjusted against the count of blocks mapped in this call, now that it can map more than one block. Signed-off-by: Suparna Bhattacharya <[EMAIL PROTECTED]> Tested-by: Badari Pulavarty <[EMAIL PROTECTED]> linux-2.6.18-rc5-suparna/fs/ext3/inode.c |2 +- 1 files changed, 1 insertion(+), 1 deletion(-) diff -puN fs/ext3/inode.c~ext3-multiblock-boundary-fix fs/ext3/inode.c --- linux-2.6.18-rc5/fs/ext3/inode.c~ext3-multiblock-boundary-fix 2006-09-15 10:53:12.0 +0530 +++ linux-2.6.18-rc5-suparna/fs/ext3/inode.c2006-09-15 10:54:30.0 +0530 @@ -925,7 +925,7 @@ int ext3_get_blocks_handle(handle_t *han set_buffer_new(bh_result); got_it: map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key)); - if (blocks_to_boundary == 0) + if (count > blocks_to_boundary) set_buffer_boundary(bh_result); err = count; /* Clean up and exit */ - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext3 sequential read performance (~20%) degrade
On Fri, 2006-09-15 at 11:20 +0530, Suparna Bhattacharya wrote: > On Thu, Sep 14, 2006 at 05:03:08PM -0700, Andrew Morton wrote: > > On Thu, 14 Sep 2006 16:36:12 -0700 > > Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > > > Hi Andrew, > > > > > > I have been working on tracking down ~20% performance degrade for > > > sequential read performance for ext3. > > > > oop. I'd kinda prefer that we discover things like this before the patch > > gets into mainline. > > > > > Finally narrowed it down to get_blocks() support. If I force > > > ext3_get_blocks_handle() to always return 1 block - I get better > > > IO rate. I did all the usual stuff, tracked down requests, traced > > > blocksizes, looked at readahead code, looked at mpage_readpages() > > > etc.. I still can't figure out how to explain the degrade.. > > > > > > Any suggestions on how to track it down. > > > > Learn to driver Jens's blktrace stuff, find out why the IO scheduling went > > bad. > > > > Number one suspicion: the buffer_boundary() stuff isn't working. > > I think you are right about that - perhaps something along > the lines of the following patch (untested) would help ? Yep. It works :) Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
ext3 sequential read performance (~20%) degrade
Hi Andrew, I have been working on tracking down ~20% performance degrade for sequential read performance for ext3. Finally narrowed it down to get_blocks() support. If I force ext3_get_blocks_handle() to always return 1 block - I get better IO rate. I did all the usual stuff, tracked down requests, traced blocksizes, looked at readahead code, looked at mpage_readpages() etc.. I still can't figure out how to explain the degrade.. Any suggestions on how to track it down. Thanks, Badari # cat iotest mount /dev/sdb2 /mnt/tmp time dd if=/mnt/tmp/testfile of=/dev/null bs=4k count=1048576 umount /mnt/tmp 2.6.18-rc6: (multiblock): # ./iotest 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 75.2654 seconds, 57.1 MB/s real1m15.282s user0m0.248s sys 0m4.292s 2.6.18-rc6 (force single block in ext3_get_blocks_handle(): # ./iotest ./iotest 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 62.9472 seconds, 68.2 MB/s real1m2.976s user0m0.268s sys 0m4.280s - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
On Mon, 2006-09-11 at 11:46 +0200, Jan Kara wrote: ... > > > > I don't have any performance tests handy. We have some automated tests I > > can schedule to run to verify the stability aspects. > OK. I've run IOZONE rewrite throughput test on my computer with > iozone -t 10 -i 0 -s 10M -e > 2.6.18-rc6 and the same kernel + my patch seem to give almost the same > results. The strange thing was that both in vanilla and patched kernel there > were several runs where a write througput (when iozone was creating the file) > was suddenly 10% of the usual value (18MB/s vs. 2MB/s). The rewrite numbers > were always fine. Maybe that has something to do with block allocation > code. Anyway, it is not a regression of my patch so unless your test > finds some problem I think the patch should be ready for inclusion into > -mm... Your patch seems to be working fine. I haven't found any major regression yet. I spent lot of time trying to reproduce the problem with buffer-debug Andrew sent out - I really wanted to get to bottom of whats really happening here (since your patch made it go away). Yes. Your theory is correct. journal_dirty_data() is moving the buffer-head from commited transaction to current one and journal_unmap_buffer() is discarding and cleaning up the buffer-head. Later set_page_dirty() dirties the buffer-head there by causing BUG() in submit_bh(). Here is the buffer-trace-debug output to confirm it. I can sleep better now :) Now we can figure out, if your fix is the right one or not .. Thanks, Badari buffer trace for buffer at 0x8101d8540678 (I am CPU 3) journal_dirty_data():[fs/jbd/transaction.c:939] entry b_state:0x402d b_jlist:BJ_SyncData cpu:3 b_count:15 b_blocknr:45037 b_jbd:1 b_frozen_data: b_committed_data: b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:0 b_trans_is_comitting:1 b_jcount:1 pg_dirty:1 journal_dirty_data():[fs/jbd/transaction.c:971] has transaction b_state:0x10402d b_jlist:BJ_SyncData cpu:3 b_count:15 b_blocknr:45037 b_jbd:1 b_frozen_data: b_committed_data: b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:0 b_trans_is_comitting:1 b_jcount:1 pg_dirty:1 journal_dirty_data():[fs/jbd/transaction.c:973] belongs to older transaction b_state:0x10402d b_jlist:BJ_SyncData cpu:3 b_count:15 b_blocknr:45037 b_jbd:1 b_frozen_data: b_committed_data: b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:0 b_trans_is_comitting:1 b_jcount:1 pg_dirty:1 journal_dirty_data():[fs/jbd/transaction.c:1037] unfile from commit b_state:0x10402d b_jlist:BJ_SyncData cpu:3 b_count:15 b_blocknr:45037 b_jbd:1 b_frozen_data: b_committed_data: b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:0 b_trans_is_comitting:1 b_jcount:1 pg_dirty:1 __journal_temp_unlink_buffer():[fs/jbd/transaction.c:1501] entry b_state:0x10402d b_jlist:BJ_SyncData cpu:3 b_count:15 b_blocknr:45037 b_jbd:1 b_frozen_data: b_committed_data: b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:0 b_trans_is_comitting:1 b_jcount:1 pg_dirty:1 journal_dirty_data():[fs/jbd/transaction.c:1055] not on correct data list: unfile b_state:0x10402d b_jlist:BJ_None cpu:3 b_count:15 b_blocknr:45037 b_jbd:1 b_frozen_data: b_committed_data: b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:1 b_trans_is_comitting:0 b_jcount:1 pg_dirty:1 __journal_temp_unlink_buffer():[fs/jbd/transaction.c:1501] entry b_state:0x10402d b_jlist:BJ_None cpu:3 b_count:15 b_blocknr:45037 b_jbd:1 b_frozen_data: b_committed_data: b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:1 b_trans_is_comitting:0 b_jcount:1 pg_dirty:1 journal_dirty_data():[fs/jbd/transaction.c:1059] file as data b_state:0x10402d b_jlist:BJ_None cpu:3 b_count:15 b_blocknr:45037 b_jbd:1 b_frozen_data: b_committed_data: b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:1 b_trans_is_comitting:0 b_jcount:1 pg_dirty:1 __journal_file_buffer():[fs/jbd/transaction.c:1948] entry b_state:0x10402d b_jlist:BJ_None cpu:3 b_count:15 b_blocknr:45037 b_jbd:1 b_frozen_data: b_committed_data: b_transaction:1 b_next_transaction:0 b_cp_transaction:0 b_trans_is_running:1 b_trans_is_comitting:0 b_jcount:1 pg_dirty:1 __journal_temp_unlink_buffer():[fs/jbd/transaction.c:1501] entry b_state:0x10402d b_jlist:BJ_None cpu:3 b_count:15 b_blocknr:45037 b_jbd:1 b_frozen_data: b_committed_data: b_transaction:1 b_n
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
Jan Kara wrote: Hi, Jan Kara wrote: I've been looking more at the code and I have revived my patch fixing this part of the code. I've mildly tested the patch. Could you also give it a try? Thanks. Honza Original commit code assumes, that when a buffer on BJ_SyncData list is locked, it is being written to disk. But this is not true and hence it can lead to a potential data loss on crash. Also the code didn't count with the fact that journal_dirty_data() can steal buffers from committing transaction and hence could write buffers that no longer belong to the committing transaction. Finally it could possibly happen that we tried writing out one buffer several times. The patch below tries to solve these problems by a complete rewrite of the data commit code. We go through buffers on t_sync_datalist, lock buffers needing write out and store them in an array. Buffers are also immediately refiled to BJ_Locked list or unfiled (if the write out is completed). When the array is full or we have to block on buffer lock, we submit all accumulated buffers for IO. Signed-off-by: Jan Kara <[EMAIL PROTECTED]> I have been running 4+ hours with this patch and seems to work fine. I haven't hit any assert yet :) I will let it run till tomorrow. I will let you know, how it goes. Great, thanks. BTW: Do you have any performance tests handy? The changes are big enough to cause some unexpected performance regressions, livelocks... If you don't have anything ready, I can setup and run something myself. Just that I don't like this testing too much ;). Tests are still running fine. I don't have any performance tests handy. We have some automated tests I can schedule to run to verify the stability aspects. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
Jan Kara wrote: Ugh! Are you sure? For this path the buffer must be attached (only) to the running transaction. But then how the commit code comes to it? Somebody would have to even manage to refile the buffer from the committing transaction to the running one while the buffer is in wbuf[]. Could you check whether someone does __journal_refile_buffer() on your marked buffers, please? Or whether we move buffer to BJ_Locked list in the write_out_data: loop? Thanks. I added more debug in __journal_refile_buffer() to see if the marked buffers are getting refiled. I am able to reproduce the problem, but I don't see any debug including my original prints. (It looks as if none of my debug code exists) - its really confusing. I will keep looking and get back to you. I've been looking more at the code and I have revived my patch fixing this part of the code. I've mildly tested the patch. Could you also give it a try? Thanks. Honza Original commit code assumes, that when a buffer on BJ_SyncData list is locked, it is being written to disk. But this is not true and hence it can lead to a potential data loss on crash. Also the code didn't count with the fact that journal_dirty_data() can steal buffers from committing transaction and hence could write buffers that no longer belong to the committing transaction. Finally it could possibly happen that we tried writing out one buffer several times. The patch below tries to solve these problems by a complete rewrite of the data commit code. We go through buffers on t_sync_datalist, lock buffers needing write out and store them in an array. Buffers are also immediately refiled to BJ_Locked list or unfiled (if the write out is completed). When the array is full or we have to block on buffer lock, we submit all accumulated buffers for IO. Signed-off-by: Jan Kara <[EMAIL PROTECTED]> I have been running 4+ hours with this patch and seems to work fine. I haven't hit any assert yet :) I will let it run till tomorrow. I will let you know, how it goes. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext3_getblk should handle HOLE correctly
On Thu, 2006-09-07 at 11:45 -0700, Andrew Morton wrote: > On Wed, 06 Sep 2006 10:39:06 -0700 > Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > Hi Andrew, > > > > Its been reported that ext3_getblk() is not doing the right thing > > and triggering following WARN(): > > > > BUG: warning at fs/ext3/inode.c:1016/ext3_getblk() > > ext3_getblk+0x98/0x2a6 md_wakeup_thread > > +0x26/0x2a > > ext3_bread+0x1f/0x88 ext3_quota_read+0x136/0x1ae > > v1_read_dqblk+0x61/0xac dquot_acquire+0xf6/0x107 > > ext3_acquire_dquot+0x46/0x68 dqget+0x155/0x1e7 > > dquot_transfer+0x3e0/0x3e9 dput+0x23/0x13e > > ext3_setattr+0xc3/0x240 current_fs_time > > +0x52/0x6a > > notify_change+0x2bd/0x30d chown_common+0x9c/0xc5 > > strncpy_from_user+0x3b/0x68 do_path_lookup > > +0xdf/0x266 > > __user_walk_fd+0x44/0x5a sys_chown+0x4a/0x55 > > vfs_write+0xe7/0x13c sys_mkdir+0x1f/0x23 > > syscall_call+0x7/0xb > > > > Looking at the code, it looks like its not handle HOLE correctly. > > It ends up returning -EIO. > > Strange. The fs should be spewing these warnings all over the place. For > some reason this code is hard to trigger. Why?? I guess - ext3_getblk() mostly used by ext3_bread() and most callers to it would be reading already allocated block. > > > - if (err == 1) { > > + /* > > +* ext3_get_blocks_handle() returns number of blocks > > +* mapped. 0 in case of a HOLE. > > +*/ > > + if (err > 0) { > > err = 0; > > - } else if (err >= 0) { > > - WARN_ON(1); > > - err = -EIO; > > } > > That removes the warning if ext3_get_blocks_handle() returned a positive > number greater than one. And it looks like we still need debugging support > in this area. I am not sure why we need it ? All we care about is one block. If ext3_get_blocks_handle() returns more than one (which it shouldn't) - it still be okay. Whats wrong with that ? Just curious .. May be we should add a WARN() in ext3_get_blocks_handle() when it returns more than asked for. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
> Ugh! Are you sure? For this path the buffer must be attached (only) to > the running transaction. But then how the commit code comes to it? > Somebody would have to even manage to refile the buffer from the > committing transaction to the running one while the buffer is in wbuf[]. > Could you check whether someone does __journal_refile_buffer() on your > marked buffers, please? Or whether we move buffer to BJ_Locked list in > the write_out_data: loop? Thanks. > > I added more debug in __journal_refile_buffer() to see if the marked buffers are getting refiled. I am able to reproduce the problem, but I don't see any debug including my original prints. (It looks as if none of my debug code exists) - its really confusing. I will keep looking and get back to you. I may try Andrew's buffer debug patch - if I get desperate. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
Andrew Morton wrote: On Wed, 6 Sep 2006 19:27:33 +0200 Jan Kara <[EMAIL PROTECTED]> wrote: Ugh! Are you sure? For this path the buffer must be attached (only) to the running transaction. But then how the commit code comes to it? Somebody would have to even manage to refile the buffer from the committing transaction to the running one while the buffer is in wbuf[]. Could you check whether someone does __journal_refile_buffer() on your marked buffers, please? Or whether we move buffer to BJ_Locked list in the write_out_data: loop? Thanks. The ext3-debug patch will help here. It records, within the bh, the inputs from the last 32 BUFFER_TRACE()s which were run against this bh. If a J_ASSERT fails then you get a nice trace of the last 32 "things" which happened to this bh, including the bh's state at that transition. It basically tells you everything you need to know to find the bug. It's worth spending the time to become familiar with it - I used it a lot in early ext3 development. I will try the patch. Unfortunately, adding more debug is causing the problem reproduction difficult. I've been unable to reproduce this crash, btw. Is there some magic incantation apat from running `fsx-linux'? All I do is on a single 1k filesystem, run 4 copies of fsx (on 4 different files, ofcourse). I hit the assert anywhere between 10min-2hours. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext3_getblk should handle HOLE correctly
Hi Andrew, Its been reported that ext3_getblk() is not doing the right thing and triggering following WARN(): BUG: warning at fs/ext3/inode.c:1016/ext3_getblk() ext3_getblk+0x98/0x2a6 md_wakeup_thread +0x26/0x2a ext3_bread+0x1f/0x88 ext3_quota_read+0x136/0x1ae v1_read_dqblk+0x61/0xac dquot_acquire+0xf6/0x107 ext3_acquire_dquot+0x46/0x68 dqget+0x155/0x1e7 dquot_transfer+0x3e0/0x3e9 dput+0x23/0x13e ext3_setattr+0xc3/0x240 current_fs_time +0x52/0x6a notify_change+0x2bd/0x30d chown_common+0x9c/0xc5 strncpy_from_user+0x3b/0x68 do_path_lookup +0xdf/0x266 __user_walk_fd+0x44/0x5a sys_chown+0x4a/0x55 vfs_write+0xe7/0x13c sys_mkdir+0x1f/0x23 syscall_call+0x7/0xb Looking at the code, it looks like its not handle HOLE correctly. It ends up returning -EIO. Here is the patch to fix it. If we really want to be paranoid, we can allow return values 0 (HOLE), 1 (we asked for one block) and return -EIO for more than 1 block. But I really don't see a reason for doing it - all we need is the block# here. (doesn't matter how many blocks are mapped). Thanks, Badari ext3_get_blocks_handle() returns number of blocks it mapped. It returns 0 in case of HOLE. ext3_getblk() should handle HOLE properly (currently its dumping warning stack and returning -EIO). Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> Acked-by: Mingming Cao <[EMAIL PROTECTED]> --- fs/ext3/inode.c |9 + 1 file changed, 5 insertions(+), 4 deletions(-) Index: linux-2.6.18-rc5/fs/ext3/inode.c === --- linux-2.6.18-rc5.orig/fs/ext3/inode.c 2006-08-27 20:41:48.0 -0700 +++ linux-2.6.18-rc5/fs/ext3/inode.c2006-09-05 15:32:57.0 -0700 @@ -1009,11 +1009,12 @@ struct buffer_head *ext3_getblk(handle_t buffer_trace_init(&dummy.b_history); err = ext3_get_blocks_handle(handle, inode, block, 1, &dummy, create, 1); - if (err == 1) { + /* +* ext3_get_blocks_handle() returns number of blocks +* mapped. 0 in case of a HOLE. +*/ + if (err > 0) { err = 0; - } else if (err >= 0) { - WARN_ON(1); - err = -EIO; } *errp = err; if (!err && buffer_mapped(&dummy)) { - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
On Wed, 2006-09-06 at 18:27 +0200, Jan Kara wrote: > > On Wed, 2006-09-06 at 17:34 +0200, Jan Kara wrote: > > > > On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote: > > > > > > > > > > Andrew, what should we do ? Do you suggest handling this in jbd > > > > > > itself (like this patch) ? > > > > > Actually that part of commit code needs rewrite anyway (and after > > > > > that > > > > > rewrite you get rid of ll_rw_block()) because of other problems - the > > > > > code assumes that whenever buffer is locked, it is being written to > > > > > disk > > > > > which is not true... I have some preliminary patches for that but they > > > > > are not very nice and so far I didn't have enough time to find a nice > > > > > solution. > > > > > > > > Are you okay with current not-so-elegant fix ? > > > Actually I don't quite understand how it can happen what you describe > > > (so probably I missed something). How it can happen that some buffers > > > are unmapped while we are committing them? journal_unmap_buffers() > > > checks whether we are not committing truncated buffers and if so, it > > > does not do anything to such buffers... > > > Bye > > > Honza > > > > Yep. I spent lot of time trying to understand - why they are not > > getting skipped :( > > > > But my debug clearly shows that we are clearing the buffer, while > > we haven't actually submitted to ll_rw_block() code. (I added "track" > > flag to bh and set it in journal_commit_transaction() when we add > > them to wbuf[] and clear it in ll_rw_block() after submit. I checked > > for this flag in journal_unmap_buffer() while clearing the buffer). > > Here is what my debug shows: > > > > buffer is tracked bh 8101686ea850 size 1024 > > > > Call Trace: > > [] show_trace+0xb5/0x370 > > [] dump_stack+0x15/0x20 > > [] journal_invalidatepage+0x314/0x3b0 > I see just journal_invalidatepage() here. That is fine. It calls > journal_unmap_buffer() which should do nothing return 0. If it does > not it would be IMO bug.. If the buffer is really unmapped here, in what > state it is (i.e. which list is it on?). Okay.. here is the path its taking according to my debug .. Remember, the issue is: after the buffer is cleaned - they are still (left) attached to the page (since a page can have 4 buffer heads and we partially truncated the page). After we clean up the buffers any subsequent call to set_page_dirty() would end up marking *all* the buffers dirty. If ll_rw_block() happens after this, we will run into the assert. If no set_page_dirty() happens before ll_rw_block() happens, things would be fine - as the buffer won't be dirty and be skipped. journal_unmap_buffer() { . } else { /* Good, the buffer belongs to the running transaction. * We are writing our own transaction's data, not any * previous one's, so it is safe to throw it away * (remember that we expect the filesystem to have set * i_size already for this truncate so recovery will not * expose the disk blocks we are discarding here.) */ J_ASSERT_JH(jh, transaction == journal- >j_running_transaction); may_free = __dispose_buffer(jh, transaction); } zap_buffer: journal_put_journal_head(jh); zap_buffer_no_jh: spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); spin_unlock(&journal->j_state_lock); zap_buffer_unlocked: clear_buffer_dirty(bh); J_ASSERT_BH(bh, !buffer_jbddirty(bh)); clear_buffer_mapped(bh); clear_buffer_req(bh); clear_buffer_new(bh); bh->b_bdev = NULL; return may_free; } - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
On Wed, 2006-09-06 at 18:27 +0200, Jan Kara wrote: > > On Wed, 2006-09-06 at 17:34 +0200, Jan Kara wrote: > > > > On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote: > > > > > > > > > > Andrew, what should we do ? Do you suggest handling this in jbd > > > > > > itself (like this patch) ? > > > > > Actually that part of commit code needs rewrite anyway (and after > > > > > that > > > > > rewrite you get rid of ll_rw_block()) because of other problems - the > > > > > code assumes that whenever buffer is locked, it is being written to > > > > > disk > > > > > which is not true... I have some preliminary patches for that but they > > > > > are not very nice and so far I didn't have enough time to find a nice > > > > > solution. > > > > > > > > Are you okay with current not-so-elegant fix ? > > > Actually I don't quite understand how it can happen what you describe > > > (so probably I missed something). How it can happen that some buffers > > > are unmapped while we are committing them? journal_unmap_buffers() > > > checks whether we are not committing truncated buffers and if so, it > > > does not do anything to such buffers... > > > Bye > > > Honza > > > > Yep. I spent lot of time trying to understand - why they are not > > getting skipped :( > > > > But my debug clearly shows that we are clearing the buffer, while > > we haven't actually submitted to ll_rw_block() code. (I added "track" > > flag to bh and set it in journal_commit_transaction() when we add > > them to wbuf[] and clear it in ll_rw_block() after submit. I checked > > for this flag in journal_unmap_buffer() while clearing the buffer). > > Here is what my debug shows: > > > > buffer is tracked bh 8101686ea850 size 1024 > > > > Call Trace: > > [] show_trace+0xb5/0x370 > > [] dump_stack+0x15/0x20 > > [] journal_invalidatepage+0x314/0x3b0 > I see just journal_invalidatepage() here. That is fine. It calls > journal_unmap_buffer() which should do nothing return 0. If it does > not it would be IMO bug.. If the buffer is really unmapped here, in what > state it is (i.e. which list is it on?). > Acutally, I added dump_stack() in journal_unmap_buffer() when it does clear_buffer_mapped(). gcc must of pulled in the function .. I will add more debug to track the list bh came from. - Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
On Wed, 2006-09-06 at 17:34 +0200, Jan Kara wrote: > > On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote: > > > > > > Andrew, what should we do ? Do you suggest handling this in jbd > > > > itself (like this patch) ? > > > Actually that part of commit code needs rewrite anyway (and after that > > > rewrite you get rid of ll_rw_block()) because of other problems - the > > > code assumes that whenever buffer is locked, it is being written to disk > > > which is not true... I have some preliminary patches for that but they > > > are not very nice and so far I didn't have enough time to find a nice > > > solution. > > > > Are you okay with current not-so-elegant fix ? > Actually I don't quite understand how it can happen what you describe > (so probably I missed something). How it can happen that some buffers > are unmapped while we are committing them? journal_unmap_buffers() > checks whether we are not committing truncated buffers and if so, it > does not do anything to such buffers... > Bye > Honza Yep. I spent lot of time trying to understand - why they are not getting skipped :( But my debug clearly shows that we are clearing the buffer, while we haven't actually submitted to ll_rw_block() code. (I added "track" flag to bh and set it in journal_commit_transaction() when we add them to wbuf[] and clear it in ll_rw_block() after submit. I checked for this flag in journal_unmap_buffer() while clearing the buffer). Here is what my debug shows: buffer is tracked bh 8101686ea850 size 1024 Call Trace: [] show_trace+0xb5/0x370 [] dump_stack+0x15/0x20 [] journal_invalidatepage+0x314/0x3b0 [] ext3_invalidatepage+0x38/0x40 [] do_invalidatepage+0x20/0x30 [] truncate_complete_page+0x23/0x50 [] truncate_inode_pages_range+0xcd/0x300 [] truncate_inode_pages+0x10/0x20 [] vmtruncate+0x5f/0x100 [] inode_setattr+0x30/0x140 [] ext3_setattr+0x1bb/0x230 [] notify_change+0x15e/0x320 [] do_truncate+0x53/0x80 [] sys_ftruncate+0xf8/0x130 [] system_call+0x7e/0x83 Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote: > Hi, > > > On Fri, 2006-09-01 at 10:18 -0700, Andrew Morton wrote: > > > > > > Kernel BUG at fs/buffer.c:2791 > > > > > > invalid opcode: [1] SMP > > > > > > > > > > > > Its complaining about BUG_ON(!buffer_mapped(bh)). > > > > Here is the change that seems to cause the problem. Jana Kara > > introduced a new mode "SWRITE" for ll_rw_block() - where it > > waits for buffer to be unlocked (WRITE will skip locked > > buffers) + jbd journal_commit_transaction() has been changed > > to make use of SWRITE. > > > > http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112109788702895&w=2 > > > > Theoritically same race (between journal_commit_transaction() and > > journal_unmap_buffer()+set_page_dirty()) could exist before his changes > > - which should trigger bug in submit_bh(). But I can't seem to hit > > it without his changes. My guess is ll_rw_block() is always skipping > > those cleaned up buffers, before page gets dirtied again .. > I think that the change to ll_rw_block() just widens the window much > more... > Yes. > > Andrew, what should we do ? Do you suggest handling this in jbd > > itself (like this patch) ? > Actually that part of commit code needs rewrite anyway (and after that > rewrite you get rid of ll_rw_block()) because of other problems - the > code assumes that whenever buffer is locked, it is being written to disk > which is not true... I have some preliminary patches for that but they > are not very nice and so far I didn't have enough time to find a nice > solution. Are you okay with current not-so-elegant fix ? Thanks, Badari Patch to fix: Kernel BUG at fs/buffer.c:2791 on 1k (2k) filesystems while running fsx. journal_commit_transaction collects lots of dirty buffer from and does a single ll_rw_block() to write them out. ll_rw_block() locks the buffer and checks to see if they are dirty and submits them for IO. In the mean while, journal_unmap_buffers() as part of truncate can unmap the buffer and throw it away. Since its a 1k (2k) filesystem - each page (4k) will have more than one buffer_head attached to the page and and we can't free up buffer_heads attached to the page (if we are not invalidating the whole page). Now, any call to set_page_dirty() (like msync_interval) could end up setting all the buffer heads attached to this page again dirty, including the ones those got cleaned up :( So, -not-so-elegant fix would be to have make jbd skip all the buffers that are not mapped. Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> --- fs/jbd/commit.c | 32 ++-- 1 file changed, 30 insertions(+), 2 deletions(-) Index: linux-2.6.18-rc5/fs/jbd/commit.c === --- linux-2.6.18-rc5.orig/fs/jbd/commit.c 2006-08-27 20:41:48.0 -0700 +++ linux-2.6.18-rc5/fs/jbd/commit.c2006-09-01 10:43:54.0 -0700 @@ -160,6 +160,34 @@ static int journal_write_commit_record(j return (ret == -EIO); } +static void jbd_write_buffers(int nr, struct buffer_head *bhs[]) +{ + int i; + + for (i = 0; i < nr; i++) { + struct buffer_head *bh = bhs[i]; + + lock_buffer(bh); + + /* +* case 1: Buffer could have been flushed by now, +* if so nothing to do for us. +* case 2: Buffer could have been unmapped up by +* journal_unmap_buffer - but still attached to the +* page. Any calls to set_page_dirty() would dirty +* the buffer even though its not mapped. If so, +* we need to skip them. +*/ + if (buffer_mapped(bh) && test_clear_buffer_dirty(bh)) { + bh->b_end_io = end_buffer_write_sync; + get_bh(bh); + submit_bh(WRITE, bh); + continue; + } + unlock_buffer(bh); + } +} + /* * journal_commit_transaction * @@ -356,7 +384,7 @@ write_out_data: jbd_debug(2, "submit %d writes\n", bufs); spin_unlock(&journal->j_list_lock); - ll_rw_block(SWRITE, bufs, wbuf); + jbd_write_buffers(bufs, wbuf); journal_brelse_array(wbuf, bufs); bufs = 0; goto write_out_data; @@ -379,7 +407,7 @@ write_out_data: if (bufs) {
Re: BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
On Tue, 2006-09-05 at 17:47 -0400, Will Simoneau wrote: > On 14:06 Tue 05 Sep , Badari Pulavarty wrote: > > Will Simoneau wrote: > > >Has anyone seen this before? These three traces occured at different times > > >today when three new user accounts (and associated quotas) were created. > > >This > > >machine is an NFS server which uses quotas on an ext3 fs (dir_index is on). > > >Kernel is 2.6.17.11 on an x86 smp w/64G highmem; 4G ram is installed. The > > >affected filesystem is on a software raid1 of two hardware raid0 volumes > > >from a > > >megaraid card. > > > > > >BUG: warning at fs/ext3/inode.c:1016/ext3_getblk() > > > ext3_getblk+0x98/0x2a6 md_wakeup_thread+0x26/0x2a > > > ext3_bread+0x1f/0x88 ext3_quota_read+0x136/0x1ae > > > v1_read_dqblk+0x61/0xac dquot_acquire+0xf6/0x107 > > > ext3_acquire_dquot+0x46/0x68 dqget+0x155/0x1e7 > > > dquot_transfer+0x3e0/0x3e9 dput+0x23/0x13e I think, we found your problem. ext3_getblk() is not handling HOLE correctly. Does this patch help ? Mingming, what do you think ? Thanks, Badari ext3_get_blocks_handle() returns number of blocks it mapped. It returns 0 in case of HOLE. ext3_getblk() should handle HOLE properly (currently its dumping warning stack and returning -EIO). Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> --- fs/ext3/inode.c |9 + 1 file changed, 5 insertions(+), 4 deletions(-) Index: linux-2.6.18-rc5/fs/ext3/inode.c === --- linux-2.6.18-rc5.orig/fs/ext3/inode.c 2006-08-27 20:41:48.0 -0700 +++ linux-2.6.18-rc5/fs/ext3/inode.c2006-09-05 15:32:57.0 -0700 @@ -1009,11 +1009,12 @@ struct buffer_head *ext3_getblk(handle_t buffer_trace_init(&dummy.b_history); err = ext3_get_blocks_handle(handle, inode, block, 1, &dummy, create, 1); - if (err == 1) { + /* +* ext3_get_blocks_handle() returns number of blocks +* mapped. 0 in case of a HOLE. +*/ + if (err > 0) { err = 0; - } else if (err >= 0) { - WARN_ON(1); - err = -EIO; } *errp = err; if (!err && buffer_mapped(&dummy)) { - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
Will Simoneau wrote: Has anyone seen this before? These three traces occured at different times today when three new user accounts (and associated quotas) were created. This machine is an NFS server which uses quotas on an ext3 fs (dir_index is on). Kernel is 2.6.17.11 on an x86 smp w/64G highmem; 4G ram is installed. The affected filesystem is on a software raid1 of two hardware raid0 volumes from a megaraid card. BUG: warning at fs/ext3/inode.c:1016/ext3_getblk() ext3_getblk+0x98/0x2a6 md_wakeup_thread+0x26/0x2a ext3_bread+0x1f/0x88 ext3_quota_read+0x136/0x1ae v1_read_dqblk+0x61/0xac dquot_acquire+0xf6/0x107 ext3_acquire_dquot+0x46/0x68 dqget+0x155/0x1e7 dquot_transfer+0x3e0/0x3e9 dput+0x23/0x13e Made me curious and looking around on what the warning is coming ? Few basic questions .. Do you have CONFIG_LBD ? I see the ext3_getblk() used "long" for "block" & ext3_get_blocks_handle() expects "sector_t" for "block". Wondering if you are running into 64-bit -to- 32-bit conversion issues .. ? Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
Dave Kleikamp wrote: On Tue, 2006-09-05 at 11:09 -0700, Badari Pulavarty wrote: On Tue, 2006-09-05 at 13:10 -0400, Will Simoneau wrote: Has anyone seen this before? These three traces occured at different times today when three new user accounts (and associated quotas) were created. This machine is an NFS server which uses quotas on an ext3 fs (dir_index is on). Kernel is 2.6.17.11 on an x86 smp w/64G highmem; 4G ram is installed. The affected filesystem is on a software raid1 of two hardware raid0 volumes from a megaraid card. BUG: warning at fs/ext3/inode.c:1016/ext3_getblk() ext3_getblk+0x98/0x2a6 md_wakeup_thread+0x26/0x2a ext3_bread+0x1f/0x88 ext3_quota_read+0x136/0x1ae v1_read_dqblk+0x61/0xac dquot_acquire+0xf6/0x107 ext3_acquire_dquot+0x46/0x68 dqget+0x155/0x1e7 dquot_transfer+0x3e0/0x3e9 dput+0x23/0x13e ext3_setattr+0xc3/0x240 current_fs_time+0x52/0x6a notify_change+0x2bd/0x30d chown_common+0x9c/0xc5 strncpy_from_user+0x3b/0x68 do_path_lookup+0xdf/0x266 __user_walk_fd+0x44/0x5a sys_chown+0x4a/0x55 vfs_write+0xe7/0x13c sys_mkdir+0x1f/0x23 syscall_call+0x7/0xb I think its a bogus warning. ext3_getblk() is calling ext3_get_blocks_handle() to map "1" block for read. But for *some* reason ext3_get_blocks_handle() more than 1 block. ext3_get_blocks_handle() return "positive #of blocks" is a valid case. So needs to be fixed. I'm having a hard time figuring out exactly what ext3_get_blocks_handle is trying to return, but it looks to me like if it is allocating one data block, and needs to allocate an indirect block as well, then it will return 2 rather than 1. Is this expected, or am I just confused? It would return "1" in that case.. (data block) > 0 means get_block() suceeded and indicates the number of blocks mapped = 0 lookup failed < 0 mean error case I did search for callers of ext3_get_blocks_handle() and found that ext3_readdir() seems to do wrong thing all the time with error check :( Need to take a closer look.. err = ext3_get_blocks_handle(NULL, inode, blk, 1, &map_bh, 0, 0); if (err > 0) { <<<< BAD page_cache_readahead(sb->s_bdev->bd_inode->i_mapping, &filp->f_ra, filp, map_bh.b_blocknr >> (PAGE_CACHE_SHIFT - inode->i_blkbits), 1); bh = ext3_bread(NULL, inode, blk, 0, &err); } Bad to do this what it's doing, or bad to call name the variable "err"? I think if it looked like this: count = ext3_get_blocks_handle(NULL, inode, blk, 1, &map_bh, 0, 0); if (count > 0) { it would be a lot less confusing. I am sorry !! it is doing the right thing :( Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
On Tue, 2006-09-05 at 13:10 -0400, Will Simoneau wrote: > Has anyone seen this before? These three traces occured at different times > today when three new user accounts (and associated quotas) were created. This > machine is an NFS server which uses quotas on an ext3 fs (dir_index is on). > Kernel is 2.6.17.11 on an x86 smp w/64G highmem; 4G ram is installed. The > affected filesystem is on a software raid1 of two hardware raid0 volumes from > a > megaraid card. > > BUG: warning at fs/ext3/inode.c:1016/ext3_getblk() > ext3_getblk+0x98/0x2a6 md_wakeup_thread+0x26/0x2a > ext3_bread+0x1f/0x88 ext3_quota_read+0x136/0x1ae > v1_read_dqblk+0x61/0xac dquot_acquire+0xf6/0x107 > ext3_acquire_dquot+0x46/0x68 dqget+0x155/0x1e7 > dquot_transfer+0x3e0/0x3e9 dput+0x23/0x13e > ext3_setattr+0xc3/0x240 current_fs_time+0x52/0x6a > notify_change+0x2bd/0x30d chown_common+0x9c/0xc5 > strncpy_from_user+0x3b/0x68 do_path_lookup+0xdf/0x266 > __user_walk_fd+0x44/0x5a sys_chown+0x4a/0x55 > vfs_write+0xe7/0x13c sys_mkdir+0x1f/0x23 > syscall_call+0x7/0xb I think its a bogus warning. ext3_getblk() is calling ext3_get_blocks_handle() to map "1" block for read. But for *some* reason ext3_get_blocks_handle() more than 1 block. ext3_get_blocks_handle() return "positive #of blocks" is a valid case. So needs to be fixed. I did search for callers of ext3_get_blocks_handle() and found that ext3_readdir() seems to do wrong thing all the time with error check :( Need to take a closer look.. err = ext3_get_blocks_handle(NULL, inode, blk, 1, &map_bh, 0, 0); if (err > 0) { BAD page_cache_readahead(sb->s_bdev->bd_inode->i_mapping, &filp->f_ra, filp, map_bh.b_blocknr >> (PAGE_CACHE_SHIFT - inode->i_blkbits), 1); bh = ext3_bread(NULL, inode, blk, 0, &err); } Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
On Fri, 2006-09-01 at 10:18 -0700, Andrew Morton wrote: > On Fri, 01 Sep 2006 09:32:22 -0700 > Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > > > Kernel BUG at fs/buffer.c:2791 > > > > invalid opcode: [1] SMP > > > > > > > > Its complaining about BUG_ON(!buffer_mapped(bh)). > > I need to have a little think about this, remember what _should_ be > happening in this situation. > > We (mainly I) used to do a huge amount of fsx-linux testing on 1k blocksize > filesystems. We've done something to make this start happening. Part of > resolving this bug will be working out what that was. > It took a while to track this down. 2.6.13 is the last kernel which runs my fsx tests fine (72+ hours). Here is the change that seems to cause the problem. Jana Kara introduced a new mode "SWRITE" for ll_rw_block() - where it waits for buffer to be unlocked (WRITE will skip locked buffers) + jbd journal_commit_transaction() has been changed to make use of SWRITE. http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112109788702895&w=2 Theoritically same race (between journal_commit_transaction() and journal_unmap_buffer()+set_page_dirty()) could exist before his changes - which should trigger bug in submit_bh(). But I can't seem to hit it without his changes. My guess is ll_rw_block() is always skipping those cleaned up buffers, before page gets dirtied again .. Andrew, what should we do ? Do you suggest handling this in jbd itself (like this patch) ? Thanks, Badari Patch to fix: Kernel BUG at fs/buffer.c:2791 on 1k (2k) filesystems while running fsx. journal_commit_transaction collects lots of dirty buffer from and does a single ll_rw_block() to write them out. ll_rw_block() locks the buffer and checks to see if they are dirty and submits them for IO. In the mean while, journal_unmap_buffers() as part of truncate can unmap the buffer and throw it away. Since its a 1k (2k) filesystem - each page (4k) will have more than one buffer_head attached to the page and and we can't free up buffer_heads attached to the page (if we are not invalidating the whole page). Now, any call to set_page_dirty() (like msync_interval) could end up setting all the buffer heads attached to this page again dirty, including the ones those got cleaned up :( So, -not-so-elegant fix would be to have make jbd skip all the buffers that are not mapped. Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> --- fs/jbd/commit.c | 32 ++-- 1 file changed, 30 insertions(+), 2 deletions(-) Index: linux-2.6.18-rc5/fs/jbd/commit.c === --- linux-2.6.18-rc5.orig/fs/jbd/commit.c 2006-08-27 20:41:48.0 -0700 +++ linux-2.6.18-rc5/fs/jbd/commit.c2006-09-01 10:43:54.0 -0700 @@ -160,6 +160,34 @@ static int journal_write_commit_record(j return (ret == -EIO); } +static void jbd_write_buffers(int nr, struct buffer_head *bhs[]) +{ + int i; + + for (i = 0; i < nr; i++) { + struct buffer_head *bh = bhs[i]; + + lock_buffer(bh); + + /* +* case 1: Buffer could have been flushed by now, +* if so nothing to do for us. +* case 2: Buffer could have been unmapped up by +* journal_unmap_buffer - but still attached to the +* page. Any calls to set_page_dirty() would dirty +* the buffer even though its not mapped. If so, +* we need to skip them. +*/ + if (buffer_mapped(bh) && test_clear_buffer_dirty(bh)) { + bh->b_end_io = end_buffer_write_sync; + get_bh(bh); + submit_bh(WRITE, bh); + continue; + } + unlock_buffer(bh); + } +} + /* * journal_commit_transaction * @@ -356,7 +384,7 @@ write_out_data: jbd_debug(2, "submit %d writes\n", bufs); spin_unlock(&journal->j_list_lock); - ll_rw_block(SWRITE, bufs, wbuf); + jbd_write_buffers(bufs, wbuf); journal_brelse_array(wbuf, bufs); bufs = 0; goto write_out_data; @@ -379,7 +407,7 @@ write_out_data: if (bufs) { spin_unlock(&journal->j_list_lock); - ll_rw_block(SWRITE, bufs, wbuf); + jbd_write_buffers(bufs, wbuf); journal_brelse_array(wbuf, bufs); spin_lock(&journal->j_list_lock); } - T
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
Andrew Morton wrote: On Fri, 01 Sep 2006 09:32:22 -0700 Badari Pulavarty <[EMAIL PROTECTED]> wrote: Kernel BUG at fs/buffer.c:2791 invalid opcode: [1] SMP Its complaining about BUG_ON(!buffer_mapped(bh)). I need to have a little think about this, remember what _should_ be happening in this situation. We (mainly I) used to do a huge amount of fsx-linux testing on 1k blocksize filesystems. We've done something to make this start happening. Part of resolving this bug will be working out what that was. Here is the progress in tracking this down so far. I am able to reproduce the problem on following kernel versions. 2.6.18-rc5 2.6.18-rc4 2.6.17.11 2.6.16.28 2.6.15.7 2.6.14.7 I am yet to find a latest kernel version - where this works :( I am going to try older versions of the kernel. Thanks, Badari -- VGER BF report: H 3.60822e-15 - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
On Fri, 2006-09-01 at 10:18 -0700, Andrew Morton wrote: > On Fri, 01 Sep 2006 09:32:22 -0700 > Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > > > Kernel BUG at fs/buffer.c:2791 > > > > invalid opcode: [1] SMP > > > > > > > > Its complaining about BUG_ON(!buffer_mapped(bh)). > > I need to have a little think about this, remember what _should_ be > happening in this situation. Agreed. I used to run fsx on regular basis - never saw the problem before. I will try to go back few kernel versions and verify. > > We (mainly I) used to do a huge amount of fsx-linux testing on 1k blocksize > filesystems. We've done something to make this start happening. Part of > resolving this bug will be working out what that was. > Here is the other fix, I did to avoid the problem (basically, have jbd manage its own submit function and skip unmapped buffers) - not elegant. Thats why I tried to address at the root cause.. Thanks, Badari Patch to fix: Kernel BUG at fs/buffer.c:2791 on 1k (2k) filesystems while running fsx. journal_commit_transaction collects lots of dirty buffer from and does a single ll_rw_block() to write them out. ll_rw_block() locks the buffer and checks to see if they are dirty and submits them for IO. In the mean while, journal_unmap_buffers() as part of truncate can unmap the buffer and throw it away. Since its a 1k (2k) filesystem - each page (4k) will have more than one buffer_head attached to the page and and we can't free up buffer_heads attached to the page (if we are not invalidating the whole page). Now, any call to set_page_dirty() (like msync_interval) could end up setting all the buffer heads attached to this page again dirty, including the ones those got cleaned up :( So, -not-so-elegant fix would be to have make jbd skip all the buffers that are not mapped. Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> --- fs/jbd/commit.c | 32 ++-- 1 file changed, 30 insertions(+), 2 deletions(-) Index: linux-2.6.18-rc5/fs/jbd/commit.c === --- linux-2.6.18-rc5.orig/fs/jbd/commit.c 2006-08-27 20:41:48.0 -0700 +++ linux-2.6.18-rc5/fs/jbd/commit.c2006-09-01 10:36:16.0 -0700 @@ -160,6 +160,34 @@ static int journal_write_commit_record(j return (ret == -EIO); } +static void jbd_write_buffers(int nr, struct buffer_head *bhs[]) +{ + int i; + + for (i = 0; i < nr; i++) { + struct buffer_head *bh = bhs[i]; + + lock_buffer(bh); + + /* +* case 1: Buffer could have been flushed by now, +* if so nothing to do for us. +* case 2: Buffer could have been unmapped up by +* journal_unmap_buffer - but still attached to the +* page. Any calls to set_page_dirty() would dirty +* the buffer even though its not mapped. If so, +* we need to skip them. +*/ + if (buffer_mapped(bh) && test_clear_buffer_dirty(bh)) { + bh->b_end_io = end_buffer_write_sync; + get_bh(bh); + submit_bh(WRITE, bh); + continue; + } + unlock_buffer(bh); + } +} + /* * journal_commit_transaction * @@ -356,7 +384,7 @@ write_out_data: jbd_debug(2, "submit %d writes\n", bufs); spin_unlock(&journal->j_list_lock); - ll_rw_block(SWRITE, bufs, wbuf); + jbd_write_buffer(bufs, wbuf); journal_brelse_array(wbuf, bufs); bufs = 0; goto write_out_data; @@ -379,7 +407,7 @@ write_out_data: if (bufs) { spin_unlock(&journal->j_list_lock); - ll_rw_block(SWRITE, bufs, wbuf); + jbd_write_buffers(bufs, wbuf); journal_brelse_array(wbuf, bufs); spin_lock(&journal->j_list_lock); } - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
On Fri, 2006-09-01 at 17:12 +0100, Anton Altaparmakov wrote: > Hi, > > On Fri, 1 Sep 2006, Badari Pulavarty wrote: > > > > I have been running into following bug while running fsx > > tests on 1k (ext3) filesystem all the time. > > > > --- [cut here ] - [please bite here ] - > > Kernel BUG at fs/buffer.c:2791 > > invalid opcode: [1] SMP > > > > Its complaining about BUG_ON(!buffer_mapped(bh)). > > > > It was hard to track it down, needed lots of debug - but here > > is the problem & fix. Since the fix is in __set_page_buffer_dirty() > > code - I am wondering how it would effect others :( > > This will breaks NTFS and probably a lot of other file systems I would > think. Well, it can happen only with fileystems with blocksize < pagesize. So, that should limit the scope :) > > For example all buffer based file systems in their writepage > implementations will create buffers if none are present and those will not > be mapped. If for whatever reason writepage now breaks out before mapping > the buffers (e.g. because the buffers do not need to be written or due to > an error) you are left with a page containing unmapped buffers. > > Then a page dirty comes in caused by a mmapped write for example. > > __set_page_dirty_bufferes() runs by default and with your patch does not > set the unmapped buffers dirty thus they never get written out and you > have data corruption. I was wondering, is it *okay* to have a dirty buffer but not mapped to disk ? I guess, so .. > > It is the caller of submit_bh() that is doing the stupidity of submitting > unmapped buffers for i/o or even the caller of the caller, etc... > Somewhere up in that chain buffers should have been mapped before being > submitted for i/o otherwise it is a BUG() (as correctly identified by > submit_bh()). > > Perhaps the real fix is not to have ext3 use ll_rw_block() and instead > make it use submit_bh() directly and only submit buffers inside the file > size and/or make it map buffers before calling ll_rw_block() and if they > are outside the file size just clean them without submitting them... Yeah. I considered doing it in ll_rw_block() - but then I thought fixing it in set_page_buffer_dirty() may be a valid fix :( Let me try other options, then. Thanks, Badari - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers
Hi Andrew, I have been running into following bug while running fsx tests on 1k (ext3) filesystem all the time. --- [cut here ] - [please bite here ] - Kernel BUG at fs/buffer.c:2791 invalid opcode: [1] SMP Its complaining about BUG_ON(!buffer_mapped(bh)). It was hard to track it down, needed lots of debug - but here is the problem & fix. Since the fix is in __set_page_buffer_dirty() code - I am wondering how it would effect others :( With this fix fsx tests ran for more than 16 hours (and still running). Please let me know, what you think. Thanks, Badari Patch to fix: Kernel BUG at fs/buffer.c:2791 on 1k (2k) filesystems while running fsx. journal_commit_transaction collects lots of dirty buffer from and does a single ll_rw_block() to write them out. ll_rw_block() locks the buffer and checks to see if they are dirty and submits them for IO. In the mean while, journal_unmap_buffers() as part of truncate can unmap the buffer and throw it away. Since its a 1k (2k) filesystem - each page (4k) will have more than one buffer_head attached to the page and and we can't free up buffer_heads attached to the page (if we are not invalidating the whole page). Now, any call to set_page_dirty() (like msync_interval) could end up setting all the buffer heads attached to this page again dirty, including the ones those got cleaned up :( If ll_rw_block() runs now and sees the dirty bit it does submit_bh() on those buffer_heads and triggers the assert. Fix is to check if the buffer is mapped before setting its dirty bit in __set_page_dirty_buffers(). Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]> --- fs/buffer.c |8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) Index: linux-2.6.18-rc5/fs/buffer.c === --- linux-2.6.18-rc5.orig/fs/buffer.c 2006-09-01 08:20:51.0 -0700 +++ linux-2.6.18-rc5/fs/buffer.c2006-09-01 08:41:01.0 -0700 @@ -846,7 +846,13 @@ int __set_page_dirty_buffers(struct page struct buffer_head *bh = head; do { - set_buffer_dirty(bh); + /* +* Its possible that, not all buffers attached to +* this page are mapped (cleaned up by truncate). +* If so, skip them. +*/ + if (buffer_mapped(bh)) + set_buffer_dirty(bh); bh = bh->b_this_page; } while (bh != head); } - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html