Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
Hi,

On Wed, 2005-04-06 at 02:23, Andrew Morton wrote:
> Nobody has noticed the now-fixed leak since 2.6.6 and this one appears to
> be 100x slower. Which is fortunate because this one is going to take a
> long time to fix. I'll poke at it some more.

OK, I'm now at the stage where I can kick off that fsx test on a kernel without your leak fix, kill it, umount and get

Whoops: found 43 unfreeable buffers still on the superblock debug list for sb 0100296b2d48.
Tracing one...
buffer trace for buffer at 0x01003edaa9c8 (I am CPU 0)
...

with a trace pointing to journal_unmap_buffer(). I'll try with the fix in place to see if there are any other cases showing up with the same problem.

--Stephen

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao <[EMAIL PROTECTED]> wrote:
> I run the test(20 instances of fsx) with your patch on 2.6.12-rc1 with
> 512MB RAM (where I were able to constantly re-create the mem leak and
> lead to OOM before). The result is the kernel did not get into OOM after
> about 19 hours(before it took about 9 hours or so), system is still
> responsive. However I did notice about ~60MB delta between Active
> +Inactive and Buffers+cached+Swapcached+Mapped+Slab

yes. Nobody has noticed the now-fixed leak since 2.6.6 and this one appears to be 100x slower. Which is fortunate because this one is going to take a long time to fix. I'll poke at it some more.
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Mon, 2005-04-04 at 13:04 -0700, Andrew Morton wrote: > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > On Sun, 2005-04-03 at 18:35 -0700, Andrew Morton wrote: > > > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > > > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on > > > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10 > > > > hours the system hit OOM, and OOM keep killing processes one by one. I > > > > could reproduce this problem very constantly on a 2 way PIII 700MHZ > > > > with > > > > 512MB RAM. Also the problem could be reproduced on running the same > > > > test > > > > on reiser fs. > > > > > > > > The fsx command is: > > > > > > > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 & > > > > > > > > > This ext3 bug goes all the way back to 2.6.6. > > > > > I don't know yet why you saw problems with reiser3 and I'm pretty sure I > > > saw problems with ext2. More testing is needed there. > > > > > > > We (Janet and I) are chasing this bug as well. Janet is able to > > reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down > > this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will > > double check it. Will try your patch today. > > There's a second leak, with similar-looking symptoms. At ~50 > commits/second it has leaked ~10MB in 24 hours, so it's very slow - less > than a hundredth the rate of the first one. > I run the test(20 instances of fsx) with your patch on 2.6.12-rc1 with 512MB RAM (where I were able to constantly re-create the mem leak and lead to OOM before). The result is the kernel did not get into OOM after about 19 hours(before it took about 9 hours or so), system is still responsive. 
However I did notice about ~60MB delta between Active+Inactive and Buffers+Cached+SwapCached+Mapped+Slab. Here is the current /proc/meminfo:

elm3b92:~ # cat /proc/meminfo
MemTotal:       510400 kB
MemFree:         97004 kB
Buffers:        196772 kB
Cached:          77608 kB
SwapCached:          0 kB
Active:         299064 kB
Inactive:        83140 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       510400 kB
LowFree:         97004 kB
SwapTotal:     1052216 kB
SwapFree:      1052216 kB
Dirty:            1600 kB
Writeback:           0 kB
Mapped:          23256 kB
Slab:            24548 kB
CommitLimit:   1307416 kB
Committed_AS:    73560 kB
PageTables:        532 kB
VmallocTotal:   516024 kB
VmallocUsed:      3700 kB
VmallocChunk:   512320 kB
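[Editor's note: the ~60MB figure quoted above can be checked directly from the meminfo numbers in this message. A minimal standalone sketch of that accounting (the arithmetic is the point, not any kernel interface):

```python
# Estimate "unaccounted" LRU memory: pages on the active/inactive lists
# that the usual consumers (buffer cache, page cache, swap cache,
# mapped pages, slab) do not account for. All values in kB, copied
# from the /proc/meminfo dump quoted above.
meminfo = {
    "Active": 299064, "Inactive": 83140,
    "Buffers": 196772, "Cached": 77608, "SwapCached": 0,
    "Mapped": 23256, "Slab": 24548,
}

lru = meminfo["Active"] + meminfo["Inactive"]
accounted = sum(meminfo[k] for k in
                ("Buffers", "Cached", "SwapCached", "Mapped", "Slab"))
delta_kb = lru - accounted
print(delta_kb)  # 60020 kB -- the "~60MB delta" reported above
```

The same subtraction against a healthy system comes out near zero, which is why the thread uses it as a leak detector.]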
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Hi,

On Mon, 2005-04-04 at 02:35, Andrew Morton wrote:
> Without the below patch it's possible to make ext3 leak at around a
> megabyte per minute by arranging for the fs to run a commit every 50
> milliseconds, btw.

Ouch!

> (Stephen, please review...)

Doing so now.

> The patch teaches journal_unmap_buffer() about buffers which are on the
> committing transaction's t_locked_list. These buffers have been written and
> I/O has completed.

Agreed. The key here is that the buffer is locked before journal_unmap_buffer() is called, so we can indeed rely on it being safely on disk.

> We can take them off the transaction and undirty them
> within the context of journal_invalidatepage()->journal_unmap_buffer().

Right - the committing transaction can't be doing any more writes, and the current transaction has explicitly told us to throw away its own writes if we get here. Unfiling the buffer should be safe.

> +	if (jh->b_jlist == BJ_Locked) {
> +		/*
> +		 * The buffer is on the committing transaction's locked
> +		 * list. We have the buffer locked, so I/O has
> +		 * completed. So we can nail the buffer now.
> +		 */
> +		may_free = __dispose_buffer(jh, transaction);
> +		goto zap_buffer;
> +	}

ACK.

--Stephen
Re: OOM problems on 2.6.12-rc1 with many fsx tests
>> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
>> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
>> > > hours the system hit OOM, and OOM keep killing processes one by one. I
>> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
>> > > 512MB RAM. Also the problem could be reproduced on running the same test
>> > > on reiser fs.
>> > >
>> > > The fsx command is:
>> > >
>> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>> >
>> > This ext3 bug goes all the way back to 2.6.6.
>> >
>> > I don't know yet why you saw problems with reiser3 and I'm pretty sure I
>> > saw problems with ext2. More testing is needed there.
>>
>> We (Janet and I) are chasing this bug as well. Janet is able to
>> reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down
>> this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will
>> double check it. Will try your patch today.
>
> There's a second leak, with similar-looking symptoms. At ~50
> commits/second it has leaked ~10MB in 24 hours, so it's very slow - less
> than a hundredth the rate of the first one.

What are you using to see these with, just kgdb, and a large cranial capacity? Or is there some more magic?

m.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
"Martin J. Bligh" <[EMAIL PROTECTED]> wrote:
> >> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> >> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> >> > > hours the system hit OOM, and OOM keep killing processes one by one. I
> >> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> >> > > 512MB RAM. Also the problem could be reproduced on running the same test
> >> > > on reiser fs.
> >> > >
> >> > > The fsx command is:
> >> > >
> >> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >> >
> >> > This ext3 bug goes all the way back to 2.6.6.
> >> >
> >> > I don't know yet why you saw problems with reiser3 and I'm pretty sure I
> >> > saw problems with ext2. More testing is needed there.
> >>
> >> We (Janet and I) are chasing this bug as well. Janet is able to
> >> reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down
> >> this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will
> >> double check it. Will try your patch today.
> >
> > There's a second leak, with similar-looking symptoms. At ~50
> > commits/second it has leaked ~10MB in 24 hours, so it's very slow - less
> > than a hundredth the rate of the first one.
>
> What are you using to see these with, just kgdb, and a large cranial
> capacity? Or is there some more magic?

Nothing magical: run the test for a while, kill everything, cause a huge swapstorm then look at the meminfo numbers. If active+inactive is significantly larger than cached+buffers+swapcached+mapped (minus a bit) then it's leaked.
Right now I have:

MemTotal:       246264 kB
MemFree:        196148 kB
Buffers:          4200 kB
Cached:           3308 kB
SwapCached:       8064 kB
Active:          21548 kB
Inactive:        12532 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       246264 kB
LowFree:        196148 kB
SwapTotal:     1020116 kB
SwapFree:      1001612 kB
Dirty:              60 kB
Writeback:           0 kB
Mapped:           2284 kB
Slab:            12200 kB
CommitLimit:   1143248 kB
Committed_AS:    34004 kB
PageTables:       1200 kB
VmallocTotal:   774136 kB
VmallocUsed:     82832 kB
VmallocChunk:   691188 kB
HugePages_Total:     0
HugePages_Free:      0

33 megs on the LRU, unaccounted for in other places.

Once the leak is nice and large I can start a new swapstorm, set a breakpoint in try_to_free_buffers() (for example) and start looking at the state of the page and its buffers.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao <[EMAIL PROTECTED]> wrote:
> On Sun, 2005-04-03 at 18:35 -0700, Andrew Morton wrote:
> > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > >
> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > > hours the system hit OOM, and OOM keep killing processes one by one. I
> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > > 512MB RAM. Also the problem could be reproduced on running the same test
> > > on reiser fs.
> > >
> > > The fsx command is:
> > >
> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >
> > This ext3 bug goes all the way back to 2.6.6.
> >
> > I don't know yet why you saw problems with reiser3 and I'm pretty sure I
> > saw problems with ext2. More testing is needed there.
>
> We (Janet and I) are chasing this bug as well. Janet is able to
> reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down
> this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will
> double check it. Will try your patch today.

There's a second leak, with similar-looking symptoms. At ~50 commits/second it has leaked ~10MB in 24 hours, so it's very slow - less than a hundredth the rate of the first one.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Sun, 2005-04-03 at 18:35 -0700, Andrew Morton wrote:
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> >
> > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > hours the system hit OOM, and OOM keep killing processes one by one. I
> > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > 512MB RAM. Also the problem could be reproduced on running the same test
> > on reiser fs.
> >
> > The fsx command is:
> >
> > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>
> This ext3 bug goes all the way back to 2.6.6.
>
> I don't know yet why you saw problems with reiser3 and I'm pretty sure I
> saw problems with ext2. More testing is needed there.

We (Janet and I) are chasing this bug as well. Janet is able to reproduce this bug on 2.6.9 but I can't. Glad to know you have nailed down this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will double-check it. Will try your patch today.

Thanks,
Mingming
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao <[EMAIL PROTECTED]> wrote: > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10 > hours the system hit OOM, and OOM keep killing processes one by one. I > could reproduce this problem very constantly on a 2 way PIII 700MHZ with > 512MB RAM. Also the problem could be reproduced on running the same test > on reiser fs. > > The fsx command is: > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 & This ext3 bug goes all the way back to 2.6.6. I don't know yet why you saw problems with reiser3 and I'm pretty sure I saw problems with ext2. More testing is needed there. Without the below patch it's possible to make ext3 leak at around a megabyte per minute by arranging for the fs to run a commit every 50 milliseconds, btw. (Stephen, please review...) This fixes the lots-of-fsx-linux-instances-cause-a-slow-leak bug. It's been there since 2.6.6, caused by: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mm4/broken-out/jbd-move-locked-buffers.patch That patch moves under-writeout ordered-data buffers onto a separate journal list during commit. It took out the old code which was based on a single list. The old code (necessarily) had logic which would restart I/O against buffers which had been redirtied while they were on the committing transaction's t_sync_datalist list. The new code only writes buffers once, ignoring redirtyings by a later transaction, which is good. But over on the truncate side of things, in journal_unmap_buffer(), we're treating buffers on the t_locked_list as inviolable things which belong to the committing transaction, and we just leave them alone during concurrent truncate-vs-commit. The net effect is that when truncate tries to invalidate a page whose buffers are on t_locked_list and have been redirtied, journal_unmap_buffer() just leaves those buffers alone. 
truncate will remove the page from its mapping and we end up with an anonymous clean page with dirty buffers, which is an illegal state for a page. The JBD commit will not clean those buffers as they are removed from t_locked_list. The VM (try_to_free_buffers) cannot reclaim these pages.

The patch teaches journal_unmap_buffer() about buffers which are on the committing transaction's t_locked_list. These buffers have been written and I/O has completed. We can take them off the transaction and undirty them within the context of journal_invalidatepage()->journal_unmap_buffer().

Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 25-akpm/fs/jbd/transaction.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff -puN fs/jbd/transaction.c~jbd-dirty-buffer-leak-fix fs/jbd/transaction.c
--- 25/fs/jbd/transaction.c~jbd-dirty-buffer-leak-fix	2005-04-03 15:12:12.0 -0700
+++ 25-akpm/fs/jbd/transaction.c	2005-04-03 15:14:40.0 -0700
@@ -1812,7 +1812,17 @@ static int journal_unmap_buffer(journal_
 			}
 		}
 	} else if (transaction == journal->j_committing_transaction) {
-		/* If it is committing, we simply cannot touch it. We
+		if (jh->b_jlist == BJ_Locked) {
+			/*
+			 * The buffer is on the committing transaction's locked
+			 * list. We have the buffer locked, so I/O has
+			 * completed. So we can nail the buffer now.
+			 */
+			may_free = __dispose_buffer(jh, transaction);
+			goto zap_buffer;
+		}
+		/*
+		 * If it is committing, we simply cannot touch it. We
 		 * can remove it's next_transaction pointer from the
 		 * running transaction if that is set, but nothing
 		 * else.
 		 */
@@ -1887,7 +1897,6 @@ int journal_invalidatepage(journal_t *jo
 		unsigned int next_off = curr_off + bh->b_size;
 		next = bh->b_this_page;
 
-		/* AKPM: doing lock_buffer here may be overly paranoid */
 		if (offset <= curr_off) {
 			/* This block is wholly outside the truncation point */
 			lock_buffer(bh);
_
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
Badari Pulavarty wrote: Mingming Cao wrote: On Sat, 2005-03-26 at 16:23 -0800, Mingming Cao wrote: On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote: On Fri, 2005-03-25 at 13:56, Andrew Morton wrote: Mingming Cao <[EMAIL PROTECTED]> wrote: I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10 hours the system hit OOM, and OOM keep killing processes one by one. I could reproduce this problem very constantly on a 2 way PIII 700MHZ with 512MB RAM. Also the problem could be reproduced on running the same test on reiser fs. The fsx command is: ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 & I was able to reproduce this on ext3. Seven instances of the above leaked 10-15MB over 10 hours. All of it permanently stuck on the LRU. I'll continue to poke at it - see what kernel it started with, which filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a quick process. I reproduced *similar* issue with 2.6.11. The reason I say similar, is there is no OOM kill, but very low free memory and machine doesn't respond at all. (I booted my machine with 256M memory and ran 20 copies of fsx on ext3). Yes, I re-run the same test on 2.6.11 for 24 hours, like Badari see on his machine, my machine did not go to OOM on 2.6.11,still alive, but memory is very low(only 5M free). Killed all fsx and umount the ext3 filesystem did not bring back much memory. I will going to rerun the tests without the mapped read/write to see what happen. Run fsx tests without mapped IO on 2.6.11 seems fine. Here is the /proc/meminfo after 18 hours run: Mingming, Reproduce it on 2.6.11 with mapped IO tests. That will tell us when the regression started. Sorry - Ignore my request, Mingming already did this work and posted the result. 
Thanks,
Badari
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao wrote:

On Sat, 2005-03-26 at 16:23 -0800, Mingming Cao wrote:
On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote:
On Fri, 2005-03-25 at 13:56, Andrew Morton wrote:
Mingming Cao <[EMAIL PROTECTED]> wrote:

I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10 hours the system hit OOM, and OOM keep killing processes one by one. I could reproduce this problem very constantly on a 2 way PIII 700MHZ with 512MB RAM. Also the problem could be reproduced on running the same test on reiser fs.

The fsx command is:

./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &

I was able to reproduce this on ext3. Seven instances of the above leaked 10-15MB over 10 hours. All of it permanently stuck on the LRU.

I'll continue to poke at it - see what kernel it started with, which filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a quick process.

I reproduced *similar* issue with 2.6.11. The reason I say similar, is there is no OOM kill, but very low free memory and machine doesn't respond at all. (I booted my machine with 256M memory and ran 20 copies of fsx on ext3).

Yes, I re-run the same test on 2.6.11 for 24 hours, like Badari see on his machine, my machine did not go to OOM on 2.6.11, still alive, but memory is very low(only 5M free). Killed all fsx and umount the ext3 filesystem did not bring back much memory. I will going to rerun the tests without the mapped read/write to see what happen.

Run fsx tests without mapped IO on 2.6.11 seems fine. Here is the /proc/meminfo after 18 hours run:

Mingming,

Reproduce it on 2.6.11 with mapped IO tests. That will tell us when the regression started.

Thanks,
Badari
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Sat, 2005-03-26 at 16:23 -0800, Mingming Cao wrote: > On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote: > > On Fri, 2005-03-25 at 13:56, Andrew Morton wrote: > > > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > > > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on > > > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10 > > > > hours the system hit OOM, and OOM keep killing processes one by one. I > > > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with > > > > 512MB RAM. Also the problem could be reproduced on running the same test > > > > on reiser fs. > > > > > > > > The fsx command is: > > > > > > > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 & > > > > > > I was able to reproduce this on ext3. Seven instances of the above leaked > > > 10-15MB over 10 hours. All of it permanently stuck on the LRU. > > > > > > I'll continue to poke at it - see what kernel it started with, which > > > filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a > > > quick process. > > > > I reproduced *similar* issue with 2.6.11. The reason I say similar, is > > there is no OOM kill, but very low free memory and machine doesn't > > respond at all. (I booted my machine with 256M memory and ran 20 copies > > of fsx on ext3). > > > > > > Yes, I re-run the same test on 2.6.11 for 24 hours, like Badari see on > his machine, my machine did not go to OOM on 2.6.11,still alive, but > memory is very low(only 5M free). Killed all fsx and umount the ext3 > filesystem did not bring back much memory. I will going to rerun the > tests without the mapped read/write to see what happen. > > Run fsx tests without mapped IO on 2.6.11 seems fine. 
Here is the /proc/meminfo after 18 hours run:

# cat /proc/meminfo
MemTotal:       510464 kB
MemFree:          6004 kB
Buffers:        179420 kB
Cached:           9144 kB
SwapCached:          0 kB
Active:         313236 kB
Inactive:       171380 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       510464 kB
LowFree:          6004 kB
SwapTotal:     1052216 kB
SwapFree:      1052216 kB
Dirty:            2100 kB
Writeback:           0 kB
Mapped:          24884 kB
Slab:            14788 kB
CommitLimit:   1307448 kB
Committed_AS:    78032 kB
PageTables:        720 kB
VmallocTotal:   516024 kB
VmallocUsed:      1672 kB
VmallocChunk:   514352 kB

elm3b92:~ # killall -9 fsx
elm3b92:~ # cat /proc/meminfo
MemTotal:       510464 kB
MemFree:         21332 kB
Buffers:        179668 kB
Cached:           8828 kB
SwapCached:          0 kB
Active:         298748 kB
Inactive:       171152 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       510464 kB
LowFree:         21332 kB
SwapTotal:     1052216 kB
SwapFree:      1052216 kB
Dirty:            1140 kB
Writeback:           0 kB
Mapped:          11648 kB
Slab:            14632 kB
CommitLimit:   1307448 kB
Committed_AS:    59800 kB
PageTables:        492 kB
VmallocTotal:   516024 kB
VmallocUsed:      1672 kB
VmallocChunk:   514352 kB

elm3b92:~ # umount /mnt/ext3
elm3b92:~ # cat /proc/meminfo
MemTotal:       510464 kB
MemFree:        181636 kB
Buffers:         22092 kB
Cached:           6740 kB
SwapCached:          0 kB
Active:         151284 kB
Inactive:       158948 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       510464 kB
LowFree:        181636 kB
SwapTotal:     1052216 kB
SwapFree:      1052216 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:          11656 kB
Slab:            14052 kB
CommitLimit:   1307448 kB
Committed_AS:    59800 kB
PageTables:        492 kB
VmallocTotal:   516024 kB
VmallocUsed:      1672 kB
VmallocChunk:   514352 kB
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote:
> On Fri, 2005-03-25 at 13:56, Andrew Morton wrote:
> > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > >
> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > > hours the system hit OOM, and OOM keep killing processes one by one. I
> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > > 512MB RAM. Also the problem could be reproduced on running the same test
> > > on reiser fs.
> > >
> > > The fsx command is:
> > >
> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >
> > I was able to reproduce this on ext3. Seven instances of the above leaked
> > 10-15MB over 10 hours. All of it permanently stuck on the LRU.
> >
> > I'll continue to poke at it - see what kernel it started with, which
> > filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a
> > quick process.
>
> I reproduced *similar* issue with 2.6.11. The reason I say similar, is
> there is no OOM kill, but very low free memory and machine doesn't
> respond at all. (I booted my machine with 256M memory and ran 20 copies
> of fsx on ext3).

Yes, I re-ran the same test on 2.6.11 for 24 hours. Like Badari saw on his machine, my machine did not go to OOM on 2.6.11 - still alive, but memory is very low (only 5M free). Killing all fsx and umounting the ext3 filesystem did not bring back much memory. I am going to rerun the tests without the mapped read/write to see what happens.
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Fri, 2005-03-25 at 16:17, Dave Jones wrote:
> On Wed, Mar 23, 2005 at 11:53:04AM -0800, Mingming Cao wrote:
> > The fsx command is:
> >
> > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >
> > I also see fsx tests start to generating report about read bad data
> > about the tests have run for about 9 hours(one hour before of the OOM
> > happen).
>
> Is writing to the same testfile from multiple fsx's supposed to work?
> It sounds like a surefire way to break the consistency checking that it does.
> I'm surprised it lasts 9hrs before it breaks.
>
> In the past I've done tests like..
>
> for i in `seq 1 100`
> do
> 	fsx foo$i &
> done
>
> to make each process use a different test file.

No. We are running on different files - Mingming cut and pasted only a single line from the script.

Thanks,
Badari
Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Wed, Mar 23, 2005 at 11:53:04AM -0800, Mingming Cao wrote:
> The fsx command is:
>
> ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>
> I also see fsx tests start to generating report about read bad data
> about the tests have run for about 9 hours(one hour before of the OOM
> happen).

Is writing to the same testfile from multiple fsx's supposed to work? It sounds like a surefire way to break the consistency checking that it does. I'm surprised it lasts 9hrs before it breaks.

In the past I've done tests like..

for i in `seq 1 100`
do
	fsx foo$i &
done

to make each process use a different test file.

Dave
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Fri, 2005-03-25 at 13:56, Andrew Morton wrote:
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> >
> > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > hours the system hit OOM, and OOM keep killing processes one by one. I
> > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > 512MB RAM. Also the problem could be reproduced on running the same test
> > on reiser fs.
> >
> > The fsx command is:
> >
> > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>
> I was able to reproduce this on ext3. Seven instances of the above leaked
> 10-15MB over 10 hours. All of it permanently stuck on the LRU.
>
> I'll continue to poke at it - see what kernel it started with, which
> filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a
> quick process.

I reproduced *similar* issue with 2.6.11. The reason I say similar, is there is no OOM kill, but very low free memory and machine doesn't respond at all. (I booted my machine with 256M memory and ran 20 copies of fsx on ext3).

Thanks,
Badari
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao <[EMAIL PROTECTED]> wrote:
> I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> hours the system hit OOM, and OOM keep killing processes one by one. I
> could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> 512MB RAM. Also the problem could be reproduced on running the same test
> on reiser fs.
>
> The fsx command is:
>
> ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &

I was able to reproduce this on ext3. Seven instances of the above leaked 10-15MB over 10 hours. All of it permanently stuck on the LRU.

I'll continue to poke at it - see what kernel it started with, which filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a quick process.

Given that you also saw it on reiserfs, it might be a bug in the core mmap/truncate/unmap handling. We'll see.

> I also see fsx tests start to generating report about read bad data
> about the tests have run for about 9 hours(one hour before of the OOM
> happen).

I haven't noticed anything like that.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> On Wed, Mar 23, 2005 at 03:42:32PM -0800, Andrew Morton wrote:
> > I'm suspecting here that we simply leaked a refcount on every darn
> > pagecache page in the machine. Note how mapped memory has shrunk down to
> > less than a megabyte and everything which can be swapped out has been
> > swapped out.
> >
> > If so, then oom-killing everything in the world is pretty inevitable.
>
> Agreed, it looks like a memleak of a page_count (while mapcount is fine).
>
> I would suggest looking after pages part of pagecache (i.e.
> page->mapping not null) that have a mapcount of 0 and a page_count > 1,
> almost all of them should be like that during the memleak, and almost
> none should be like that before the memleak.
>
> This seems unrelated to the bug that started the thread that was clearly
> a slab shrinking issue and not a pagecache memleak.

The vmscan.c changes in -rc1 look harmless enough. That's assuming that 2.6.11 doesn't have the bug.

btw, that new orphaned-page handling code has a printk in it, and nobody has reported it coming out yet...
Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Wed, Mar 23, 2005 at 03:42:32PM -0800, Andrew Morton wrote:
> I'm suspecting here that we simply leaked a refcount on every darn
> pagecache page in the machine. Note how mapped memory has shrunk down to
> less than a megabyte and everything which can be swapped out has been
> swapped out.
>
> If so, then oom-killing everything in the world is pretty inevitable.

Agreed, it looks like a memleak of a page_count (while mapcount is fine).

I would suggest looking for pages that are part of the pagecache (i.e.
page->mapping not null) but have a mapcount of 0 and a page_count > 1;
almost all of them should be like that during the memleak, and almost
none should be like that before the memleak.

This seems unrelated to the bug that started the thread, which was
clearly a slab shrinking issue and not a pagecache memleak.

Thanks.
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
Andrew Morton wrote:
> "Martin J. Bligh" <[EMAIL PROTECTED]> wrote:
> >
> > > > > Nothing beats poking around in a dead machine's guts with kgdb
> > > > > though.
> > > >
> > > > Everyone to his own taste.
> > > >
> > > > But I was surprised by
> > > >
> > > > > SwapTotal: 1052216 kB
> > > > > SwapFree:  1045984 kB
> > > >
> > > > Strange that processes are killed while lots of swap is available.
> >
> > I don't think we're that smart about it. If we're really low on mem, it
> > seems we invoke the OOM killer whether processes are causing the
> > problem or not.
> >
> > OTOH, if we can't free the kernel mem, we don't have much choice, but
> > it's not really helping much ;-)
>
> I'm suspecting here that we simply leaked a refcount on every darn
> pagecache page in the machine. Note how mapped memory has shrunk down to
> less than a megabyte and everything which can be swapped out has been
> swapped out.

That makes sense. We have almost 485MB in active and inactive caches, but
we are not able to reclaim them :(

Active:   243580 kB
Inactive: 242248 kB

> If so, then oom-killing everything in the world is pretty inevitable.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
"Martin J. Bligh" <[EMAIL PROTECTED]> wrote:
>
> > > Nothing beats poking around in a dead machine's guts with kgdb though.
> >
> > Everyone to his own taste.
> >
> > But I was surprised by
> >
> > > SwapTotal: 1052216 kB
> > > SwapFree:  1045984 kB
> >
> > Strange that processes are killed while lots of swap is available.
>
> I don't think we're that smart about it. If we're really low on mem, it
> seems we invoke the OOM killer whether processes are causing the problem
> or not.
>
> OTOH, if we can't free the kernel mem, we don't have much choice, but
> it's not really helping much ;-)

I'm suspecting here that we simply leaked a refcount on every darn
pagecache page in the machine. Note how mapped memory has shrunk down to
less than a megabyte and everything which can be swapped out has been
swapped out.

If so, then oom-killing everything in the world is pretty inevitable.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
> > Nothing beats poking around in a dead machine's guts with kgdb though.
>
> Everyone to his own taste.
>
> But I was surprised by
>
> > SwapTotal: 1052216 kB
> > SwapFree:  1045984 kB
>
> Strange that processes are killed while lots of swap is available.

I don't think we're that smart about it. If we're really low on mem, it
seems we invoke the OOM killer whether processes are causing the problem
or not.

OTOH, if we can't free the kernel mem, we don't have much choice, but
it's not really helping much ;-)

M.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Wed, Mar 23, 2005 at 03:20:55PM -0800, Andrew Morton wrote:
> Nothing beats poking around in a dead machine's guts with kgdb though.

Everyone to his own taste.

But I was surprised by

> SwapTotal: 1052216 kB
> SwapFree:  1045984 kB

Strange that processes are killed while lots of swap is available.

Andries
Re: OOM problems on 2.6.12-rc1 with many fsx tests
"Martin J. Bligh" <[EMAIL PROTECTED]> wrote:
>
> > It would be interesting if you could run the same test on 2.6.11.
>
> One thing I'm finding is that it's hard to backtrace who has each page
> in this sort of situation. My plan is to write a debug patch to walk
> mem_map and dump out some info on each page. I would appreciate ideas
> on what info would be useful here. Some things are fairly obvious, like
> we want to know if it's anon / mapped into an address space (& which),
> whether it's slab / buffers / pagecache etc ... any other suggestions
> you have would be much appreciated.

You could use

  page-owner-tracking-leak-detector.patch
  make-page_owner-handle-non-contiguous-page-ranges.patch
  add-gfp_mask-to-page-owner.patch

which stick an 8-slot stack backtrace into each page, recording who
allocated it. But that's probably not very interesting info for pagecache
pages.

Nothing beats poking around in a dead machine's guts with kgdb though.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
> > I run into the OOM problem again on 2.6.12-rc1. I ran some (20) fsx
> > tests on the 2.6.12-rc1 kernel (and 2.6.11-mm4) on an ext3 filesystem;
> > after about 10 hours the system hit OOM, and the OOM killer kept
> > killing processes one by one.
>
> I don't have a very good record reading these oom dumps lately, but this
> one looks really weird. Basically no mapped memory, tons of pagecache on
> the LRU.
>
> It would be interesting if you could run the same test on 2.6.11.

One thing I'm finding is that it's hard to backtrace who has each page
in this sort of situation. My plan is to write a debug patch to walk
mem_map and dump out some info on each page. I would appreciate ideas on
what info would be useful here. Some things are fairly obvious, like we
want to know if it's anon / mapped into an address space (& which),
whether it's slab / buffers / pagecache etc ... any other suggestions you
have would be much appreciated.

I'm suspecting in many cases we don't keep enough info, and it would be
too slow to keep it in the default case - so I may need to add some extra
debug fields in struct page as a config option, but let's start with what
we have.

M.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao <[EMAIL PROTECTED]> wrote:
>
> I run into the OOM problem again on 2.6.12-rc1. I ran some (20) fsx
> tests on the 2.6.12-rc1 kernel (and 2.6.11-mm4) on an ext3 filesystem;
> after about 10 hours the system hit OOM, and the OOM killer kept killing
> processes one by one.

I don't have a very good record reading these oom dumps lately, but this
one looks really weird. Basically no mapped memory, tons of pagecache on
the LRU.

It would be interesting if you could run the same test on 2.6.11.