Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
Hi,

On Wed, 2005-04-06 at 02:23, Andrew Morton wrote:
> Nobody has noticed the now-fixed leak since 2.6.6 and this one appears to
> be 100x slower. Which is fortunate because this one is going to take a
> long time to fix. I'll poke at it some more.

OK, I'm now at the stage where I can kick off that fsx test on a kernel without your leak fix, kill it, umount and get

Whoops: found 43 unfreeable buffers still on the superblock debug list for sb 0100296b2d48.
Tracing one...
buffer trace for buffer at 0x01003edaa9c8 (I am CPU 0)
...

with a trace pointing to journal_unmap_buffer(). I'll try with the fix in place to see if there are any other cases showing up with the same problem.

--Stephen

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao <[EMAIL PROTECTED]> wrote:
> I run the test(20 instances of fsx) with your patch on 2.6.12-rc1 with
> 512MB RAM (where I were able to constantly re-create the mem leak and
> lead to OOM before). The result is the kernel did not get into OOM after
> about 19 hours(before it took about 9 hours or so), system is still
> responsive. However I did notice about ~60MB delta between Active
> +Inactive and Buffers+cached+Swapcached+Mapped+Slab

yes. Nobody has noticed the now-fixed leak since 2.6.6 and this one appears to be 100x slower. Which is fortunate because this one is going to take a long time to fix. I'll poke at it some more.
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Mon, 2005-04-04 at 13:04 -0700, Andrew Morton wrote: > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > On Sun, 2005-04-03 at 18:35 -0700, Andrew Morton wrote: > > > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > > > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on > > > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10 > > > > hours the system hit OOM, and OOM keep killing processes one by one. I > > > > could reproduce this problem very constantly on a 2 way PIII 700MHZ > > > > with > > > > 512MB RAM. Also the problem could be reproduced on running the same > > > > test > > > > on reiser fs. > > > > > > > > The fsx command is: > > > > > > > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 & > > > > > > > > > This ext3 bug goes all the way back to 2.6.6. > > > > > I don't know yet why you saw problems with reiser3 and I'm pretty sure I > > > saw problems with ext2. More testing is needed there. > > > > > > > We (Janet and I) are chasing this bug as well. Janet is able to > > reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down > > this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will > > double check it. Will try your patch today. > > There's a second leak, with similar-looking symptoms. At ~50 > commits/second it has leaked ~10MB in 24 hours, so it's very slow - less > than a hundredth the rate of the first one. > I run the test(20 instances of fsx) with your patch on 2.6.12-rc1 with 512MB RAM (where I were able to constantly re-create the mem leak and lead to OOM before). The result is the kernel did not get into OOM after about 19 hours(before it took about 9 hours or so), system is still responsive. 
However I did notice about ~60MB delta between Active+Inactive and Buffers+Cached+SwapCached+Mapped+Slab. Here is the current /proc/meminfo:

elm3b92:~ # cat /proc/meminfo
MemTotal:       510400 kB
MemFree:         97004 kB
Buffers:        196772 kB
Cached:          77608 kB
SwapCached:          0 kB
Active:         299064 kB
Inactive:        83140 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       510400 kB
LowFree:         97004 kB
SwapTotal:     1052216 kB
SwapFree:      1052216 kB
Dirty:            1600 kB
Writeback:           0 kB
Mapped:          23256 kB
Slab:            24548 kB
CommitLimit:   1307416 kB
Committed_AS:    73560 kB
PageTables:        532 kB
VmallocTotal:   516024 kB
VmallocUsed:      3700 kB
VmallocChunk:   512320 kB
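[Editor's note: the ~60MB figure quoted above can be checked directly from the meminfo numbers in this message. A minimal standalone sketch of that accounting (the arithmetic is the point, not any kernel interface):

```python
# Estimate "unaccounted" LRU memory: pages on the active/inactive lists
# that the usual consumers (buffer cache, page cache, swap cache,
# mapped pages, slab) do not account for. All values in kB, copied
# from the /proc/meminfo dump quoted above.
meminfo = {
    "Active": 299064, "Inactive": 83140,
    "Buffers": 196772, "Cached": 77608, "SwapCached": 0,
    "Mapped": 23256, "Slab": 24548,
}

lru = meminfo["Active"] + meminfo["Inactive"]
accounted = sum(meminfo[k] for k in
                ("Buffers", "Cached", "SwapCached", "Mapped", "Slab"))
delta_kb = lru - accounted
print(delta_kb)  # 60020 kB -- the "~60MB delta" reported above
```

The same subtraction against a healthy system comes out near zero, which is why the thread uses it as a leak detector.]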
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Hi,

On Mon, 2005-04-04 at 02:35, Andrew Morton wrote:
> Without the below patch it's possible to make ext3 leak at around a
> megabyte per minute by arranging for the fs to run a commit every 50
> milliseconds, btw.

Ouch!

> (Stephen, please review...)

Doing so now.

> The patch teaches journal_unmap_buffer() about buffers which are on the
> committing transaction's t_locked_list. These buffers have been written and
> I/O has completed.

Agreed. The key here is that the buffer is locked before journal_unmap_buffer() is called, so we can indeed rely on it being safely on disk.

> We can take them off the transaction and undirty them
> within the context of journal_invalidatepage()->journal_unmap_buffer().

Right - the committing transaction can't be doing any more writes, and the current transaction has explicitly told us to throw away its own writes if we get here. Unfiling the buffer should be safe.

> +	if (jh->b_jlist == BJ_Locked) {
> +		/*
> +		 * The buffer is on the committing transaction's locked
> +		 * list. We have the buffer locked, so I/O has
> +		 * completed. So we can nail the buffer now.
> +		 */
> +		may_free = __dispose_buffer(jh, transaction);
> +		goto zap_buffer;
> +	}

ACK.

--Stephen
Re: OOM problems on 2.6.12-rc1 with many fsx tests
>> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
>> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
>> > > hours the system hit OOM, and OOM keep killing processes one by one. I
>> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
>> > > 512MB RAM. Also the problem could be reproduced on running the same test
>> > > on reiser fs.
>> > >
>> > > The fsx command is:
>> > >
>> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>> >
>> > This ext3 bug goes all the way back to 2.6.6.
>> >
>> > I don't know yet why you saw problems with reiser3 and I'm pretty sure I
>> > saw problems with ext2. More testing is needed there.
>>
>> We (Janet and I) are chasing this bug as well. Janet is able to
>> reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down
>> this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will
>> double check it. Will try your patch today.
>
> There's a second leak, with similar-looking symptoms. At ~50
> commits/second it has leaked ~10MB in 24 hours, so it's very slow - less
> than a hundredth the rate of the first one.

What are you using to see these with, just kgdb, and a large cranial capacity? Or is there some more magic?

m.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
"Martin J. Bligh" <[EMAIL PROTECTED]> wrote:
> >> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> >> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> >> > > hours the system hit OOM, and OOM keep killing processes one by one. I
> >> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> >> > > 512MB RAM. Also the problem could be reproduced on running the same test
> >> > > on reiser fs.
> >> > >
> >> > > The fsx command is:
> >> > >
> >> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >> >
> >> > This ext3 bug goes all the way back to 2.6.6.
> >> >
> >> > I don't know yet why you saw problems with reiser3 and I'm pretty sure I
> >> > saw problems with ext2. More testing is needed there.
> >>
> >> We (Janet and I) are chasing this bug as well. Janet is able to
> >> reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down
> >> this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will
> >> double check it. Will try your patch today.
> >
> > There's a second leak, with similar-looking symptoms. At ~50
> > commits/second it has leaked ~10MB in 24 hours, so it's very slow - less
> > than a hundredth the rate of the first one.
>
> What are you using to see these with, just kgdb, and a large cranial
> capacity? Or is there some more magic?

Nothing magical: run the test for a while, kill everything, cause a huge swapstorm then look at the meminfo numbers. If active+inactive is significantly larger than cached+buffers+swapcached+mapped (minus a bit) then it's leaked.
Right now I have:

MemTotal:       246264 kB
MemFree:        196148 kB
Buffers:          4200 kB
Cached:           3308 kB
SwapCached:       8064 kB
Active:          21548 kB
Inactive:        12532 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       246264 kB
LowFree:        196148 kB
SwapTotal:     1020116 kB
SwapFree:      1001612 kB
Dirty:              60 kB
Writeback:           0 kB
Mapped:           2284 kB
Slab:            12200 kB
CommitLimit:   1143248 kB
Committed_AS:    34004 kB
PageTables:       1200 kB
VmallocTotal:   774136 kB
VmallocUsed:     82832 kB
VmallocChunk:   691188 kB
HugePages_Total:     0
HugePages_Free:      0

33 megs on the LRU, unaccounted for in other places.

Once the leak is nice and large I can start a new swapstorm, set a breakpoint in try_to_free_buffers() (for example) and start looking at the state of the page and its buffers.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao <[EMAIL PROTECTED]> wrote:
> On Sun, 2005-04-03 at 18:35 -0700, Andrew Morton wrote:
> > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > >
> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > > hours the system hit OOM, and OOM keep killing processes one by one. I
> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > > 512MB RAM. Also the problem could be reproduced on running the same test
> > > on reiser fs.
> > >
> > > The fsx command is:
> > >
> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >
> > This ext3 bug goes all the way back to 2.6.6.
> >
> > I don't know yet why you saw problems with reiser3 and I'm pretty sure I
> > saw problems with ext2. More testing is needed there.
>
> We (Janet and I) are chasing this bug as well. Janet is able to
> reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down
> this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will
> double check it. Will try your patch today.

There's a second leak, with similar-looking symptoms. At ~50 commits/second it has leaked ~10MB in 24 hours, so it's very slow - less than a hundredth the rate of the first one.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Sun, 2005-04-03 at 18:35 -0700, Andrew Morton wrote:
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> >
> > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > hours the system hit OOM, and OOM keep killing processes one by one. I
> > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > 512MB RAM. Also the problem could be reproduced on running the same test
> > on reiser fs.
> >
> > The fsx command is:
> >
> > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>
> This ext3 bug goes all the way back to 2.6.6.
>
> I don't know yet why you saw problems with reiser3 and I'm pretty sure I
> saw problems with ext2. More testing is needed there.

We (Janet and I) are chasing this bug as well. Janet is able to reproduce this bug on 2.6.9 but I can't. Glad to know you have nailed down this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will double-check it. Will try your patch today.

Thanks,
Mingming
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao <[EMAIL PROTECTED]> wrote: > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10 > hours the system hit OOM, and OOM keep killing processes one by one. I > could reproduce this problem very constantly on a 2 way PIII 700MHZ with > 512MB RAM. Also the problem could be reproduced on running the same test > on reiser fs. > > The fsx command is: > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 & This ext3 bug goes all the way back to 2.6.6. I don't know yet why you saw problems with reiser3 and I'm pretty sure I saw problems with ext2. More testing is needed there. Without the below patch it's possible to make ext3 leak at around a megabyte per minute by arranging for the fs to run a commit every 50 milliseconds, btw. (Stephen, please review...) This fixes the lots-of-fsx-linux-instances-cause-a-slow-leak bug. It's been there since 2.6.6, caused by: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mm4/broken-out/jbd-move-locked-buffers.patch That patch moves under-writeout ordered-data buffers onto a separate journal list during commit. It took out the old code which was based on a single list. The old code (necessarily) had logic which would restart I/O against buffers which had been redirtied while they were on the committing transaction's t_sync_datalist list. The new code only writes buffers once, ignoring redirtyings by a later transaction, which is good. But over on the truncate side of things, in journal_unmap_buffer(), we're treating buffers on the t_locked_list as inviolable things which belong to the committing transaction, and we just leave them alone during concurrent truncate-vs-commit. The net effect is that when truncate tries to invalidate a page whose buffers are on t_locked_list and have been redirtied, journal_unmap_buffer() just leaves those buffers alone. 
truncate will remove the page from its mapping and we end up with an anonymous clean page with dirty buffers, which is an illegal state for a page. The JBD commit will not clean those buffers as they are removed from t_locked_list. The VM (try_to_free_buffers) cannot reclaim these pages.

The patch teaches journal_unmap_buffer() about buffers which are on the committing transaction's t_locked_list. These buffers have been written and I/O has completed. We can take them off the transaction and undirty them within the context of journal_invalidatepage()->journal_unmap_buffer().

Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 25-akpm/fs/jbd/transaction.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff -puN fs/jbd/transaction.c~jbd-dirty-buffer-leak-fix fs/jbd/transaction.c
--- 25/fs/jbd/transaction.c~jbd-dirty-buffer-leak-fix	2005-04-03 15:12:12.0 -0700
+++ 25-akpm/fs/jbd/transaction.c	2005-04-03 15:14:40.0 -0700
@@ -1812,7 +1812,17 @@ static int journal_unmap_buffer(journal_
 			}
 		}
 	} else if (transaction == journal->j_committing_transaction) {
-		/* If it is committing, we simply cannot touch it. We
+		if (jh->b_jlist == BJ_Locked) {
+			/*
+			 * The buffer is on the committing transaction's locked
+			 * list. We have the buffer locked, so I/O has
+			 * completed. So we can nail the buffer now.
+			 */
+			may_free = __dispose_buffer(jh, transaction);
+			goto zap_buffer;
+		}
+		/*
+		 * If it is committing, we simply cannot touch it. We
 		 * can remove it's next_transaction pointer from the
 		 * running transaction if that is set, but nothing
 		 * else.
 		 */
@@ -1887,7 +1897,6 @@ int journal_invalidatepage(journal_t *jo
 		unsigned int next_off = curr_off + bh->b_size;
 		next = bh->b_this_page;
 
-		/* AKPM: doing lock_buffer here may be overly paranoid */
 		if (offset <= curr_off) {
 			/* This block is wholly outside the truncation point */
 			lock_buffer(bh);
_
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
Badari Pulavarty wrote: Mingming Cao wrote: On Sat, 2005-03-26 at 16:23 -0800, Mingming Cao wrote: On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote: On Fri, 2005-03-25 at 13:56, Andrew Morton wrote: Mingming Cao <[EMAIL PROTECTED]> wrote: I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10 hours the system hit OOM, and OOM keep killing processes one by one. I could reproduce this problem very constantly on a 2 way PIII 700MHZ with 512MB RAM. Also the problem could be reproduced on running the same test on reiser fs. The fsx command is: ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 & I was able to reproduce this on ext3. Seven instances of the above leaked 10-15MB over 10 hours. All of it permanently stuck on the LRU. I'll continue to poke at it - see what kernel it started with, which filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a quick process. I reproduced *similar* issue with 2.6.11. The reason I say similar, is there is no OOM kill, but very low free memory and machine doesn't respond at all. (I booted my machine with 256M memory and ran 20 copies of fsx on ext3). Yes, I re-run the same test on 2.6.11 for 24 hours, like Badari see on his machine, my machine did not go to OOM on 2.6.11,still alive, but memory is very low(only 5M free). Killed all fsx and umount the ext3 filesystem did not bring back much memory. I will going to rerun the tests without the mapped read/write to see what happen. Run fsx tests without mapped IO on 2.6.11 seems fine. Here is the /proc/meminfo after 18 hours run: Mingming, Reproduce it on 2.6.11 with mapped IO tests. That will tell us when the regression started. Sorry - Ignore my request, Mingming already did this work and posted the result. 
Thanks,
Badari
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao wrote:

On Sat, 2005-03-26 at 16:23 -0800, Mingming Cao wrote:
On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote:
On Fri, 2005-03-25 at 13:56, Andrew Morton wrote:
Mingming Cao <[EMAIL PROTECTED]> wrote:

I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10 hours the system hit OOM, and OOM keep killing processes one by one. I could reproduce this problem very constantly on a 2 way PIII 700MHZ with 512MB RAM. Also the problem could be reproduced on running the same test on reiser fs.

The fsx command is:

./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &

I was able to reproduce this on ext3. Seven instances of the above leaked 10-15MB over 10 hours. All of it permanently stuck on the LRU.

I'll continue to poke at it - see what kernel it started with, which filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a quick process.

I reproduced *similar* issue with 2.6.11. The reason I say similar, is there is no OOM kill, but very low free memory and machine doesn't respond at all. (I booted my machine with 256M memory and ran 20 copies of fsx on ext3).

Yes, I re-run the same test on 2.6.11 for 24 hours, like Badari see on his machine, my machine did not go to OOM on 2.6.11, still alive, but memory is very low(only 5M free). Killed all fsx and umount the ext3 filesystem did not bring back much memory. I will going to rerun the tests without the mapped read/write to see what happen.

Run fsx tests without mapped IO on 2.6.11 seems fine. Here is the /proc/meminfo after 18 hours run:

Mingming,

Reproduce it on 2.6.11 with mapped IO tests. That will tell us when the regression started.

Thanks,
Badari
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Sat, 2005-03-26 at 16:23 -0800, Mingming Cao wrote: > On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote: > > On Fri, 2005-03-25 at 13:56, Andrew Morton wrote: > > > Mingming Cao <[EMAIL PROTECTED]> wrote: > > > > > > > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on > > > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10 > > > > hours the system hit OOM, and OOM keep killing processes one by one. I > > > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with > > > > 512MB RAM. Also the problem could be reproduced on running the same test > > > > on reiser fs. > > > > > > > > The fsx command is: > > > > > > > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 & > > > > > > I was able to reproduce this on ext3. Seven instances of the above leaked > > > 10-15MB over 10 hours. All of it permanently stuck on the LRU. > > > > > > I'll continue to poke at it - see what kernel it started with, which > > > filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a > > > quick process. > > > > I reproduced *similar* issue with 2.6.11. The reason I say similar, is > > there is no OOM kill, but very low free memory and machine doesn't > > respond at all. (I booted my machine with 256M memory and ran 20 copies > > of fsx on ext3). > > > > > > Yes, I re-run the same test on 2.6.11 for 24 hours, like Badari see on > his machine, my machine did not go to OOM on 2.6.11,still alive, but > memory is very low(only 5M free). Killed all fsx and umount the ext3 > filesystem did not bring back much memory. I will going to rerun the > tests without the mapped read/write to see what happen. > > Run fsx tests without mapped IO on 2.6.11 seems fine. 
Here is the /proc/meminfo after 18 hours run:

# cat /proc/meminfo
MemTotal:       510464 kB
MemFree:          6004 kB
Buffers:        179420 kB
Cached:           9144 kB
SwapCached:          0 kB
Active:         313236 kB
Inactive:       171380 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       510464 kB
LowFree:          6004 kB
SwapTotal:     1052216 kB
SwapFree:      1052216 kB
Dirty:            2100 kB
Writeback:           0 kB
Mapped:          24884 kB
Slab:            14788 kB
CommitLimit:   1307448 kB
Committed_AS:    78032 kB
PageTables:        720 kB
VmallocTotal:   516024 kB
VmallocUsed:      1672 kB
VmallocChunk:   514352 kB

elm3b92:~ # killall -9 fsx
elm3b92:~ # cat /proc/meminfo
MemTotal:       510464 kB
MemFree:         21332 kB
Buffers:        179668 kB
Cached:           8828 kB
SwapCached:          0 kB
Active:         298748 kB
Inactive:       171152 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       510464 kB
LowFree:         21332 kB
SwapTotal:     1052216 kB
SwapFree:      1052216 kB
Dirty:            1140 kB
Writeback:           0 kB
Mapped:          11648 kB
Slab:            14632 kB
CommitLimit:   1307448 kB
Committed_AS:    59800 kB
PageTables:        492 kB
VmallocTotal:   516024 kB
VmallocUsed:      1672 kB
VmallocChunk:   514352 kB

elm3b92:~ # umount /mnt/ext3
elm3b92:~ # cat /proc/meminfo
MemTotal:       510464 kB
MemFree:        181636 kB
Buffers:         22092 kB
Cached:           6740 kB
SwapCached:          0 kB
Active:         151284 kB
Inactive:       158948 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       510464 kB
LowFree:        181636 kB
SwapTotal:     1052216 kB
SwapFree:      1052216 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:          11656 kB
Slab:            14052 kB
CommitLimit:   1307448 kB
Committed_AS:    59800 kB
PageTables:        492 kB
VmallocTotal:   516024 kB
VmallocUsed:      1672 kB
VmallocChunk:   514352 kB
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote:
> On Fri, 2005-03-25 at 13:56, Andrew Morton wrote:
> > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > >
> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > > hours the system hit OOM, and OOM keep killing processes one by one. I
> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > > 512MB RAM. Also the problem could be reproduced on running the same test
> > > on reiser fs.
> > >
> > > The fsx command is:
> > >
> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >
> > I was able to reproduce this on ext3. Seven instances of the above leaked
> > 10-15MB over 10 hours. All of it permanently stuck on the LRU.
> >
> > I'll continue to poke at it - see what kernel it started with, which
> > filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a
> > quick process.
>
> I reproduced *similar* issue with 2.6.11. The reason I say similar, is
> there is no OOM kill, but very low free memory and machine doesn't
> respond at all. (I booted my machine with 256M memory and ran 20 copies
> of fsx on ext3).

Yes, I re-ran the same test on 2.6.11 for 24 hours. Like Badari saw on his machine, my machine did not go to OOM on 2.6.11 - still alive, but memory is very low (only 5M free). Killing all fsx and umounting the ext3 filesystem did not bring back much memory. I am going to rerun the tests without the mapped read/write to see what happens.
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Fri, 2005-03-25 at 16:17, Dave Jones wrote:
> On Wed, Mar 23, 2005 at 11:53:04AM -0800, Mingming Cao wrote:
> > The fsx command is:
> >
> > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >
> > I also see fsx tests start to generating report about read bad data
> > about the tests have run for about 9 hours(one hour before of the OOM
> > happen).
>
> Is writing to the same testfile from multiple fsx's supposed to work?
> It sounds like a surefire way to break the consistency checking that it does.
> I'm surprised it lasts 9hrs before it breaks.
>
> In the past I've done tests like..
>
> for i in `seq 1 100`
> do
> 	fsx foo$i &
> done
>
> to make each process use a different test file.

No. We are running on different files - Mingming cut and pasted only a single line from the script.

Thanks,
Badari
Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Wed, Mar 23, 2005 at 11:53:04AM -0800, Mingming Cao wrote:
> The fsx command is:
>
> ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>
> I also see fsx tests start to generating report about read bad data
> about the tests have run for about 9 hours(one hour before of the OOM
> happen).

Is writing to the same testfile from multiple fsx's supposed to work? It sounds like a surefire way to break the consistency checking that it does. I'm surprised it lasts 9hrs before it breaks.

In the past I've done tests like..

for i in `seq 1 100`
do
	fsx foo$i &
done

to make each process use a different test file.

Dave
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Fri, 2005-03-25 at 13:56, Andrew Morton wrote:
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> >
> > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > hours the system hit OOM, and OOM keep killing processes one by one. I
> > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > 512MB RAM. Also the problem could be reproduced on running the same test
> > on reiser fs.
> >
> > The fsx command is:
> >
> > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>
> I was able to reproduce this on ext3. Seven instances of the above leaked
> 10-15MB over 10 hours. All of it permanently stuck on the LRU.
>
> I'll continue to poke at it - see what kernel it started with, which
> filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a
> quick process.

I reproduced *similar* issue with 2.6.11. The reason I say similar, is there is no OOM kill, but very low free memory and machine doesn't respond at all. (I booted my machine with 256M memory and ran 20 copies of fsx on ext3).

Thanks,
Badari
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao <[EMAIL PROTECTED]> wrote:
> I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> hours the system hit OOM, and OOM keep killing processes one by one. I
> could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> 512MB RAM. Also the problem could be reproduced on running the same test
> on reiser fs.
>
> The fsx command is:
>
> ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &

I was able to reproduce this on ext3. Seven instances of the above leaked 10-15MB over 10 hours. All of it permanently stuck on the LRU.

I'll continue to poke at it - see what kernel it started with, which filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a quick process.

Given that you also saw it on reiserfs, it might be a bug in the core mmap/truncate/unmap handling. We'll see.

> I also see fsx tests start to generating report about read bad data
> about the tests have run for about 9 hours(one hour before of the OOM
> happen).

I haven't noticed anything like that.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> On Wed, Mar 23, 2005 at 03:42:32PM -0800, Andrew Morton wrote:
> > I'm suspecting here that we simply leaked a refcount on every darn
> > pagecache page in the machine. Note how mapped memory has shrunk down to
> > less than a megabyte and everything which can be swapped out has been
> > swapped out.
> >
> > If so, then oom-killing everything in the world is pretty inevitable.
>
> Agreed, it looks like a memleak of a page_count (while mapcount is fine).
>
> I would suggest looking after pages part of pagecache (i.e.
> page->mapping not null) that have a mapcount of 0 and a page_count > 1,
> almost all of them should be like that during the memleak, and almost
> none should be like that before the memleak.
>
> This seems unrelated to the bug that started the thread that was clearly
> a slab shrinking issue and not a pagecache memleak.

The vmscan.c changes in -rc1 look harmless enough. That's assuming that 2.6.11 doesn't have the bug.

btw, that new orphaned-page handling code has a printk in it, and nobody has reported it coming out yet...
Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Wed, Mar 23, 2005 at 03:42:32PM -0800, Andrew Morton wrote:
> I'm suspecting here that we simply leaked a refcount on every darn
> pagecache page in the machine. Note how mapped memory has shrunk down to
> less than a megabyte and everything which can be swapped out has been
> swapped out.
>
> If so, then oom-killing everything in the world is pretty inevitable.

Agreed, it looks like a memleak of a page_count (while mapcount is fine).

I would suggest looking for pages that are part of the pagecache (i.e.
page->mapping not null) but have a mapcount of 0 and a page_count > 1;
almost all of them should be like that during the memleak, and almost
none should be like that before the memleak.

This seems unrelated to the bug that started the thread, which was
clearly a slab shrinking issue and not a pagecache memleak.

Thanks.
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
Andrew Morton wrote:
> "Martin J. Bligh" <[EMAIL PROTECTED]> wrote:
> >
> > > > > Nothing beats poking around in a dead machine's guts with kgdb
> > > > > though.
> > > >
> > > > Everyone to his own taste.
> > > >
> > > > But I was surprised by
> > > >
> > > > > SwapTotal: 1052216 kB
> > > > > SwapFree:  1045984 kB
> > > >
> > > > Strange that processes are killed while lots of swap is available.
> >
> > I don't think we're that smart about it. If we're really low on mem, it
> > seems we invoke the OOM killer whether processes are causing the
> > problem or not.
> >
> > OTOH, if we can't free the kernel mem, we don't have much choice, but
> > it's not really helping much ;-)
>
> I'm suspecting here that we simply leaked a refcount on every darn
> pagecache page in the machine. Note how mapped memory has shrunk down to
> less than a megabyte and everything which can be swapped out has been
> swapped out.

That makes sense. We have almost 485MB in active and inactive caches, but
we are not able to reclaim them :(

Active:   243580 kB
Inactive: 242248 kB

> If so, then oom-killing everything in the world is pretty inevitable.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
"Martin J. Bligh" <[EMAIL PROTECTED]> wrote:
>
> > > Nothing beats poking around in a dead machine's guts with kgdb though.
> >
> > Everyone to his own taste.
> >
> > But I was surprised by
> >
> > > SwapTotal: 1052216 kB
> > > SwapFree:  1045984 kB
> >
> > Strange that processes are killed while lots of swap is available.
>
> I don't think we're that smart about it. If we're really low on mem, it
> seems we invoke the OOM killer whether processes are causing the problem
> or not.
>
> OTOH, if we can't free the kernel mem, we don't have much choice, but
> it's not really helping much ;-)

I'm suspecting here that we simply leaked a refcount on every darn
pagecache page in the machine. Note how mapped memory has shrunk down to
less than a megabyte and everything which can be swapped out has been
swapped out.

If so, then oom-killing everything in the world is pretty inevitable.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
> > Nothing beats poking around in a dead machine's guts with kgdb though.
>
> Everyone to his own taste.
>
> But I was surprised by
>
> > SwapTotal: 1052216 kB
> > SwapFree:  1045984 kB
>
> Strange that processes are killed while lots of swap is available.

I don't think we're that smart about it. If we're really low on mem, it
seems we invoke the OOM killer whether processes are causing the problem
or not.

OTOH, if we can't free the kernel mem, we don't have much choice, but
it's not really helping much ;-)

M.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
On Wed, Mar 23, 2005 at 03:20:55PM -0800, Andrew Morton wrote:
> Nothing beats poking around in a dead machine's guts with kgdb though.

Everyone to his own taste.

But I was surprised by

> SwapTotal: 1052216 kB
> SwapFree:  1045984 kB

Strange that processes are killed while lots of swap is available.

Andries
Re: OOM problems on 2.6.12-rc1 with many fsx tests
"Martin J. Bligh" <[EMAIL PROTECTED]> wrote:
>
> > It would be interesting if you could run the same test on 2.6.11.
>
> One thing I'm finding is that it's hard to backtrace who has each page
> in this sort of situation. My plan is to write a debug patch to walk
> mem_map and dump out some info on each page. I would appreciate ideas
> on what info would be useful here. Some things are fairly obvious, like
> we want to know if it's anon / mapped into an address space (& which),
> whether it's slab / buffers / pagecache etc ... any other suggestions
> you have would be much appreciated.

You could use

  page-owner-tracking-leak-detector.patch
  make-page_owner-handle-non-contiguous-page-ranges.patch
  add-gfp_mask-to-page-owner.patch

which stick an 8-slot stack backtrace into each page, recording who
allocated it. But that's probably not very interesting info for pagecache
pages.

Nothing beats poking around in a dead machine's guts with kgdb though.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
> > I run into the OOM problem again on 2.6.12-rc1. I ran some (20) fsx
> > tests on the 2.6.12-rc1 kernel (and 2.6.11-mm4) on an ext3 filesystem;
> > after about 10 hours the system hit OOM, and the OOM killer kept
> > killing processes one by one.
>
> I don't have a very good record reading these oom dumps lately, but this
> one looks really weird. Basically no mapped memory, tons of pagecache on
> the LRU.
>
> It would be interesting if you could run the same test on 2.6.11.

One thing I'm finding is that it's hard to backtrace who has each page
in this sort of situation. My plan is to write a debug patch to walk
mem_map and dump out some info on each page. I would appreciate ideas on
what info would be useful here. Some things are fairly obvious, like we
want to know if it's anon / mapped into an address space (& which),
whether it's slab / buffers / pagecache etc ... any other suggestions you
have would be much appreciated.

I'm suspecting in many cases we don't keep enough info, and it would be
too slow to keep it in the default case - so I may need to add some extra
debug fields in struct page as a config option, but let's start with what
we have.

M.
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Mingming Cao <[EMAIL PROTECTED]> wrote:
>
> I run into the OOM problem again on 2.6.12-rc1. I ran some (20) fsx
> tests on the 2.6.12-rc1 kernel (and 2.6.11-mm4) on an ext3 filesystem;
> after about 10 hours the system hit OOM, and the OOM killer kept killing
> processes one by one.

I don't have a very good record reading these oom dumps lately, but this
one looks really weird. Basically no mapped memory, tons of pagecache on
the LRU.

It would be interesting if you could run the same test on 2.6.11.