Re: [RFC] Reproducible OOM with partial workaround
Dear Andrew,

>>> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
>> Please see below ...
> ... Was this dump taken when the system was at or near oom?

No, that was a "quiescent" machine. Please see a just-before-OOM dump in
my next message (in a little while).

> Please send a copy of the oom-killer kernel message dump, if you still
> have one.

Please see one in next message, or in http://bugs.debian.org/695182

>> I tried setting dirty_ratio to "funny" values, that did not seem to
>> help.
> Did you try setting it as low as possible?

Probably. Maybe. Sorry, cannot say with certainty.

>> Did you notice my patch about bdi_position_ratio(), how it was
>> plain wrong half the time (for negative x)?
> Nope, please resend.

Quoting from
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;att=1;bug=695182 :

...
 - In bdi_position_ratio() get the difference (setpoint - dirty) right
   even when it is negative, which happens often. Normally these numbers
   are "small" and even with the left-shift I never observed a 32-bit
   overflow. I believe it should be possible to re-write the whole
   function in 32-bit ints; maybe it is not worth the effort to make it
   "efficient"; seeing how this function was always wrong and we
   survived, it should simply be removed.
...

--- mm/page-writeback.c.old	2012-10-17 13:50:15.0 +1100
+++ mm/page-writeback.c	2013-01-06 21:54:59.0 +1100

[ Line numbers out because other patches not shown ]

...
@@ -559,7 +578,7 @@ static unsigned long bdi_position_ratio(
 	 * => fast response on large errors; small oscillation near setpoint
 	 */
 	setpoint = (freerun + limit) / 2;
-	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
 		    limit - setpoint + 1);
 	pos_ratio = x;
 	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
...
Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Reproducible OOM with partial workaround
On Fri, 11 Jan 2013 22:51:35 +1100 paul.sz...@sydney.edu.au wrote:

> Dear Andrew,
>
> > Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
>
> Please see below: I do not know what any of that means. This machine has
> been running just fine, with all my users logging in here via XDMCP from
> X-terminals, dozens logged in simultaneously. (But, I think I could make
> it go OOM with more processes or logins.)

I'm counting 107MB in slab there. Was this dump taken when the system
was at or near oom?

Please send a copy of the oom-killer kernel message dump, if you still
have one.

> > If so, you *may* be able to work around this by setting
> > /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
> > amount of dirty pagecache around. Then, with luck, if we haven't
> > broken the buffer_heads_over_limit logic in the past decade (we
> > probably have), the VM should be able to reclaim those buffer_heads.
>
> I tried setting dirty_ratio to "funny" values, that did not seem to
> help.

Did you try setting it as low as possible?

> Did you notice my patch about bdi_position_ratio(), how it was
> plain wrong half the time (for negative x)?

Nope, please resend.

> Anyway that did not help.
>
> > Alternatively, use a filesystem which doesn't attach buffer_heads to
> > dirty pages. xfs or btrfs, perhaps.
>
> Seems there is also a problem not related to filesystem... or rather,
> the essence does not seem to be filesystem or caches. The filesystem
> thing now seems OK with my patch doing drop_caches.

hm, if doing a regular drop_caches fixes things then that implies the
problem is not with dirty pagecache. Odd.
Re: [RFC] Reproducible OOM with partial workaround
On 01/10/2013 05:46 PM, paul.sz...@sydney.edu.au wrote:
>> ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
>> kernel without either violating the ABI (3GB/1GB split) or doing
>> something that never got merged upstream ...
>
> Sorry to be so contradictory:
>
> psz@como:~$ uname -a
> Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5
> 18:34:25 EST 2013 i686 GNU/Linux
> psz@como:~$ free -l
>              total       used       free     shared    buffers     cached
> Mem:      64446900    4729292   59717608          0      15972     480520
> Low:        375836     304400      71436
> High:     64071064    4424892   59646172
> -/+ buffers/cache:    4232800   60214100
> Swap:    134217724          0  134217724

Hey, that's pretty cool! I would swear that the mem_map[] overhead was
such that they wouldn't boot, but perhaps those brain cells died on me.
Re: [RFC] Reproducible OOM with partial workaround
Dear Andrew,

> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.

Please see below: I do not know what any of that means. This machine has
been running just fine, with all my users logging in here via XDMCP from
X-terminals, dozens logged in simultaneously. (But, I think I could make
it go OOM with more processes or logins.)

> If so, you *may* be able to work around this by setting
> /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
> amount of dirty pagecache around. Then, with luck, if we haven't
> broken the buffer_heads_over_limit logic in the past decade (we
> probably have), the VM should be able to reclaim those buffer_heads.

I tried setting dirty_ratio to "funny" values, that did not seem to
help. Did you notice my patch about bdi_position_ratio(), how it was
plain wrong half the time (for negative x)? Anyway that did not help.

> Alternatively, use a filesystem which doesn't attach buffer_heads to
> dirty pages. xfs or btrfs, perhaps.

Seems there is also a problem not related to filesystem... or rather,
the essence does not seem to be filesystem or caches. The filesystem
thing now seems OK with my patch doing drop_caches.
Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia

---

root@como:~# free -lm
             total       used       free     shared    buffers     cached
Mem:         62936       2317      60618          0         41        635
Low:           367        271         95
High:        62569       2045      60523
-/+ buffers/cache:       1640      61295
Swap:       131071          0     131071
root@como:~# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
fuse_request             0      0    376   43    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_inode               0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
bsg_cmd                  0      0    288   28    2 : tunables    0    0    0 : slabdata      0      0      0
ntfs_big_inode_cache     0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
ntfs_inode_cache         0      0    176   46    2 : tunables    0    0    0 : slabdata      0      0      0
nfs_direct_cache         0      0     80   51    1 : tunables    0    0    0 : slabdata      0      0      0
nfs_inode_cache       5404   5404    584   28    4 : tunables    0    0    0 : slabdata    193    193      0
isofs_inode_cache        0      0    360   45    4 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache          0      0    408   40    4 : tunables    0    0    0 : slabdata      0      0      0
fat_cache                0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
jbd2_revoke_record       0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
journal_handle        5440   5440     24  170    1 : tunables    0    0    0 : slabdata     32     32      0
journal_head         16768  16768     64   64    1 : tunables    0    0    0 : slabdata    262    262      0
revoke_record        20224  20224     16  256    1 : tunables    0    0    0 : slabdata     79     79      0
ext4_inode_cache         0      0    584   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_free_data           0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_allocation_context  0      0    112   36    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_prealloc_space      0      0     72   56    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_end              0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_page             0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache         0      0    480   34    4 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache     16531  19965    488   33    4 : tunables    0    0    0 : slabdata    605    605      0
ext3_xattr               0      0     48   85    1 : tunables    0    0    0 : slabdata      0      0      0
dquot                  840    840    192   42    2 : tunables    0    0    0 : slabdata     20     20      0
rpc_inode_cache        144    144    448   36    4 : tunables    0    0    0 : slabdata      4      4      0
UDP-Lite                 0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
xfrm_dst_cache           0      0    320   51    4 : tunables    0    0    0 : slabdata      0      0      0
UDP                    896    896    576   28    4 : tunables    0    0    0 : slabdata     32     32      0
tw_sock_TCP           1344   1344    128   32    1 : tunables    0    0    0 : slabdata     42     42      0
TCP                   1457   1624   1152   28
Re: [RFC] Reproducible OOM with partial workaround
On Fri, 2013-01-11 at 00:01 -0800, Andrew Morton wrote:
> On Fri, 11 Jan 2013 12:46:15 +1100 paul.sz...@sydney.edu.au wrote:
>
> > > ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
> > > kernel without either violating the ABI (3GB/1GB split) or doing
> > > something that never got merged upstream ...
> >
> > Sorry to be so contradictory:
> >
> > psz@como:~$ uname -a
> > Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5
> > 18:34:25 EST 2013 i686 GNU/Linux
> > psz@como:~$ free -l
> >              total       used       free     shared    buffers     cached
> > Mem:      64446900    4729292   59717608          0      15972     480520
> > Low:        375836     304400      71436
> > High:     64071064    4424892   59646172
> > -/+ buffers/cache:    4232800   60214100
> > Swap:    134217724          0  134217724
> > psz@como:~$
> >
> > (though I would not know about violations).
> >
> > But OK, I take your point that I should move with the times.
>
> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
>
> If so, you *may* be able to work around this by setting
> /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
> amount of dirty pagecache around. Then, with luck, if we haven't
> broken the buffer_heads_over_limit logic in the past decade (we
> probably have), the VM should be able to reclaim those buffer_heads.
>
> Alternatively, use a filesystem which doesn't attach buffer_heads to
> dirty pages. xfs or btrfs, perhaps.

Hi Andrew,

What's the meaning of attaching buffer_heads to dirty pages?
Re: [RFC] Reproducible OOM with partial workaround
On Fri, 11 Jan 2013 12:46:15 +1100 paul.sz...@sydney.edu.au wrote:

> > ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
> > kernel without either violating the ABI (3GB/1GB split) or doing
> > something that never got merged upstream ...
>
> Sorry to be so contradictory:
>
> psz@como:~$ uname -a
> Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5
> 18:34:25 EST 2013 i686 GNU/Linux
> psz@como:~$ free -l
>              total       used       free     shared    buffers     cached
> Mem:      64446900    4729292   59717608          0      15972     480520
> Low:        375836     304400      71436
> High:     64071064    4424892   59646172
> -/+ buffers/cache:    4232800   60214100
> Swap:    134217724          0  134217724
> psz@como:~$
>
> (though I would not know about violations).
>
> But OK, I take your point that I should move with the times.

Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.

If so, you *may* be able to work around this by setting
/proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
amount of dirty pagecache around. Then, with luck, if we haven't
broken the buffer_heads_over_limit logic in the past decade (we
probably have), the VM should be able to reclaim those buffer_heads.

Alternatively, use a filesystem which doesn't attach buffer_heads to
dirty pages. xfs or btrfs, perhaps.
Re: [RFC] Reproducible OOM with partial workaround
Dear Dave,

> ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
> kernel without either violating the ABI (3GB/1GB split) or doing
> something that never got merged upstream ...

Sorry to be so contradictory:

psz@como:~$ uname -a
Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5
18:34:25 EST 2013 i686 GNU/Linux
psz@como:~$ free -l
             total       used       free     shared    buffers     cached
Mem:      64446900    4729292   59717608          0      15972     480520
Low:        375836     304400      71436
High:     64071064    4424892   59646172
-/+ buffers/cache:    4232800   60214100
Swap:    134217724          0  134217724
psz@como:~$

(though I would not know about violations).

But OK, I take your point that I should move with the times.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia
Re: [RFC] Reproducible OOM with partial workaround
On 01/10/2013 04:46 PM, paul.sz...@sydney.edu.au wrote:
> > Your configuration has never worked. This isn't a regression ...
> > ... does not mean that we expect it to work.
>
> Do you mean that CONFIG_HIGHMEM64G is deprecated, should not be used;
> that all development is for 64-bit only?

My last 4GB laptop had a 1GB hole and needed HIGHMEM64G since it had RAM
at 0->5GB. That worked just fine, btw. The problem isn't with
HIGHMEM64G itself. I'm not saying HIGHMEM64G is inherently bad, just
that it gets gradually worse and worse as you add more RAM. I don't
believe 64GB of RAM has _ever_ been booted on a 32-bit kernel without
either violating the ABI (3GB/1GB split) or doing something that never
got merged upstream (that 4GB/4GB split, or other fun stuff like page
clustering).

> I find it puzzling that there seems to be a sharp cutoff at 32GB RAM,
> no problem under but OOM just over; whereas I would have expected
> lowmem starvation to be gradual, with OOM occurring much sooner with
> 64GB than with 34GB. Also, the kernel seems capable of reclaiming
> lowmem, so I wonder why does that fail just over the 32GB threshold.
> (Obviously I have no idea what I am talking about.)

It _is_ puzzling. It isn't immediately obvious to me why the slab that
you have isn't being reclaimed. There might, indeed, be a fixable bug
there. But, there are probably a bunch more bugs which will keep you
from having a nice, smoothly-running system; mostly those bugs have not
had much attention in the 10 years or so since 64-bit x86 became
commonplace.

Plus, even 10 years ago, when folks were working on this actively, we
_never_ got things running smoothly on 32GB of RAM. Take a look at this:

http://support.bull.com/ols/product/system/linux/redhat/help/kbf/g/inst/PrKB11417

You are effectively running the "SMP kernel" (hugemem is a completely
different beast). I had a 32GB i386 system. It was a really, really
fun system to play with, and its never-ending list of bugs helped keep
me employed for several years. You don't want to unnecessarily inflict
that pain on yourself, really.
Re: [RFC] Reproducible OOM with partial workaround
Dear Dave,

> Your configuration has never worked. This isn't a regression ...
> ... does not mean that we expect it to work.

Do you mean that CONFIG_HIGHMEM64G is deprecated, should not be used;
that all development is for 64-bit only?

> ... 64-bit kernels should basically be drop-in replacements ...

Will think about that. I know all my servers are 64-bit capable, will
need to check all my desktops.

---

I find it puzzling that there seems to be a sharp cutoff at 32GB RAM,
no problem under but OOM just over; whereas I would have expected
lowmem starvation to be gradual, with OOM occurring much sooner with
64GB than with 34GB. Also, the kernel seems capable of reclaiming
lowmem, so I wonder why does that fail just over the 32GB threshold.
(Obviously I have no idea what I am talking about.)

---

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia
Re: [RFC] Reproducible OOM with partial workaround
On 01/10/2013 01:58 PM, paul.sz...@sydney.edu.au wrote:
> I developed a workaround patch for this particular OOM demo, dropping
> filesystem caches when about to exhaust lowmem. However, subsequently
> I observed OOM when running many processes (as yet I do not have an
> easy-to-reproduce demo of this); so as I suspected, the essence of the
> problem is not with FS caches.
>
> Could you please help in finding the cause of this OOM bug?

As was mentioned in the bug, your 32GB of physical memory only ends up
giving ~900MB of low memory to the kernel. Of that, around 600MB is
used for "mem_map[]", leaving only about 300MB available to the kernel
for *ALL* of its allocations at runtime.

Your configuration has never worked. This isn't a regression, it's
simply something that we know never worked in Linux, and it's a very
hard problem to solve. One Linux vendor (at least) went to a huge
amount of trouble to develop, ship, and support a kernel that handled
large 32-bit machines, but it was never merged upstream and work
stopped on it when such machines became rare beasts:

http://lwn.net/Articles/39925/

I believe just about any Linux vendor would call your configuration
"unsupported". Just because the kernel can boot does not mean that we
expect it to work.

It's possible that some tweaks of the vm knobs (like lowmem_reserve)
could help you here. But, really, you don't want to run a 32-bit
kernel on such a large machine. Very, very few folks are running
32-bit kernels on these systems and you're likely to keep running in to
bugs because this is such a rare configuration. We've been very
careful to ensure that 64-bit kernels should basically be drop-in
replacements for 32-bit ones. You can keep userspace 100% 32-bit, and
just have a 64-bit kernel.

If you're really set on staying 32-bit, I might have a NUMA-Q I can
give you. ;)