Re: CFS scheduler unfairly prefers pinned tasks
I wrote:

> The Linux CFS scheduler prefers pinned tasks and unfairly gives more
> CPU time to tasks that have set CPU affinity. ...

I believe I have now solved the problem, simply by setting:

  for n in /proc/sys/kernel/sched_domain/cpu*/domain0/min_interval; do echo 0 > $n; done
  for n in /proc/sys/kernel/sched_domain/cpu*/domain0/max_interval; do echo 1 > $n; done

Testing with real-life jobs, I found I needed min_- and max_interval for
domain1 also, and a couple of other non-default values, so:

  for n in /proc/sys/kernel/sched_domain/cpu*/dom*/min_interval; do echo 0 > $n; done
  for n in /proc/sys/kernel/sched_domain/cpu*/dom*/max_interval; do echo 1 > $n; done
  echo 10 > /proc/sys/kernel/sched_latency_ns
  echo 10 > /proc/sys/kernel/sched_min_granularity_ns
  echo 1  > /proc/sys/kernel/sched_wakeup_granularity_ns

and then things seem fair and my users are happy.

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
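The loops above can be wrapped in a small helper so they can be tried
against a copy of the tree before touching a live box. This is a sketch,
not from the original mail: the function name and the idea of passing
the tree root as a parameter are my additions; on a real system the root
is /proc/sys/kernel/sched_domain and the writes need root.

```shell
#!/bin/sh
# Sketch of the interval settings above as a reusable function.  The
# root of the sched_domain tree is a parameter so the loops can be
# exercised against a copy; on a real system it is
# /proc/sys/kernel/sched_domain (writable by root only).
set_sched_intervals() {
    root="$1"
    for n in "$root"/cpu*/dom*/min_interval; do
        [ -f "$n" ] && echo 0 > "$n"
    done
    for n in "$root"/cpu*/dom*/max_interval; do
        [ -f "$n" ] && echo 1 > "$n"
    done
    return 0
}

# Usage on a live box (as root):
#   set_sched_intervals /proc/sys/kernel/sched_domain
```

The `[ -f ... ]` guard simply skips the unexpanded glob when a machine
has no such files (e.g. a kernel without SCHED_DEBUG).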
Re: [patch] sched: disable task group re-weighting on the desktop
Dear Mike,

Did you check whether setting min_- and max_interval e.g. as per
  https://lkml.org/lkml/2015/10/11/34
would help with your issue (instead of your "horrible gs destroying"
patch)?

Cheers, Paul
Re: CFS scheduler unfairly prefers pinned tasks
I wrote:

> The Linux CFS scheduler prefers pinned tasks and unfairly gives more
> CPU time to tasks that have set CPU affinity.

I believe I have now solved the problem, simply by setting:

  for n in /proc/sys/kernel/sched_domain/cpu*/domain0/min_interval; do echo 0 > $n; done
  for n in /proc/sys/kernel/sched_domain/cpu*/domain0/max_interval; do echo 1 > $n; done

I am not sure what the domain1 values would be for (which I see exist on
my 4*E5-4627v2 server). So far I do not see any negative effects of
using these (extreme?) settings. (Explanation of what these things are
meant for, or pointers to documentation, would be appreciated.)

---

Thanks for the insightful discussion. (Scary, isn't it?)

Thanks, Paul
Re: [patch] sched: disable task group re-weighting on the desktop
Dear Mike,

> ... so yes, un-related.

Thanks for clarifying.

> I haven't seen the problem you reported. ...

You mean you chose not to reproduce: you persisted in pinning your
perts, whereas the problem was stated with un-pinned perts (and pinned
oinks). But that is OK... others did reproduce, and anyway I believe I
have now fixed my problem. (Solution in that "other" email thread.)

Cheers, Paul
Re: [patch] sched: disable task group re-weighting on the desktop
Dear Mike,

You CCed me on this patch. Is that because you expect this to solve "my"
problem also? You had some measurements of many oinks vs many perts or
vs "desktop", but not many oinks vs 1 or 2 perts as per my "complaint".
You also changed the subject line, so maybe this is all un-related.

Thanks, Paul
Re: CFS scheduler unfairly prefers pinned tasks
Dear Mike,

>>> I see a fairness issue ... but one opposite to your complaint.
>> Why is that opposite? ...
>
> Well, not exactly opposite, only opposite in that the one pert task
> also receives MORE than its fair share when unpinned. Two 100% hogs
> sharing one CPU should each get 50% of that CPU. ...

But you are using CGROUPs, grouping all the oinks into one group and the
one pert into another: requesting that each group get the same total
CPU. Since pert has one process only, the most it can get is 100% (not
400%), and it is quite OK for the oinks together to get 700%.

> IFF ... massively parallel and synchronized ...

You would be making the assumption that you had the machine to yourself:
that might be the wrong thing to assume.

>> Good to see that you agree ...
> Weeell, we've disagreed on pretty much everything ...

Sorry, I disagree: we do agree on the essence. :-)

Cheers, Paul
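The grouping described above (all oinks in one group, the lone pert in
another, each group asking for the same total CPU) can be sketched with
the cgroup-v1 cpu controller. This is my illustration, not Mike's actual
setup: the group names are made up, and the controller root is passed as
a parameter so the sketch can be dry-run outside /sys/fs/cgroup.

```shell
#!/bin/sh
# Sketch: create two cpu cgroups with equal cpu.shares (1024 is the
# cgroup-v1 default weight).  Equal shares request equal *group* totals,
# which is why 8 oinks vs 1 pert cannot give both groups 400%: the one
# pert can never use more than one CPU.
make_equal_groups() {
    root="$1"    # /sys/fs/cgroup/cpu on a real box (needs root)
    for g in oinks pert; do
        mkdir -p "$root/$g"
        echo 1024 > "$root/$g/cpu.shares"
    done
}

# On a live box one would then move each pid into its group, e.g.:
#   echo "$OINK_PID" > /sys/fs/cgroup/cpu/oinks/tasks   # repeat per oink
#   echo "$PERT_PID" > /sys/fs/cgroup/cpu/pert/tasks
```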
Re: CFS scheduler unfairly prefers pinned tasks
Dear Mike,

> I see a fairness issue ... but one opposite to your complaint.

Why is that opposite? I think it would be fair for the one pert process
to get 100% CPU; the many oink processes can get everything else. That
one oink sits at a lowly 10% (when the others are at 100%) is of no
consequence.

What happens when you un-pin pert: does it get 100%? What if you run two
perts? Have you reproduced my observations?

---

Good to see that you agree on the fairness issue... it MUST be fixed!
CFS might be wrong or wasteful, but never unfair.

Cheers, Paul
Re: CFS scheduler unfairly prefers pinned tasks
Dear Mike,

>> ... the CFS is meant to be fair, using things like vruntime to
>> preempt, and throttling. Why are those pinned tasks not preempted or
>> throttled?
>
> Imagine you own a 8192 CPU box for a moment, all CPUs having one
> pinned task, plus one extra unpinned task, and ponder what would have
> to happen in order to meet your utilization expectation. ...

Sorry, but the kernel contradicts that. As per my original report,
things are "fair" in the case of:

 - with CGROUP controls and the two kinds of processes run by different
   users, when there is just one un-pinned process

and that is so on my quad-core i5-3470 baby and on my 32-core
4*E5-4627v2 server (and everywhere that I tested). The kernel is smart
and gets it right for one un-pinned process: why not for two?

Now re-testing further (on some machines, with CGROUP): on the i5-3470
things are still fair with one un-pinned process (and become un-fair
with two); on the 4*E5-4627v2 they are still fair with 4 un-pinned
(becoming un-fair with 5). Does this suggest that the kernel does things
right within each physical CPU, but breaks across several (or the exact
contrary)? Maybe not: on a 2*E5530 machine, things are fair with just
one un-pinned process and un-fair with 2 already.

> What you're seeing is not a bug. No task can occupy more than one CPU
> at a time, making space reservation on multiple CPUs a very bad idea.

I agree that pinning may be bad... but then should not the kernel
penalize the badly pinned processes?

Cheers, Paul
Re: CFS scheduler unfairly prefers pinned tasks
Dear Mike,

>> ... CFS ... unfairly gives more CPU time to [pinned] tasks ...
>
> If they can all migrate, load balancing can move any of them to try to
> fix the permanent imbalance, so they'll all bounce about sharing a CPU
> with some other hog, and it all kinda sorta works out.
>
> When most are pinned, to make it work out long term you'd have to be
> short term unfair, walking the unpinned minority around the box in a
> carefully orchestrated dance... and have omniscient powers that assure
> that none of the tasks you're trying to equalize is gonna do something
> rude like leave, sleep, fork or whatever, and muck up the grand plan.

Could not your argument be turned around: for a pinned task it is harder
to find an idle CPU, so it should get less time?

But really... those pinned tasks do not hog the CPU forever. Whatever
kicks them off: could not that be done just a little earlier?

And further... the CFS is meant to be fair, using things like vruntime
to preempt, and throttling. Why are those pinned tasks not preempted or
throttled?

Thanks, Paul
CFS scheduler unfairly prefers pinned tasks
The Linux CFS scheduler prefers pinned tasks and unfairly gives more CPU
time to tasks that have set CPU affinity. This effect is observed with
or without CGROUP controls.

To demonstrate: on an otherwise idle machine, as some user run several
pinned processes, one for each CPU present in the system, e.g. for a
quad-core non-HyperThreaded machine:

  taskset -c 0 perl -e 'while(1){1}' &
  taskset -c 1 perl -e 'while(1){1}' &
  taskset -c 2 perl -e 'while(1){1}' &
  taskset -c 3 perl -e 'while(1){1}' &

and (as that same or some other user) run some without pinning:

  perl -e 'while(1){1}' &
  perl -e 'while(1){1}' &

and use e.g. top to observe that the pinned processes get more CPU time
than "fair". Fairness is obtained when either:

 - there are as many un-pinned processes as CPUs; or
 - with CGROUP controls and the two kinds of processes run by different
   users, when there is just one un-pinned process; or
 - the pinning is turned off for these processes (or they are started
   without it).

Any insight is welcome!

---

I would appreciate replies direct to me as I am not subscribed to the
linux-kernel mailing list (but will try to watch the archives).

This bug is also reported to Debian, please see
  http://bugs.debian.org/800945
I use Debian with the 3.16 kernel, and have not yet tried 4.* kernels.

Thanks, Paul
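The demonstration above can be scripted end-to-end at a smaller time
scale, and ps used instead of top to put numbers on the shares. This is
my sketch, not part of the report: a plain shell busy loop stands in for
the perl one-liner, and the 5-second sample is far shorter than a real
test should be. It assumes taskset (util-linux) is installed.

```shell
#!/bin/sh
# Scaled-down, self-contained sketch of the demonstration: pin one hog
# per CPU, add one un-pinned hog, let usage accumulate briefly, then
# compare CPU shares.  psr is the CPU each process last ran on, pcpu
# its lifetime CPU share; on an affected kernel the pinned hogs sit
# near 100 while the un-pinned one falls well below.
ncpu=$(getconf _NPROCESSORS_ONLN)
pids=""
i=0
while [ "$i" -lt "$ncpu" ]; do
    taskset -c "$i" sh -c 'while :; do :; done' &
    pids="$pids${pids:+,}$!"
    i=$((i + 1))
done
sh -c 'while :; do :; done' &   # the un-pinned competitor
pids="$pids,$!"
sleep 5                         # let CPU usage accumulate
out=$(ps -o pid,psr,pcpu,args -p "$pids" --sort=-pcpu)
echo "$out"
kill $(printf '%s' "$pids" | tr ',' ' ')
```

A real run would sample for minutes rather than seconds, since pcpu is
averaged over each process's lifetime.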
Re: [RFC] Reproducible OOM with just a few sleeps
Dear Simon,

> So if he config sparse memory, the issue can be solved I think.

In my config file I have:

  CONFIG_HAVE_SPARSE_IRQ=y
  CONFIG_SPARSE_IRQ=y
  CONFIG_ARCH_SPARSEMEM_ENABLE=y
  # CONFIG_SPARSEMEM_MANUAL is not set
  CONFIG_SPARSEMEM_STATIC=y
  # CONFIG_INPUT_SPARSEKMAP is not set
  # CONFIG_SPARSE_RCU_POINTER is not set

Is that sufficient for sparse memory, or should I try something else? Or
maybe you meant that some kernel source patches might be possible in the
sparse memory code?

Thanks, Paul
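The relevant lines can be pulled out of a config file mechanically; only
the memory-model options (CONFIG_FLATMEM / CONFIG_SPARSEMEM /
CONFIG_DISCONTIGMEM) decide the question, while the IRQ and input
options that happen to contain "SPARSE" are unrelated. A sketch (the
function name is mine; the default path is the usual Debian location for
the running kernel's config):

```shell
#!/bin/sh
# Sketch: list the memory-model options from a kernel config file.
# CONFIG_SPARSEMEM=y would indicate the sparse memory model is in use;
# CONFIG_FLATMEM=y means it is not.
memory_model_options() {
    conf="${1:-/boot/config-$(uname -r)}"
    grep -E '^(# )?CONFIG_(FLATMEM|SPARSEMEM|DISCONTIGMEM)' "$conf"
}

# e.g.: memory_model_options
#   or: memory_model_options /boot/config-3.16.0-4-686-pae
```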
Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps
Dear Ben,

>>>> PAE is broken for any amount of RAM.
>>> No it isn't.
>> Could I please ask you to expand on that?
>
> I already did, a few messages back.

OK, thanks. Noting however that even fewer messages back, I said:

> ... PAE with any RAM fails the "sleep test":
>   n=0; while [ $n -lt 33000 ]; do sleep 600 & ((n=n+1)); done

and somewhere also said that non-PAE passes. Does not that prove that
PAE is broken?

Cheers, Paul
Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps
Dear Ben,

>> PAE is broken for any amount of RAM.
>
> No it isn't.

Could I please ask you to expand on that?

Thanks, Paul
Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps
Dear Ben,

> Based on your experience I might propose to change the automatic
> kernel selection for i386 so that we use 'amd64' on a system with
> >16GB RAM and a capable processor.

Don't you mean change to amd64 for >4GB (or any RAM), never using PAE?
PAE is broken for any amount of RAM.

More precisely, PAE with any RAM fails the "sleep test":

  n=0; while [ $n -lt 33000 ]; do sleep 600 & ((n=n+1)); done

and with >32GB fails the "write test":

  n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; ((n=n+1)); done

Why do you think 16GB is significant?

Thanks, Paul
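The "sleep test" above can be wrapped so the count and duration are
arguments, letting one dry-run it at a gentle scale before attempting
the full version. The function form is my addition; the report's actual
test is 33000 sleeps of 600 seconds, which needs a raised process limit
(ulimit -u) and, per the report, OOMs a PAE kernel.

```shell
#!/bin/sh
# Sketch of the "sleep test" with count and duration as arguments.
# Each background sleep costs a task and its kernel stack, which is
# what exhausts PAE lowmem at the full scale.
sleep_test() {
    count="$1"; secs="$2"
    n=0
    while [ "$n" -lt "$count" ]; do
        sleep "$secs" &
        n=$((n + 1))
    done
    echo "spawned $count sleepers"
    wait
}

# Full-scale version from the report (raise ulimit -u first):
#   sleep_test 33000 600
```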
Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps
Dear Ben,

Thanks for the repeated explanations.

> PAE was a stop-gap ...
> ... [PAE] completely untenable.

Is this a good time to withdraw PAE, to tell the world that it does not
work? Maybe you should have had such comments in the code.

It seems that amd64 now works "somewhat": on Debian the linux-image
package is tricky to install, and linux-headers is even harder. Is there
work being done to make this smoother?

---

I am still not convinced by the "lowmem starvation" explanation, because
then PAE should have worked fine on my 3GB machine; maybe I should also
try PAE on my 512MB laptop. - Though, what do I know: I have not yet
found the buggy line of code I believe is lurking there...

Thanks, Paul
Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps
Dear Ben, Thanks for the repeated explanations. PAE was a stop-gap ... ... [PAE] completely untenable. Is this a good time to withdraw PAE, to tell the world that it does not work? Maybe you should have had such comments in the code. Seems that amd64 now works somewhat: on Debian the linux-image package is tricky to install, and linux-headers is even harder. Is there work being done to make this smoother? --- I am still not convinced by the lowmem starvation explanation: because then PAE should have worked fine on my 3GB machine; maybe I should also try PAE on my 512MB laptop. - Though, what do I know, have not yet found the buggy line of code I believe is lurking there... Thanks, Paul Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics University of SydneyAustralia -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps
Dear Ben, Based on your experience I might propose to change the automatic kernel selection for i386 so that we use 'amd64' on a system with 16GB RAM and a capable processor. Don't you mean change to amd64 for 4GB (or any RAM), never using PAE? PAE is broken for any amount of RAM. More precisely, PAE with any RAM fails the sleep test: n=0; while [ $n -lt 33000 ]; do sleep 600 ((n=n+1)); done and with 32GB fails the write test: n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; ((n=n+1)); done Why do you think 16GB is significant? Thanks, Paul Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics University of SydneyAustralia -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps
Dear Ben, PAE is broken for any amount of RAM. No it isn't. Could I please ask you to expand on that? Thanks, Paul Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics University of SydneyAustralia -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps
Dear Ben,
>>>> PAE is broken for any amount of RAM.
>>> No it isn't.
>> Could I please ask you to expand on that?
> I already did, a few messages back.
OK, thanks. Noting however that fewer messages back, I said:
> ... PAE with any RAM fails the sleep test:
>   n=0; while [ $n -lt 33000 ]; do sleep 600 & ((n=n+1)); done
and somewhere also said that non-PAE passes. Does not that prove that PAE is broken? Cheers, Paul
Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia
Re: [RFC] Reproducible OOM with just a few sleeps
Dear Pavel and Dave,
> The assertion was that 4GB with no PAE passed a forkbomb test (ooming)
> while 4GB of RAM with PAE hung, thus _PAE_ is broken.
Yes, PAE is broken. Still, maybe the above needs slight correction: non-PAE HIGHMEM4G passed the "sleep test": no OOM, nothing unexpected; whereas PAE OOMed then hung (tested with various RAM sizes from 3GB to 64GB). The feeling I get is that amd64 is proposed as a drop-in replacement for PAE, that support and development of PAE are gone, that PAE is dead. Cheers, Paul
Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia
Re: Bug#695182: [PATCH] Subtract min_free_kbytes from dirtyable memory
Dear Jonathan,
>> If you can identify where it was fixed then your patch for older
>> versions should go to stable with a reference to the upstream fix (see
>> Documentation/stable_kernel_rules.txt).
>
> How about this patch?
>
> It was applied in mainline during the 3.3 merge window, so kernels
> newer than 3.2.y shouldn't need it.
> ...
> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d upstream.
> ...
Yes, I believe that is the correct patch, surely better than my simple subtraction of min_free_kbytes. Noting that this does not "solve" all problems: the latest 3.8 kernel still crashes with OOM: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1098961/comments/18 Thanks, Paul
Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia
Re: [PATCH] Negative (setpoint-dirty) in bdi_position_ratio()
Dear Fengguang (et al),
> There are 260MB reclaimable slab pages in the normal zone, however we
> somehow failed to reclaim them. ...
Could the problem be that without CONFIG_NUMA, zone_reclaim_mode stays at zero and anyway zone_reclaim() does nothing, in include/linux/swap.h? Though... there is no CONFIG_NUMA nor /proc/sys/vm/zone_reclaim_mode in the Ubuntu non-PAE "plain" HIGHMEM4G kernel, and still it handles the "sleep test" just fine. Where does reclaiming happen (or where is it meant to happen)? Thanks, Paul
Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia
Re: Bug#695182: [PATCH] Subtract min_free_kbytes from dirtyable memory
Dear Ben,
> ... the mm maintainers are probably much better placed ...
Exactly. Now I wonder: are you one of them? Thanks, Paul
Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia
Re: Bug#695182: [PATCH] Subtract min_free_kbytes from dirtyable memory
Dear Ben,
> If you can identify where it was fixed then ...
Sorry, I cannot do that: I have no idea where kernel changelogs are kept. I am happy to do some work; please do not call me lazy. Cheers, Paul
Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia
Re: [PATCH] Subtract min_free_kbytes from dirtyable memory
Dear Minchan,
> So what's the effect for user?
> ...
> It seems you saw old kernel.
> ...
> Current kernel includes ...
> So I think we don't need this patch.
As I understand it now: my patch is "right" and needed for older kernels; for newer kernels, the issue has been fixed in equivalent ways; it was an oversight that the change was not backported; and any justification you need, you can get from those "later better" patches. I asked:
> A question: what is the use or significance of vm_highmem_is_dirtyable?
> It seems odd that it would be used in setting limits or thresholds, but
> not used in decisions where to put dirty things. Is that so, is that as
> it should be? What is the recommended setting of highmem_is_dirtyable?
The silence is deafening. I guess highmem_is_dirtyable is an aberration. Thanks, Paul
Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia
Re: [PATCH] Negative (setpoint-dirty) in bdi_position_ratio()
Dear Fengguang,
> There are 260MB reclaimable slab pages in the normal zone ...
Marked "all_unreclaimable? yes": is that wrong? Question asked also in: http://marc.info/?l=linux-mm&m=135873981326767&w=2
> ... however we somehow failed to reclaim them. ...
I made a patch that would do a drop_caches at that point, please see:
http://bugs.debian.org/695182
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;filename=drop_caches.patch;att=1;bug=695182
http://marc.info/?l=linux-mm&m=135785511125549&w=2
and that successfully avoided OOM when writing files. But the drop_caches patch did not protect against the "sleep test".
> ... What's your filesystem and the content of /proc/slabinfo?
Filesystem is ext3. See output of slabinfo in the Debian bug above or in http://marc.info/?l=linux-mm&m=135796154427544&w=2 Thanks, Paul
Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia
Re: [PATCH] Negative (setpoint-dirty) in bdi_position_ratio()
Dear Jan,
> I think he found the culprit of the problem being min_free_kbytes was not
> properly reflected in the dirty throttling. ... Paul please correct me
> if I'm wrong.
Sorry, but I have to correct you. I noticed and patched/corrected two problems: one with (setpoint-dirty) in bdi_position_ratio(), another with min_free_kbytes not subtracted from dirtyable memory. Fixing those problems, singly or in combination, did not help in avoiding OOM: running
  n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; ((n=$n+1)); done
still produces an OOM after a few files written (on a PAE machine with over 32GB RAM). Also, a quite similar OOM may be produced on any PAE machine with
  n=0; while [ $n -lt 33000 ]; do sleep 600 & ((n=n+1)); done
This was tested on machines with as little as 3GB RAM ... and curiously the same machine with a "plain" (not PAE but HIGHMEM4G) kernel handles the same "sleep test" without any problems. (Thus I now think that the remaining bug is not with writeback.) Cheers, Paul
Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia
Re: [PATCH] Negative (setpoint-dirty) in bdi_position_ratio()
Dear Fengguang, > Or more simple, you may show us the OOM dmesg which will contain the > number of dirty pages. ... Do you mean kern.log lines like: [ 744.754199] bash invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0 [ 744.754202] bash cpuset=/ mems_allowed=0 [ 744.754204] Pid: 3836, comm: bash Not tainted 3.2.0-4-686-pae #1 Debian 3.2.32-1 ... [ 744.754354] active_anon:13497 inactive_anon:129 isolated_anon:0 [ 744.754354] active_file:2664 inactive_file:4144756 isolated_file:0 [ 744.754355] unevictable:0 dirty:510 writeback:0 unstable:0 [ 744.754356] free:11867217 slab_reclaimable:68289 slab_unreclaimable:7204 [ 744.754356] mapped:8066 shmem:250 pagetables:519 bounce:0 [ 744.754361] DMA free:4260kB min:784kB low:980kB high:1176kB active_anon:0kB inactive_anon:0kB active_file:4kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15784kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11628kB slab_unreclaimable:4kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:499 all_unreclaimable? yes [ 744.754364] lowmem_reserve[]: 0 867 62932 62932 [ 744.754369] Normal free:43788kB min:44112kB low:55140kB high:66168kB active_anon:0kB inactive_anon:0kB active_file:912kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:887976kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:261528kB slab_unreclaimable:28812kB kernel_stack:3096kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:16060 all_unreclaimable? 
yes [ 744.754372] lowmem_reserve[]: 0 0 496525 496525 [ 744.754377] HighMem free:47420820kB min:512kB low:789888kB high:1579264kB active_anon:53988kB inactive_anon:516kB active_file:9740kB inactive_file:16579320kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:63555300kB mlocked:0kB dirty:2040kB writeback:0kB mapped:32260kB shmem:1000kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:2076kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no [ 744.754380] lowmem_reserve[]: 0 0 0 0 [ 744.754381] DMA: 445*4kB 36*8kB 3*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 4260kB [ 744.754386] Normal: 1132*4kB 620*8kB 237*16kB 70*32kB 38*64kB 26*128kB 20*256kB 14*512kB 4*1024kB 3*2048kB 0*4096kB = 43808kB [ 744.754390] HighMem: 226*4kB 242*8kB 155*16kB 66*32kB 10*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 2*2048kB 11574*4096kB = 47420680kB [ 744.754395] 4148173 total pagecache pages [ 744.754396] 0 pages in swap cache [ 744.754397] Swap cache stats: add 0, delete 0, find 0/0 [ 744.754397] Free swap = 0kB [ 744.754398] Total swap = 0kB [ 744.900649] 16777200 pages RAM [ 744.900650] 16549378 pages HighMem [ 744.900651] 664304 pages reserved [ 744.900652] 4162276 pages shared [ 744.900653] 104263 pages non-shared ? (The above and similar were reported to http://bugs.debian.org/695182 .) Do you want me to log and report something else? I believe the above crash may be provoked simply by running: n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; (( n = $n + 1 )); done & on any PAE machine with over 32GB RAM. Oddly the problem does not seem to occur when using mem=32g or lower on the kernel boot line (or on machines with less than 32GB RAM). 
Cheers, Paul
Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia
Re: [PATCH] Subtract min_free_kbytes from dirtyable memory
Dear Minchan,
> So what's the effect for user?
Sorry, I have no idea. The kernel seems to work well without this patch; or in fact not so well, with PAE crashing with spurious OOM. In my fruitless efforts to avoid OOM by sensible choices of sysctl tunables, I noticed that maybe the treatment of min_free_kbytes was not right. Getting this right did not help in avoiding OOM.
> It seems you saw old kernel.
Yes, I have Debian on my machines. :-)
> Current kernel includes following logic.
>
> static unsigned long global_dirtyable_memory(void)
> {
> 	unsigned long x;
>
> 	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> 	x -= min(x, dirty_balance_reserve);
>
> 	if (!vm_highmem_is_dirtyable)
> 		x -= highmem_dirtyable_memory(x);
>
> 	return x + 1;	/* Ensure that we never return 0 */
> }
>
> And dirty_balance_reserve already includes high_wmark_pages.
> Look at calculate_totalreserve_pages.
>
> So I think we don't need this patch.
> Thanks.
Presumably, dirty_balance_reserve takes min_free_kbytes into account? Then I agree that this patch is not needed on those newer kernels. A question: what is the use or significance of vm_highmem_is_dirtyable? It seems odd that it would be used in setting limits or thresholds, but not used in decisions where to put dirty things. Is that so, is that as it should be? What is the recommended setting of highmem_is_dirtyable? Thanks, Paul
Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia
[PATCH] Subtract min_free_kbytes from dirtyable memory
When calculating the amount of dirtyable memory, min_free_kbytes should be subtracted because it is not intended for dirty pages. Using an "extern int" because that is the only interface to some such sysctl values. (This patch does not solve the PAE OOM issue.)

Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia

Reported-by: Paul Szabo
Reference: http://bugs.debian.org/695182
Signed-off-by: Paul Szabo

--- mm/page-writeback.c.old	2012-12-06 22:20:40.0 +1100
+++ mm/page-writeback.c	2013-01-21 13:57:05.0 +1100
@@ -343,12 +343,16 @@
 unsigned long determine_dirtyable_memory(void)
 {
 	unsigned long x;
+	extern int min_free_kbytes;
 
 	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
 
+	/* Subtract min_free_kbytes */
+	x -= min(x, min_free_kbytes >> (PAGE_SHIFT - 10));
+
 	return x + 1;	/* Ensure that we never return 0 */
 }
[PATCH] MAX_PAUSE to be at least 4
Ensure MAX_PAUSE is 4 or larger, so that the limits in
	return clamp_val(t, 4, MAX_PAUSE);
(the only use of it) are not back-to-front. (This patch does not solve the PAE OOM issue.)

Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics, University of Sydney, Australia

Reported-by: Paul Szabo
Reference: http://bugs.debian.org/695182
Signed-off-by: Paul Szabo

--- mm/page-writeback.c.old	2012-12-06 22:20:40.0 +1100
+++ mm/page-writeback.c	2013-01-21 13:57:05.0 +1100
@@ -39,7 +39,7 @@
 /*
  * Sleep at most 200ms at a time in balance_dirty_pages().
  */
-#define MAX_PAUSE	max(HZ/5, 1)
+#define MAX_PAUSE	max(HZ/5, 4)
 
 /*
  * Estimate write bandwidth at 200ms intervals.
[PATCH] MAX_PAUSE to be at least 4
Ensure MAX_PAUSE is 4 or larger, so limits in return clamp_val(t, 4, MAX_PAUSE); (the only use of it) are not back-to-front. (This patch does not solve the PAE OOM issue.) Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics University of SydneyAustralia Reported-by: Paul Szabo p...@maths.usyd.edu.au Reference: http://bugs.debian.org/695182 Signed-off-by: Paul Szabo p...@maths.usyd.edu.au --- mm/page-writeback.c.old 2012-12-06 22:20:40.0 +1100 +++ mm/page-writeback.c 2013-01-21 13:57:05.0 +1100 @@ -39,7 +39,7 @@ /* * Sleep at most 200ms at a time in balance_dirty_pages(). */ -#define MAX_PAUSE max(HZ/5, 1) +#define MAX_PAUSE max(HZ/5, 4) /* * Estimate write bandwidth at 200ms intervals. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Subtract min_free_kbytes from dirtyable memory
When calculating the amount of dirtyable memory, min_free_kbytes should be subtracted because it is not intended for dirty pages. Using an extern int because that is the only interface to some such sysctl values. (This patch does not solve the PAE OOM issue.)

Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics University of Sydney, Australia

Reported-by: Paul Szabo p...@maths.usyd.edu.au
Reference: http://bugs.debian.org/695182
Signed-off-by: Paul Szabo p...@maths.usyd.edu.au

--- mm/page-writeback.c.old	2012-12-06 22:20:40.0 +1100
+++ mm/page-writeback.c	2013-01-21 13:57:05.0 +1100
@@ -343,12 +343,16 @@
 unsigned long determine_dirtyable_memory(void)
 {
 	unsigned long x;
+	extern int min_free_kbytes;

 	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();

 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);

+	/* Subtract min_free_kbytes */
+	x -= min(x, min_free_kbytes >> (PAGE_SHIFT - 10));
+
 	return x + 1;	/* Ensure that we never return 0 */
 }
[PATCH] Negative (setpoint-dirty) in bdi_position_ratio()
In bdi_position_ratio(), get the difference (setpoint-dirty) right even when negative. Both setpoint and dirty are unsigned long; the difference was zero-padded, thus wrongly sign-extended to s64. This issue affects all 32-bit architectures; it does not affect 64-bit architectures, where long and s64 are equivalent.

In this function, dirty is between freerun and limit, and the pseudo-float x is in [-1,1], expected to be negative about half the time. With zero-padding, instead of a small negative x we obtained a large positive one, so bdi_position_ratio() returned garbage.

Casting the difference to s64 also prevents overflow in the left-shift; though normally these numbers are small and I never observed a 32-bit overflow there.

(This patch does not solve the PAE OOM issue.)

Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics University of Sydney, Australia

Reported-by: Paul Szabo
Reference: http://bugs.debian.org/695182
Signed-off-by: Paul Szabo

--- mm/page-writeback.c.old	2012-12-06 22:20:40.0 +1100
+++ mm/page-writeback.c	2013-01-20 07:47:55.0 +1100
@@ -559,7 +559,7 @@ static unsigned long bdi_position_ratio(
 	 * => fast response on large errors; small oscillation near setpoint
 	 */
 	setpoint = (freerun + limit) / 2;
-	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
 		    limit - setpoint + 1);
 	pos_ratio = x;
 	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
Re: [RFC] Reproducible OOM with just a few sleeps
Dear Dave,

>> On my large machine, 'free' fails to show about 2GB memory ...
> You probably have a memory hole. ...
> The e820 map (during early boot in dmesg) or /proc/iomem will let you
> locate your memory holes.

Now that my machine is running an amd64 kernel, 'free' shows total Mem 65854128 (up from 64447796 with the PAE kernel), and I do not see much change in the /proc/iomem output (below). Is that as it should be?

Thanks, Paul

Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics University of Sydney, Australia

---
root@zeno:~# uname -a
Linux zeno.maths.usyd.edu.au 3.2.35-pk06.12-amd64 #2 SMP Thu Jan 17 13:19:53 EST 2013 x86_64 GNU/Linux
root@zeno:~# free
             total       used       free     shared    buffers     cached
Mem:      65854128    1591704   64262424          0     227036     175620
-/+ buffers/cache:    1189048   64665080
Swap:    195312636          0  195312636
root@zeno:~# cat /proc/iomem
[ /proc/iomem listing elided: the hex address ranges did not survive archiving; it showed the usual System RAM, reserved, ACPI, PCI MMCONFIG, PCI bus and device regions (igb, ahci, ehci_hcd, IOAPIC, HPET, Local APIC) ]
root@zeno:~#
---
For comparison, output obtained (and reported previously) when the machine was running the PAE kernel:
root@zeno:~# cat /proc/iomem
[ /proc/iomem listing elided as above; it additionally showed a Video RAM area and different Kernel code/data/bss ranges ]
Re: [RFC] Reproducible OOM with just a few sleeps
Dear Dave,

>> ... What is unacceptable is that PAE crashes or freezes with OOM:
>> it should gracefully handle the issue. Noting that (for a machine
>> with 4GB or under) PAE fails where the HIGHMEM4G kernel succeeds ...
>
> You have found a delta, but you're not really making apples-to-apples
> comparisons. The page tables ...

I understand that the exact sizes of page tables are very important to developers. To the rest of us, all that matters is that the kernel moves them to highmem or swap or whatever, that it maybe emits some error message, but that it does not crash or freeze.

> There's probably a bug here. But, it's incredibly unlikely to be seen
> in practice on anything resembling a modern system. ...

Probably. I found the bug on a very modern and brand-new system, just trying to copy a few ISO image files and trying to log in a hundred students. My machine crashed under those very practical and normal circumstances. The demos with dd and sleep were just that: easily reproducible demos.

> ... easily worked around by upgrading to a 64-bit kernel ...

Do you mean that PAE should never be used, and amd64 used instead?

> ... Raising the vm.min_free_kbytes sysctl (to perhaps 10x of
> its current value on your system) is likely to help the hangs too,
> although it will further "consume" lowmem.

I have tried that; it did not work. As you say, it is backward.

> ... for a bug with ... so many reasonable workarounds ...

Only one workaround was proposed: use amd64. PAE is buggy and useless, and should be deprecated and removed.

Cheers, Paul

Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics University of Sydney, Australia
Re: [RFC] Reproducible OOM with just a few sleeps
Dear Dave,

>> Seems that any i386 PAE machine will go OOM just by running a few
>> processes. To reproduce:
>> sh -c 'n=0; while [ $n -lt 1 ]; do sleep 600 & ((n=n+1)); done'
>> ...
> I think what you're seeing here is that, as the amount of total memory
> increases, the amount of lowmem available _decreases_ due to inflation
> of mem_map[] (and a few other more minor things). The number of sleeps
> you can do is bound by the number of processes, as you noticed from
> ulimit. Creating processes that don't use much memory eats a relatively
> large amount of low memory.
> This is a sad (and counterintuitive) fact: more RAM actually *CREATES*
> RAM bottlenecks on 32-bit systems.

I understand that more RAM leaves less lowmem. What is unacceptable is that PAE crashes or freezes with OOM: it should gracefully handle the issue. Noting that (for a machine with 4GB or under) PAE fails where the HIGHMEM4G kernel succeeds and survives.

>> On my large machine, 'free' fails to show about 2GB memory ...
> You probably have a memory hole. ...
> The e820 map (during early boot in dmesg) or /proc/iomem will let you
> locate your memory holes.

Thanks, that might explain it. Output of /proc/iomem below: sorry, I do not know how to interpret it.

Cheers, Paul

Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics University of Sydney, Australia

---
root@zeno:~# cat /proc/iomem
[ /proc/iomem listing elided: the hex address ranges did not survive archiving; it showed the usual System RAM, reserved, ACPI, Video RAM, PCI MMCONFIG, PCI bus and device regions (igb, ahci, ehci_hcd, IOAPIC, HPET, Local APIC) ]
root@zeno:~#
Re: [RFC] Reproducible OOM with just a few sleeps
The issue is a regression with PAE, reproduced and verified on Ubuntu, on my home PC with 3GB RAM.

My PC was running kernel linux-image-3.2.0-35-generic, so it showed:

psz@DellE520:~$ uname -a
Linux DellE520 3.2.0-35-generic #55-Ubuntu SMP Wed Dec 5 17:45:18 UTC 2012 i686 i686 i386 GNU/Linux
psz@DellE520:~$ free -l
             total       used       free     shared    buffers     cached
Mem:       3087972     692256    2395716          0      18276     427116
Low:        861464      71372     790092
High:      2226508     620884    1605624
-/+ buffers/cache:     246864    2841108
Swap:     20000920     258364   19742556

Then it handled the "sleep test"

bash -c 'n=0; while [ $n -lt 33000 ]; do sleep 600 & ((n=n+1)); ((m=n%500)); if [ $m -lt 1 ]; then echo -n "$n - "; date; free -l; sleep 1; fi; done'

just fine, stopped only by "max user processes" (the default setting of "ulimit -u 23964"); with that limit raised, it stopped when the machine ran out of PID space. There was no OOM.

Installing and running the PAE kernel, so it showed:

psz@DellE520:~$ uname -a
Linux DellE520 3.2.0-35-generic-pae #55-Ubuntu SMP Wed Dec 5 18:04:39 UTC 2012 i686 i686 i386 GNU/Linux
psz@DellE520:~$ free -l
             total       used       free     shared    buffers     cached
Mem:       3087620     681188    2406432          0     167332     352296
Low:        865208     214080     651128
High:      2222412     467108    1755304
-/+ buffers/cache:     161560    2926060
Swap:     20000920          0   20000920

and re-trying the "sleep test", it ran into OOM after 18000 or so sleeps and crashed/froze, so I had to press the POWER button to recover.

Cheers, Paul

Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics University of Sydney, Australia
[RFC] Reproducible OOM with just a few sleeps
Dear Linux-MM,

Seems that any i386 PAE machine will go OOM just by running a few processes. To reproduce:

sh -c 'n=0; while [ $n -lt 1 ]; do sleep 600 & ((n=n+1)); done'

My machine has 64GB RAM. With previous OOM episodes, it seemed that running (booting) it with mem=32G might avoid OOM; but an OOM was obtained just the same, and also with lower memory:

Memory       sleeps to OOM   free shows total
(mem=64G)         5300           64447796
mem=32G          10200           31155512
mem=16G          13400           14509364
mem=8G           14200            6186296
mem=6G           15200            4105532
mem=4G           16400            2041364

The machine does not run out of highmem, nor does it use any swap.

Comparing with my desktop PC: it has 4GB RAM installed, and free shows 3978592 total. Running the "sleep test", it simply froze after 16400 running... no response to ping; I will need to press the RESET button.

---
On my large machine, 'free' fails to show about 2GB memory, e.g. with mem=16G it shows:

root@zeno:~# free -l
             total       used       free     shared    buffers     cached
Mem:      14509364     435440   14073924          0       4068     111328
Low:        769044     120232     648812
High:     13740320     315208   13425112
-/+ buffers/cache:     320044   14189320
Swap:    134217724          0  134217724

---
Please let me know of any ideas, or if you want me to run some other test or want to see some other output.

Thanks, Paul

Paul Szabo p...@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics University of Sydney, Australia

---
Details for when my machine was running with 64GB RAM: in another window I was running "cat /proc/slabinfo; free -l" repeatedly, and the output of that (just before OOM) was:

+ cat /proc/slabinfo
slabinfo - version: 2.1
[ /proc/slabinfo listing elided: the columns were fused in archiving; it listed per-cache object counts and sizes (fuse_request, ntfs_* and nfs_* caches, isofs/fat, jbd2 and journal caches, ext2/3/4 caches, dquot, rpc_inode_cache, UDP and TCP socket caches, ...); the dump breaks off mid-listing ]
Re: [RFC] Reproducible OOM with partial workaround
Dear Andrew,

>>> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
>> Please see below ...
> ... Was this dump taken when the system was at or near oom?

No, that was a "quiescent" machine. Please see a just-before-OOM dump in
my next message (in a little while).

> Please send a copy of the oom-killer kernel message dump, if you still
> have one.

Please see one in next message, or in http://bugs.debian.org/695182

>> I tried setting dirty_ratio to "funny" values, that did not seem to
>> help.
> Did you try setting it as low as possible?

Probably. Maybe. Sorry, cannot say with certainty.

>> Did you notice my patch about bdi_position_ratio(), how it was
>> plain wrong half the time (for negative x)?
> Nope, please resend.

Quoting from
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;att=1;bug=695182 :

...
- In bdi_position_ratio() get the difference (setpoint - dirty) right
  even when it is negative, which happens often. Normally these numbers
  are "small" and even with the left-shift I never observed a 32-bit
  overflow. I believe it should be possible to re-write the whole
  function in 32-bit ints; maybe it is not worth the effort to make it
  "efficient"; seeing how this function was always wrong and we
  survived, it should simply be removed.
...

--- mm/page-writeback.c.old	2012-10-17 13:50:15.000000000 +1100
+++ mm/page-writeback.c	2013-01-06 21:54:59.000000000 +1100
[ Line numbers out because other patches not shown ]
...
@@ -559,7 +578,7 @@ static unsigned long bdi_position_ratio(
	 * => fast response on large errors; small oscillation near setpoint
	 */
	setpoint = (freerun + limit) / 2;
-	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
		    limit - setpoint + 1);
	pos_ratio = x;
	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
...
Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney, Australia
Re: [RFC] Reproducible OOM with partial workaround
Dear Andrew,

> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.

Please see below: I do not know what any of that means. This machine has
been running just fine, with all my users logging in here via XDMCP from
X-terminals, dozens logged in simultaneously. (But, I think I could make
it go OOM with more processes or logins.)

> If so, you *may* be able to work around this by setting
> /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
> amount of dirty pagecache around. Then, with luck, if we haven't
> broken the buffer_heads_over_limit logic in the past decade (we
> probably have), the VM should be able to reclaim those buffer_heads.

I tried setting dirty_ratio to "funny" values, that did not seem to
help. Did you notice my patch about bdi_position_ratio(), how it was
plain wrong half the time (for negative x)? Anyway that did not help.

> Alternatively, use a filesystem which doesn't attach buffer_heads to
> dirty pages. xfs or btrfs, perhaps.

Seems there is also a problem not related to filesystem... or rather,
the essence does not seem to be filesystem or caches. The filesystem
thing now seems OK with my patch doing drop_caches.
Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney, Australia

---

root@como:~# free -lm
             total       used       free     shared    buffers     cached
Mem:         62936       2317      60618          0         41        635
Low:           367        271         95
High:        62569       2045      60523
-/+ buffers/cache:        1640      61295
Swap:       131071          0     131071
root@como:~# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfs_inode_cache       5404   5404    584   28   4 : tunables 0 0 0 : slabdata   193   193   0
journal_handle        5440   5440     24  170   1 : tunables 0 0 0 : slabdata    32    32   0
journal_head         16768  16768     64   64   1 : tunables 0 0 0 : slabdata   262   262   0
revoke_record        20224  20224     16  256   1 : tunables 0 0 0 : slabdata    79    79   0
ext3_inode_cache     16531  19965    488   33   4 : tunables 0 0 0 : slabdata   605   605   0
dquot                  840    840    192   42   2 : tunables 0 0 0 : slabdata    20    20   0
rpc_inode_cache        144    144    448   36   4 : tunables 0 0 0 : slabdata     4     4   0
UDP                    896    896    576   28   4 : tunables 0 0 0 : slabdata    32    32   0
tw_sock_TCP           1344   1344    128   32   1 : tunables 0 0 0 : slabdata    42    42   0
[zero-count caches omitted; the archive garbled the columns and truncated the dump mid-line]
[RFC] Reproducible OOM with just a few sleeps
Dear Linux-MM,

Seems that any i386 PAE machine will go OOM just by running a few
processes. To reproduce:

  sh -c 'n=0; while [ $n -lt 1 ]; do sleep 600 & ((n=n+1)); done'

My machine has 64GB RAM. With previous OOM episodes, it seemed that
running (booting) it with mem=32G might avoid OOM; but an OOM was
obtained just the same, and also with lower memory:

  Memory       sleeps to OOM   free shows total
  (mem=64G)         5300           64447796
  mem=32G          10200           31155512
  mem=16G          13400           14509364
  mem=8G           14200            6186296
  mem=6G           15200            4105532
  mem=4G           16400            2041364

The machine does not run out of highmem, nor does it use any swap.

Comparing with my desktop PC: has 4GB RAM installed, free shows 3978592
total. Running the "sleep test", it simply froze after 16400 running...
no response to ping, will need to press the RESET button.

---

On my large machine, 'free' fails to show about 2GB memory, e.g. with
mem=16G it shows:

root@zeno:~# free -l
             total       used       free     shared    buffers     cached
Mem:      14509364     435440   14073924          0       4068     111328
Low:        769044     120232     648812
High:     13740320     315208   13425112
-/+ buffers/cache:     320044   14189320
Swap:    134217724          0  134217724

---

Please let me know of any ideas, or if you want me to run some other
test or want to see some other output.
Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney, Australia

---

Details for when my machine was running with 64GB RAM: In another window
I was running

  cat /proc/slabinfo; free -l

repeatedly, and output of that (just before OOM) was:

+ cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfs_inode_cache         28     28    584   28   4 : tunables 0 0 0 : slabdata     1     1   0
journal_handle        4080   4080     24  170   1 : tunables 0 0 0 : slabdata    24    24   0
journal_head          1024   1024     64   64   1 : tunables 0 0 0 : slabdata    16    16   0
revoke_record          768    768     16  256   1 : tunables 0 0 0 : slabdata     3     3   0
ext3_inode_cache      1467   2079    488   33   4 : tunables 0 0 0 : slabdata    63    63   0
dquot                  168    168    192   42   2 : tunables 0 0 0 : slabdata     4     4   0
rpc_inode_cache        108    108    448   36   4 : tunables 0 0 0 : slabdata     3     3   0
UDP                    336    336    576   28   4 : tunables 0 0 0 : slabdata    12    12   0
[zero-count caches omitted; the archive garbled the columns and truncated the dump mid-line]
Re: [RFC] Reproducible OOM with partial workaround
Dear Dave,

> ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
> kernel without either violating the ABI (3GB/1GB split) or doing
> something that never got merged upstream ...

Sorry to be so contradictory:

psz@como:~$ uname -a
Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5 18:34:25 EST 2013 i686 GNU/Linux
psz@como:~$ free -l
             total       used       free     shared    buffers     cached
Mem:      64446900    4729292   59717608          0      15972     480520
Low:        375836     304400      71436
High:     64071064    4424892   59646172
-/+ buffers/cache:    4232800   60214100
Swap:    134217724          0  134217724
psz@como:~$

(though I would not know about violations). But OK, I take your point
that I should move with the times.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney, Australia
Re: [RFC] Reproducible OOM with partial workaround
Dear Dave,

> Your configuration has never worked. This isn't a regression ...
> ... does not mean that we expect it to work.

Do you mean that CONFIG_HIGHMEM64G is deprecated, should not be used;
that all development is for 64-bit only?

> ... 64-bit kernels should basically be drop-in replacements ...

Will think about that. I know all my servers are 64-bit capable, will
need to check all my desktops.

---

I find it puzzling that there seems to be a sharp cutoff at 32GB RAM:
no problem under, but OOM just over; whereas I would have expected
lowmem starvation to be gradual, with OOM occurring much sooner with
64GB than with 34GB. Also, the kernel seems capable of reclaiming
lowmem, so I wonder why that fails just over the 32GB threshold.
(Obviously I have no idea what I am talking about.)

---

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney, Australia
[RFC] Reproducible OOM with partial workaround
Dear Linux-MM,

On a machine with i386 kernel and over 32GB RAM, an OOM condition is
reliably obtained simply by writing a few files to some local disk,
e.g. with:

  n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; ((n=$n+1)); done

Crash usually occurs after 16 or 32 files written. Seems that the
problem may be avoided by using mem=32G on the kernel boot, and that it
occurs with any amount of RAM over 32GB.

I developed a workaround patch for this particular OOM demo, dropping
filesystem caches when about to exhaust lowmem. However, subsequently I
observed OOM when running many processes (as yet I do not have an
easy-to-reproduce demo of this); so as I suspected, the essence of the
problem is not with FS caches.

Could you please help in finding the cause of this OOM bug? Please see
http://bugs.debian.org/695182 for details, in particular my workaround
patch
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;att=1;bug=695182

(Please reply to me directly, as I am not a subscriber to the linux-mm
mailing list.)

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney, Australia