Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
On 20/08/2019 07:46, Daniel Drake wrote:
> To share our results so far, despite this daemon being a quick initial
> implementation, we find that it is bringing excellent results, no more
> memory pressure hangs. The system recovers in less than 30 seconds,
> usually in more like 10-15 seconds.

That's obviously a lot better than hard freezes, but I wouldn't call such system lock-ups an excellent result. A PSI-triggered OOM killer would indeed be very useful as an emergency brake, and IMHO such a mechanism should be built into the kernel and enabled by default. But in my experience it does a very poor job of detecting imminent freezes on systems without swap or with very fast swap (zram). So far, watching MemAvailable (as earlyoom does) is far more reliable and accurate. Unfortunately, there just doesn't seem to be a kernel feature that would reserve a user-defined amount of memory for caches.

> There's just one issue we've seen so far: a single report of psi
> reporting memory pressure on a desktop system with 4GB RAM which is
> only running the normal desktop components plus a single gmail tab in
> the web browser. psi occasionally reports high memory pressure, so
> then psi-monitor steps in and kills the browser tab, which seems
> erroneous.

Is it Chrome/Chromium? If so, that's a known bug:
https://bugs.chromium.org/p/chromium/issues/detail?id=333617

Best regards,
ndrw
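PS. For anyone who wants to experiment, below is a minimal Python sketch of the MemAvailable-watching approach I mean. earlyoom itself is a more careful C implementation; the 512 MiB threshold and the largest-RSS victim selection here are just illustrative assumptions.

#!/usr/bin/env python3
# Minimal sketch of an earlyoom-style watcher: poll MemAvailable and
# SIGKILL the largest process when it drops below a threshold.
# Needs root to kill other users' tasks; values are illustrative only.
import os, signal, time

THRESHOLD_KB = 512 * 1024  # assumption: act below 512 MiB available

def mem_available_kb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise RuntimeError("MemAvailable not found (kernel < 3.14?)")

def biggest_process():
    # Pick the task with the largest resident set (VmRSS), as a crude
    # victim heuristic; real earlyoom also honours oom_score_adj.
    best_pid, best_rss = None, 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/status" % pid) as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        rss = int(line.split()[1])
                        if rss > best_rss:
                            best_pid, best_rss = int(pid), rss
                        break
        except (FileNotFoundError, PermissionError):
            continue  # task exited or is inaccessible; skip it
    return best_pid

while True:
    if mem_available_kb() < THRESHOLD_KB:
        victim = biggest_process()
        if victim is not None:
            os.kill(victim, signal.SIGKILL)
    time.sleep(1)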
Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
On 09/08/2019 09:57, Michal Hocko wrote:
> This is a useful feedback! What was your workload? Which kernel
> version?

With 16GB of zram swap and swappiness=60, I get avg10 memory PSI numbers of about 10 when swap is half filled and ~30 immediately before the freeze. Swapping with zram has less effect on system responsiveness compared to swapping to an SSD, so, combined with the proposed PSI-triggered OOM killer, this could be a viable solution. Still, using swap only to make PSI sensing work, when triggering the OOM killer at non-zero available memory would do the job just as well, is a bit of an overkill. I don't really need these extra few GB of memory, I just want to get rid of system freezes. Perhaps we could have both heuristics.

Best regards,
ndrw
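PS. For reference, the zram swap used in this test can be set up roughly as follows (a sketch, run as root; it assumes the default single zram0 device and the default compression algorithm):

#!/usr/bin/env python3
# Rough sketch of a 16GB zram swap setup with swappiness=60.
# Device name and sizes are assumptions; adjust for your machine.
import subprocess

subprocess.run(["modprobe", "zram"], check=True)
with open("/sys/block/zram0/disksize", "w") as f:
    f.write("16G")  # sysfs accepts K/M/G suffixes here
subprocess.run(["mkswap", "/dev/zram0"], check=True)
subprocess.run(["swapon", "-p", "100", "/dev/zram0"], check=True)
with open("/proc/sys/vm/swappiness", "w") as f:
    f.write("60")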
Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
On 09/08/2019 11:50, Michal Hocko wrote:
> We try to protect a low amount of cache. Have a look at the
> get_scan_count function. But the exact amount of the cache to be
> protected is really hard to know without a crystal ball or
> understanding of the workload. The kernel has neither of the two.

Thank you. I'm familiarizing myself with the code. Is there anyone I could discuss some details with? I don't want to create too much noise here. For example, are file pages created by mmapping files, and are anon pages exclusively allocated on the heap (RW data)? If so, where do "streaming IO" pages belong?

> We have been thinking about this problem for a long time and couldn't
> come up with anything much better than we have now. PSI is the most
> recent improvement in that area. If you have better ideas then patches
> are always welcome.

In general, I found there are very few user-accessible knobs for adjusting caching, especially in the pre-OOM phase. On the other hand, swapping and dirty page caching have many options or can even be disabled completely. For example, I would like to try disabling/limiting eviction of some/all file pages (for example exec pages), akin to disabling swapping, but there is no such mechanism. Yes, there would likely be problems with large RO mmapped files that would need to be addressed, but in many applications users would be interested in having such options. Adjusting how aggressive/conservative the system should be with the OOM killer also falls into this category.

> [OOM killer accuracy] That is a completely orthogonal problem, I am
> afraid. So far we have been discussing _when_ to trigger the OOM
> killer. This is _who_ to kill. I haven't heard any recent examples of
> the victim selection being way off and killing something obviously
> incorrect.

You are right. I had assumed earlyoom is more accurate because the OOM killer performs better on a system that isn't stalled yet (perhaps it does). But actually, earlyoom doesn't trigger the OOM killer at all:
https://github.com/rfjakob/earlyoom#why-not-trigger-the-kernel-oom-killer
Apparently some applications (Chrome and Electron-based tools) set their oom_score_adj incorrectly, which matches my observations of OOM killer behaviour:
https://bugs.chromium.org/p/chromium/issues/detail?id=333617

> Something that other people can play with to reproduce the issue would
> be more than welcome.

This is the script I used. It reliably reproduces the issue:
https://github.com/ndrw6/import_postcodes/blob/master/import_postcodes.py
but it has quite a few dependencies, needs some input data and, in general, does a lot more than just fill up the memory. I will try to come up with something simpler; a first rough attempt is in the PS below.

Best regards,
ndrw
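PS. A first attempt at something simpler: the sketch below just fills RAM with dirty anonymous pages. I am assuming this triggers the same stall; it lacks the file IO my original script does, so treat it as a starting point rather than a faithful reproducer.

#!/usr/bin/env python3
# Fill memory in 100 MiB chunks and touch every page so nothing can
# remain shared or untouched. WARNING: on a swapless system without an
# OOM daemon this will freeze the machine.
chunks = []
while True:
    buf = bytearray(100 * 1024 * 1024)  # allocate 100 MiB
    for i in range(0, len(buf), 4096):  # dirty each 4 KiB page
        buf[i] = 1
    chunks.append(buf)
    print("allocated %d MiB" % (len(chunks) * 100), flush=True)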
Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
On 09/08/2019 09:57, Michal Hocko wrote:
> We already do have a reserve (min_free_kbytes). That gives kswapd some
> room to perform reclaim in the background without obvious latencies to
> allocating tasks (well, CPU is still used so there is still some
> effect).

I tried this option in the past. Unfortunately, it didn't prevent freezes. My understanding is that this option reserves some amount of memory that cannot be swapped out, but it does not prevent the kernel from evicting all pages from the cache when more memory is needed.

> Kswapd tries to keep a balance and free memory low but still with some
> room to satisfy an immediate memory demand. Once kswapd doesn't catch
> up with the memory demand we dive into the direct reclaim and that is
> where people usually see latencies coming from.

Reclaiming memory is fine, of course, but not all the way down to zero caches. No caches means all executable pages and RO pages (e.g. fonts) are evicted from memory and have to be constantly reloaded on every user action, all while competing with the tasks that are using up all the memory. This happens with or without swap, although swap does spread the issue out in time a bit.

> The main problem here is that it is hard to tell from a single
> allocation latency that we have a bigger problem. As already said, the
> usual thrashing scenario doesn't show a problem during the reclaim
> because pages can be freed up very efficiently. The problem is that
> they are refaulted very quickly so we are effectively rotating the
> working set like crazy. Compare that to a normal used-once streaming
> IO workload which is generating a lot of page cache that can be
> recycled at a similar pace but the working set doesn't get freed. Free
> memory figures will look very similar in both cases.

Thank you for the explanation. It is indeed a difficult problem: some cached pages (streaming IO) will likely not be needed again and should be discarded ASAP, while others (like mmapped executable/RO pages of UI utilities) will cause thrashing when evicted under high memory pressure. Another aspect is that PSI is probably not the best metric for detecting imminent thrashing. However, if it can at least detect a freeze that has already occurred and force the OOM killer, that is still a lot better than a dead system, which is the current user experience.

> Good that earlyoom works for you.

I am giving it as an example of a heuristic that seems to work very well for me. Something to look into.

> And yes, I wouldn't mind having such a mechanism built into the
> kernel. All I am saying is that this is not a generally applicable
> heuristic because we do care about a larger variety of workloads. I
> should probably emphasise that the OOM killer is there as a _last
> resort_ hand brake when something goes terribly wrong. It operates at
> times when any user intervention would be really hard because there is
> a lack of resources to be actionable.

It is indeed a last-resort solution: without it the system is unusable. Still, accuracy matters, because killing the wrong task does not fix the problem (the task hogging memory is still running) and may break the system anyway if something important is killed instead.

[...]

> This is a useful feedback! What was your workload? Which kernel
> version?

I tested it by running a python script that processes a large amount of data in memory (it needs around 15GB of RAM). I normally run 2 instances of the script in parallel, but for testing I started 4 of them. I sometimes experience the same issue when using multiple regular memory-intensive desktop applications in the manner described in the first post, but that's harder to reproduce because of the user input needed.

[ 0.00] Linux version 5.0.0-21-generic (buildd@lgw01-amd64-036) (gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1)) #22-Ubuntu SMP Tue Jul 2 13:27:33 UTC 2019 (Ubuntu 5.0.0-21.22-generic 5.0.15)

AMD CPU with 4 cores, 8 threads. AMDGPU graphics stack.

Best regards,
ndrw
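PS. For completeness, this is how one can check what min_free_kbytes actually translates to per zone (a sketch; it assumes 4 KiB pages when converting to MiB):

#!/usr/bin/env python3
# Print the per-zone min/low/high watermarks that min_free_kbytes
# feeds into, to see how large the reserve really is.
with open("/proc/sys/vm/min_free_kbytes") as f:
    print("min_free_kbytes:", f.read().strip())

zone = None
with open("/proc/zoneinfo") as f:
    for line in f:
        if line.startswith("Node"):
            zone = line.strip()            # e.g. "Node 0, zone Normal"
        else:
            tok = line.split()
            if tok and tok[0] in ("min", "low", "high"):
                pages = int(tok[1])        # counted in 4 KiB pages
                print("%s: %s = %d MiB" % (zone, tok[0], pages * 4 // 1024))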
Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
On 08/08/2019 19:59, Michal Hocko wrote:
> Well, I am afraid that implementing anything like that in the kernel
> will lead to many regressions and bug reports. People tend to have
> very different opinions on when it is suitable to kill a potentially
> important part of a workload just because memory gets low.

Are you proposing having a zero memory reserve, or not having such an option at all? I'm fine with the current default (zero reserve/margin). I strongly prefer forcing the OOM killer while the system is still running normally. Not just to prevent stalls: in my limited testing I found the OOM killer on a stalled system rather inaccurate, occasionally killing system services etc. I had a much better experience with earlyoom.

> The LRU aspect doesn't help much, really. If we are reclaiming the
> same set of pages because they are needed for the workload to operate
> then we are effectively thrashing no matter what kind of replacement
> policy you are going to use.

In my case it would work fine (my system already works well with earlyoom, and without it it remains responsive until the last couple hundred MB of RAM).

> PSI is giving you a metric that tells you how much time you spend on
> the memory reclaim. So you can start watching the system from lower
> utilization already.

I've tested it on a system with 45GB of RAM, an SSD and swap disabled (my intention was to approximate a worst-case scenario), and it didn't really detect the stall before it happened. I can see some activity after reaching ~42GB; the system remains fully responsive until it suddenly freezes and requires sysrq-f. PSI appears to increase a bit when the system is about to run out of memory, but the change is so small that it would be difficult to set a reliable threshold on it. I expect the PSI numbers to increase significantly after the stall (I wasn't able to capture them) but, as mentioned above, I was hoping for a solution that works before the stall.
$ while true; do sleep 1; cat /proc/pressure/memory ; done
[starting a test script and waiting for several minutes to fill up memory]
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
some avg10=0.00 avg60=0.00 avg300=0.00 total=10389
full avg10=0.00 avg60=0.00 avg300=0.00 total=6442
some avg10=0.00 avg60=0.00 avg300=0.00 total=18950
full avg10=0.00 avg60=0.00 avg300=0.00 total=11576
some avg10=0.00 avg60=0.00 avg300=0.00 total=25655
full avg10=0.00 avg60=0.00 avg300=0.00 total=16159
some avg10=0.00 avg60=0.00 avg300=0.00 total=31438
full avg10=0.00 avg60=0.00 avg300=0.00 total=19552
some avg10=0.00 avg60=0.00 avg300=0.00 total=44549
full avg10=0.00 avg60=0.00 avg300=0.00 total=27772
some avg10=0.00 avg60=0.00 avg300=0.00 total=52520
full avg10=0.00 avg60=0.00 avg300=0.00 total=32580
some avg10=0.00 avg60=0.00 avg300=0.00 total=60451
full avg10=0.00 avg60=0.00 avg300=0.00 total=37704
some avg10=0.00 avg60=0.00 avg300=0.00 total=68986
full avg10=0.00 avg60=0.00 avg300=0.00 total=42859
some avg10=0.00 avg60=0.00 avg300=0.00 total=76598
full avg10=0.00 avg60=0.00 avg300=0.00 total=48370
some avg10=0.00 avg60=0.00 avg300=0.00 total=83080
full avg10=0.00 avg60=0.00 avg300=0.00 total=52930
some avg10=0.00 avg60=0.00 avg300=0.00 total=89384
full avg10=0.00 avg60=0.00 avg300=0.00 total=56350
some avg10=0.00 avg60=0.00 avg300=0.00 total=95293
full avg10=0.00 avg60=0.00 avg300=0.00 total=60260
some avg10=0.00 avg60=0.00 avg300=0.00 total=101566
full avg10=0.00 avg60=0.00 avg300=0.00 total=64408
some avg10=0.00 avg60=0.00 avg300=0.00 total=108131
full avg10=0.00 avg60=0.00 avg300=0.00 total=68412
some avg10=0.00 avg60=0.00 avg300=0.00 total=121932
full avg10=0.00 avg60=0.00 avg300=0.00 total=77413
some avg10=0.00 avg60=0.00 avg300=0.00 total=140807
full avg10=0.00 avg60=0.00 avg300=0.00 total=91269
some avg10=0.00 avg60=0.00 avg300=0.00 total=170494
full avg10=0.00 avg60=0.00 avg300=0.00 total=110611
[stall, sysrq-f]

Best regards,
ndrw
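PS. Kernels newer than the 5.0 one above (5.2+, if I remember correctly) also expose a PSI trigger interface, which avoids polling the averages altogether. A sketch, with an assumed threshold of 150 ms of "some" stall time per 1 s window:

#!/usr/bin/env python3
# Register a PSI trigger on /proc/pressure/memory and block until the
# kernel reports that "some" memory stall time exceeded 150 ms within
# a 1 s window (values per Documentation/accounting/psi.rst).
import select

fd = open("/proc/pressure/memory", "r+b", buffering=0)
fd.write(b"some 150000 1000000\0")  # threshold_us, window_us

poller = select.poll()
poller.register(fd, select.POLLPRI)
while True:
    for _, events in poller.poll():
        if events & select.POLLERR:
            raise SystemExit("trigger lost (fd error)")
        if events & select.POLLPRI:
            print("memory pressure threshold crossed")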
Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
On 8 August 2019 17:32:28 BST, Michal Hocko wrote:
>> Would it be possible to reserve a fixed (configurable) amount of RAM
>> for caches,
>
> I am afraid there is nothing like that available and I would even
> argue it doesn't make much sense either. What would you consider to be
> a cache? A kernel/userspace reclaimable memory? What about any other
> in-kernel memory users? How would you set up such a limit and make it
> reasonably maintainable over different kernel releases when the memory
> footprint changes over time?

Frankly, I don't know. The earlyoom userspace tool works well enough for me, so I assumed this functionality could be implemented in the kernel. Default thresholds would have to be tested, but it is unlikely that zero is the optimum value.

> Besides that, how does that differ from the existing reclaim
> mechanism? Once your cache hits the limit, there would have to be some
> sort of reclaim to happen and then we are back to square one when the
> reclaim is making progress but you are effectively thrashing over the
> hot working set (e.g. code pages)

By forcing the OOM killer. Reclaiming memory when the system becomes unresponsive is precisely what I want to avoid.

>> and trigger OOM killer earlier, before most UI code is evicted from
>> memory?
>
> How does the kernel know that important memory is evicted?

I assume the current memory management policy (LRU?) is sufficient to keep the most frequently used pages in memory.

> If you know which task is that then you can put it into a memory
> cgroup with a stricter memory limit and have it killed before the
> overall system starts suffering.

This is what I intended to use. But I don't know how to bypass SystemD or configure such policies via SystemD (a sketch of the raw cgroup interface is in the PS below).

> PSI is giving you a metric that tells you how much time you spend on
> the memory reclaim. So you can start watching the system from lower
> utilization already.

This is fantastic news, really. I didn't know this is how it works. Two potential issues, though:

1. PSI (if possible) should be normalised wrt the memory reclaiming cost (SSDs have a lower cost than HDDs). If not automatically, then perhaps via a user-configurable option. That's somewhat similar to having configurable PSI thresholds.

2. It seems PSI measures the _rate_ at which pages are evicted from memory. While this may correlate with the _absolute_ amount of memory left, it is not the same. Perhaps weighting PSI by the absolute amount of memory used for caches would improve this metric.

Best regards,
ndrw
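PS. For the record, the raw cgroup v2 interface mentioned above looks roughly like this (a sketch, run as root; it assumes cgroup2 is mounted at /sys/fs/cgroup, and the group name and limits are made up for the example):

#!/usr/bin/env python3
# Cap a task at 2G so it gets reclaimed/killed before the rest of the
# system suffers. Usage: cap.py <pid>
import os, sys

CG = "/sys/fs/cgroup/hog-test"  # hypothetical group name
os.makedirs(CG, exist_ok=True)
with open(os.path.join(CG, "memory.max"), "w") as f:
    f.write("2G")          # hard limit: OOM kill inside the group
with open(os.path.join(CG, "memory.high"), "w") as f:
    f.write("1536M")       # soft limit: throttle/reclaim before that
with open(os.path.join(CG, "cgroup.procs"), "w") as f:
    f.write(str(int(sys.argv[1])))  # move the given PID into the group

With SystemD, the same can apparently be achieved without touching cgroupfs directly, e.g. via "systemd-run --scope -p MemoryMax=2G <command>".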
Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
On 8 August 2019 12:48:26 BST, Michal Hocko wrote:
>> Per default, the OOM killer will engage after 15 seconds of at least
>> 80% memory pressure. These values are tunable via sysctls
>> vm.thrashing_oom_period and vm.thrashing_oom_level.
>
> As I've said earlier I would be somehow more comfortable with a kernel
> command line/module parameter based tuning because it is less of a
> stable API and a potential future stall detector might be completely
> independent of PSI and the current metric exported. But I can live
> with that because a period and level sound quite generic.

Would it be possible to reserve a fixed (configurable) amount of RAM for caches, and trigger the OOM killer earlier, before most UI code is evicted from memory? In my use case, I am happy to sacrifice e.g. 0.5GB and kill runaway tasks _before_ the system freezes. Potentially the OOM killer would also work better in such conditions. I almost never work close to full memory capacity; it's always a single task that goes wrong and brings the system down.

The problem with PSI sensing is that it works after the fact (after the freeze has already occurred). It is not very different from issuing SysRq-f manually on a frozen system, although it would still be a handy feature for batch tasks and remote access.

Best regards,
ndrw
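PS. Assuming the patch from this thread is applied (these knobs do not exist in mainline kernels), I would expect the proposed sysctls to be tunable like any other, i.e. something like the sketch below. The units, seconds and percent, are my reading of the defaults quoted above.

#!/usr/bin/env python3
# Illustrative values only; the quoted defaults are 15 s and 80%.
for knob, value in (("thrashing_oom_period", "10"),
                    ("thrashing_oom_level", "70")):
    with open("/proc/sys/vm/%s" % knob, "w") as f:
        f.write(value)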