Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

2019-08-22 Thread ndrw

On 20/08/2019 07:46, Daniel Drake wrote:

To share our results so far: despite this daemon being a quick initial
implementation, we find that it is bringing excellent results, with no more
memory pressure hangs. The system recovers in less than 30 seconds, usually
more like 10-15 seconds.


That's obviously a lot better than hard freezes, but I wouldn't call such
system lock-ups an excellent result. A PSI-triggered OOM killer would
indeed have been very useful as an emergency brake, and IMHO such a
mechanism should be built into the kernel and enabled by default. But in my
experience it does a very poor job of detecting imminent freezes on
systems without swap or with very fast swap (zram). So far, watching
MemAvailable (like earlyoom does) is far more reliable and accurate.
Unfortunately, there just doesn't seem to be a kernel feature that would
reserve a user-defined amount of memory for caches.
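
For reference, the MemAvailable-watching approach boils down to something
like the sketch below (a rough illustration only; the 512 MiB threshold,
1 s poll interval and largest-RSS victim choice are my arbitrary
assumptions, and earlyoom's real heuristics are more careful):

import os
import signal
import time

THRESHOLD_KIB = 512 * 1024   # assumed reserve: act below ~512 MiB available
POLL_INTERVAL = 1.0          # seconds between /proc/meminfo checks

def mem_available_kib():
    # MemAvailable is the kernel's own estimate of memory available to
    # new workloads without swapping (present since kernel 3.14)
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise RuntimeError("no MemAvailable in /proc/meminfo")

def largest_rss_pid():
    # naive victim choice: the process with the largest resident set
    victim, worst = None, 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/statm" % pid) as f:
                rss_pages = int(f.read().split()[1])
        except (OSError, ValueError):
            continue  # process exited or is unreadable
        if rss_pages > worst:
            victim, worst = int(pid), rss_pages
    return victim

while True:
    if mem_available_kib() < THRESHOLD_KIB:
        pid = largest_rss_pid()
        if pid:
            os.kill(pid, signal.SIGKILL)  # needs root for foreign processes
    time.sleep(POLL_INTERVAL)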



There's just one issue we've seen so far: a single report of psi reporting
memory pressure on a desktop system with 4GB RAM, which is only running
the normal desktop components plus a single gmail tab in the web browser.
psi occasionally reports high memory pressure, so then psi-monitor steps in and
kills the browser tab, which seems erroneous.


Is it Chrome/Chromium? If so, that's a known bug 
(https://bugs.chromium.org/p/chromium/issues/detail?id=333617)


Best regards,

ndrw




Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

2019-08-10 Thread ndrw

On 09/08/2019 09:57, Michal Hocko wrote:
This is useful feedback! What was your workload? Which kernel version?


With 16GB of zram swap and swappiness=60 I get avg10 memory PSI numbers
of about 10 when the swap is half filled and ~30 immediately before the
freeze. Swapping to zram has less effect on system responsiveness
compared to swapping to an SSD, so, if combined with the proposed
PSI-triggered OOM killer, this could be a viable solution.


Still, using swap only to make PSI sensing work, when triggering the OOM
killer at a non-zero amount of available memory would do the job just as
well, is a bit of overkill. I don't really need those extra few GB of
memory, I just want to get rid of the system freezes. Perhaps we could have
both heuristics.


Best regards,

ndrw




Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

2019-08-10 Thread ndrw

On 09/08/2019 11:50, Michal Hocko wrote:

We try to protect a low amount of cache. Have a look at the get_scan_count
function. But the exact amount of cache to be protected is really
hard to know without a crystal ball or an understanding of the workload.
The kernel has neither of the two.


Thank you. I'm familiarizing myself with the code. Is there anyone I 
could discuss some details with? I don't want to create too much noise here.


For example, are file pages created by mmapping files, and are anon pages
exclusively allocated on the heap (RW data)? If so, where do "streaming IO"
pages belong?



We have been thinking about this problem for a long time and couldn't
come up with anything much better than we have now. PSI is the most recent
improvement in that area. If you have better ideas then patches are
always welcome.


In general, I found there are very few user-accessible knobs for
adjusting caching, especially in the pre-OOM phase. On the other hand,
swapping and dirty page caching have many options and can even be disabled
completely.


For example, I would like to try disabling/limiting eviction of some/all
file pages (for example, exec pages), akin to disabling swapping, but
there is no such mechanism. Yes, there would likely be problems with
large RO mmapped files that would need to be addressed, but in many
applications users would be interested in having such options.


Adjusting how aggressive/conservative the system should be with the OOM 
killer also falls into this category.



[OOM killer accuracy]

That is a completely orthogonal problem, I am afraid. So far we have
been discussing _when_ to trigger the OOM killer. This is _who_ to kill. I
haven't heard any recent examples of the victim selection being way
off and killing something obviously incorrect.


You are right. I assumed earlyoom is more accurate because the OOM
killer performs better on a system that isn't stalled yet (perhaps it
does). But actually, earlyoom doesn't trigger the OOM killer at all:


https://github.com/rfjakob/earlyoom#why-not-trigger-the-kernel-oom-killer

Apparently some applications (Chrome and Electron-based tools) set their
oom_score_adj incorrectly; this matches my observations of OOM killer
behavior:


https://bugs.chromium.org/p/chromium/issues/detail?id=333617
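
For anyone who wants to experiment, the per-process knob is
/proc/<pid>/oom_score_adj. A quick way to inspect and reset it, e.g. for
Chrome, is sketched below (the process name match and the value 0 are
just examples):

import os

def set_oom_score_adj(pid, value):
    # -1000 disables OOM-killing the process entirely, 1000 makes it the
    # preferred victim; writing requires appropriate privileges
    with open("/proc/%s/oom_score_adj" % pid, "w") as f:
        f.write(str(value))

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open("/proc/%s/comm" % pid) as f:
            name = f.read().strip()
    except OSError:
        continue
    if name == "chrome":             # example match; adjust to taste
        set_oom_score_adj(pid, 0)    # 0 = neutral, let memory use decide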


Something that other people can play with to reproduce the issue would
be more than welcome.


This is the script I used; it reliably reproduces the issue:
https://github.com/ndrw6/import_postcodes/blob/master/import_postcodes.py
However, it has quite a few dependencies, needs some input data and, in
general, does a lot more than just fill up the memory. I will try to
come up with something simpler.
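
Something as crude as the following would probably do (an untested
sketch; run it only in a VM or on a machine you are prepared to reset
with sysrq-f):

chunks = []
step = 100 * 1024 * 1024  # allocate in 100 MiB steps
while True:
    # bytearray() zero-fills, so the pages are actually touched and
    # become resident rather than merely reserved
    chunks.append(bytearray(step))
    print("allocated %d MiB" % (len(chunks) * 100), flush=True)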


Best regards,

ndrw




Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

2019-08-09 Thread ndrw

On 09/08/2019 09:57, Michal Hocko wrote:

We already do have a reserve (min_free_kbytes). That gives kswapd some
room to perform reclaim in the background without obvious latencies to
allocating tasks (well, CPU is still being used, so there is still some
effect).


I tried this option in the past. Unfortunately, it didn't prevent
freezes. My understanding is that this option reserves some amount of
memory that won't be swapped out, but it does not prevent the kernel from
evicting all pages from the cache when more memory is needed.
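
For completeness, this is the knob in question (the ~1 GB value below is
just an example, not a recommendation):

# equivalent to "sysctl vm.min_free_kbytes=..."; needs root
with open("/proc/sys/vm/min_free_kbytes") as f:
    print("current reserve (kB):", f.read().strip())
with open("/proc/sys/vm/min_free_kbytes", "w") as f:
    f.write(str(1024 * 1024))  # example: ~1 GB for kswapd's watermarks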



Kswapd tries to keep a balance and free memory low but still with some
room to satisfy an immediate memory demand. Once kswapd doesn't catch up
with the memory demand we dive into the direct reclaim and that is where
people usually see latencies coming from.


Reclaiming memory is fine, of course, but not all the way down to zero
caches. No caches means that all executable pages and RO pages (e.g. fonts)
are evicted from memory and have to be constantly reloaded on every user
action, all while competing with the tasks that are using up all the
memory. This happens with or without swap, although swap does spread this
issue out in time a bit.



The main problem here is that it is hard to tell from a single
allocation latency that we have a bigger problem. As already said, the
usual thrashing scenario doesn't show a problem during the reclaim because
pages can be freed up very efficiently. The problem is that they are
refaulted very quickly, so we are effectively rotating the working set like
crazy. Compare that to a normal used-once streaming IO workload, which
generates a lot of page cache that can be recycled at a similar pace,
but the working set doesn't get freed. Free memory figures will look very
similar in both cases.


Thank you for the explanation. It is indeed a difficult problem: some
cached pages (streaming IO) will likely not be needed again and should
be discarded ASAP; others (like mmapped executable/RO pages of UI
utilities) will cause thrashing when evicted under high memory pressure.
Another aspect is that PSI is probably not the best measure for detecting
imminent thrashing. However, if it can at least detect a freeze that has
already occurred and force the OOM killer, that is still a lot better
than a dead system, which is the current user experience.
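
Incidentally, the refault distinction above is visible from userspace:
the kernel counts working-set refaults in /proc/vmstat, so the refault
rate (rather than free memory) could serve as a thrashing signal. A rough
sketch (it simply sums all counters whose names start with
workingset_refault):

import time

def refaults():
    total = 0
    with open("/proc/vmstat") as f:
        for line in f:
            if line.startswith("workingset_refault"):
                total += int(line.split()[1])
    return total

prev = refaults()
while True:
    time.sleep(1)
    cur = refaults()
    # a high sustained rate means the pages being reclaimed are the
    # working set itself, i.e. thrashing, not used-once streaming IO
    print("refaults/s:", cur - prev)
    prev = cur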



Good that earlyoom works for you.


I am giving it as an example of a heuristic that seems to work very well
for me; something to look into. And yes, I wouldn't mind having such a
mechanism built into the kernel.



  All I am saying is that this is not a
generally applicable heuristic, because we do care about a larger variety
of workloads. I should probably emphasise that the OOM killer is there
as a _last resort_ hand brake when something goes terribly wrong. It
operates at times when any user intervention would be really hard
because there is a lack of resources to be actionable.


It is indeed a last-resort solution; without it the system is unusable.
Still, accuracy matters, because killing the wrong task does not fix the
problem (the task hogging memory is still running) and may break the
system anyway if something important is killed instead.


[...]


This is useful feedback! What was your workload? Which kernel version?


I tested it by running a Python script that processes a large amount of
data in memory (it needs around 15GB of RAM). I normally run 2 instances of
that script in parallel, but for testing I started 4 of them. I sometimes
experience the same issue when using multiple regular memory-intensive
desktop applications in the manner described in the first post, but that's
harder to reproduce because of the user input needed.


[    0.00] Linux version 5.0.0-21-generic (buildd@lgw01-amd64-036) 
(gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1)) #22-Ubuntu SMP Tue Jul 2 
13:27:33 UTC 2019 (Ubuntu 5.0.0-21.22-generic 5.0.15)

AMD CPU with 4 cores, 8 threads. AMDGPU graphics stack.

Best regards,

ndrw




Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

2019-08-08 Thread ndrw

On 08/08/2019 19:59, Michal Hocko wrote:

Well, I am afraid that implementing anything like that in the kernel
will lead to many regressions and bug reports. People tend to have very
different opinions on when it is suitable to kill a potentially
important part of a workload just because memory gets low.


Are you proposing a zero memory reserve, or not having such an option
at all? I'm fine with the current default (zero reserve/margin).


I strongly prefer forcing the OOM killer while the system is still running
normally. Not just for preventing stalls: in my limited testing I found
the OOM killer on a stalled system rather inaccurate, occasionally
killing system services etc. I had a much better experience with earlyoom.



The LRU aspect doesn't help much, really. If we are reclaiming the same set
of pages because they are needed for the workload to operate, then we are
effectively thrashing no matter what kind of replacement policy you are
going to use.


In my case it would work fine (my system already works well with
earlyoom, and without it it remains responsive until the last couple
hundred MB of RAM).




PSI is giving you a metric that tells you how much time you
spend on memory reclaim. So you can start watching the system from
lower utilization already.


I've tested it on a system with 45GB of RAM, an SSD, and swap disabled (my
intention was to approximate a worst-case scenario), and it didn't really
detect the stall before it happened. I can see some activity after reaching
~42GB, but the system remains fully responsive until it suddenly freezes and
requires sysrq-f. PSI appears to increase a bit when the system is about
to run out of memory, but the change is so small it would be difficult to
set a reliable threshold. I expect the PSI numbers to increase
significantly after the stall (I wasn't able to capture them) but, as
mentioned above, I was hoping for a solution that would work before the
stall.


$ while true; do sleep 1; cat /proc/pressure/memory ; done
[starting a test script and waiting for several minutes to fill up memory]
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
some avg10=0.00 avg60=0.00 avg300=0.00 total=10389
full avg10=0.00 avg60=0.00 avg300=0.00 total=6442
some avg10=0.00 avg60=0.00 avg300=0.00 total=18950
full avg10=0.00 avg60=0.00 avg300=0.00 total=11576
some avg10=0.00 avg60=0.00 avg300=0.00 total=25655
full avg10=0.00 avg60=0.00 avg300=0.00 total=16159
some avg10=0.00 avg60=0.00 avg300=0.00 total=31438
full avg10=0.00 avg60=0.00 avg300=0.00 total=19552
some avg10=0.00 avg60=0.00 avg300=0.00 total=44549
full avg10=0.00 avg60=0.00 avg300=0.00 total=27772
some avg10=0.00 avg60=0.00 avg300=0.00 total=52520
full avg10=0.00 avg60=0.00 avg300=0.00 total=32580
some avg10=0.00 avg60=0.00 avg300=0.00 total=60451
full avg10=0.00 avg60=0.00 avg300=0.00 total=37704
some avg10=0.00 avg60=0.00 avg300=0.00 total=68986
full avg10=0.00 avg60=0.00 avg300=0.00 total=42859
some avg10=0.00 avg60=0.00 avg300=0.00 total=76598
full avg10=0.00 avg60=0.00 avg300=0.00 total=48370
some avg10=0.00 avg60=0.00 avg300=0.00 total=83080
full avg10=0.00 avg60=0.00 avg300=0.00 total=52930
some avg10=0.00 avg60=0.00 avg300=0.00 total=89384
full avg10=0.00 avg60=0.00 avg300=0.00 total=56350
some avg10=0.00 avg60=0.00 avg300=0.00 total=95293
full avg10=0.00 avg60=0.00 avg300=0.00 total=60260
some avg10=0.00 avg60=0.00 avg300=0.00 total=101566
full avg10=0.00 avg60=0.00 avg300=0.00 total=64408
some avg10=0.00 avg60=0.00 avg300=0.00 total=108131
full avg10=0.00 avg60=0.00 avg300=0.00 total=68412
some avg10=0.00 avg60=0.00 avg300=0.00 total=121932
full avg10=0.00 avg60=0.00 avg300=0.00 total=77413
some avg10=0.00 avg60=0.00 avg300=0.00 total=140807
full avg10=0.00 avg60=0.00 avg300=0.00 total=91269
some avg10=0.00 avg60=0.00 avg300=0.00 total=170494
full avg10=0.00 avg60=0.00 avg300=0.00 total=110611
[stall, sysrq-f]
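
For kernels with PSI trigger support (5.2+; the 5.0 kernel above predates
it), one could register a trigger that fires when the total stall time
within a short window crosses a threshold, instead of sampling avg10,
which stayed at 0.00 right up to the stall. A rough sketch, with
arbitrary threshold/window values:

import select

# fire when "full" stall time exceeds 100 ms within any 1 s window
# (threshold and window are arbitrary assumptions)
trigger = b"full 100000 1000000\0"

with open("/proc/pressure/memory", "r+b", buffering=0) as f:
    f.write(trigger)
    poller = select.poll()
    poller.register(f, select.POLLPRI)
    while True:
        if poller.poll():  # blocks until the kernel fires the trigger
            print("memory pressure trigger fired")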

Best regards,

ndrw




Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

2019-08-08 Thread ndrw . xf



On 8 August 2019 17:32:28 BST, Michal Hocko wrote:
>
>> Would it be possible to reserve a fixed (configurable) amount of RAM
>> for caches,
>
>I am afraid there is nothing like that available and I would even argue
>it doesn't make much sense either. What would you consider to be a
>cache? A kernel/userspace reclaimable memory? What about any other in
>kernel memory users? How would you setup such a limit and make it
>reasonably maintainable over different kernel releases when the memory
>footprint changes over time?

Frankly, I don't know. The earlyoom userspace tool works well enough for me, so 
I assumed this functionality could be implemented in the kernel. Default 
thresholds would have to be tested, but it is unlikely that zero is the optimum 
value. 

>Besides that, how does that differ from the existing reclaim mechanism?
>Once your cache hits the limit, there would have to be some sort of
>reclaim to happen and then we are back to square one when the reclaim is
>making progress but you are effectively thrashing over the hot working
>set (e.g. code pages)

By forcing the OOM killer. Reclaiming memory once the system has become 
unresponsive is precisely what I want to avoid.

>> and trigger OOM killer earlier, before most UI code is evicted from
>> memory?
>
>How does the kernel know that important memory is evicted?

I assume the current memory management policy (LRU?) is sufficient to keep the 
most frequently used pages in memory.

>If you know which task that is then you can put it into a memory cgroup
>with a stricter memory limit and have it killed before the overall
>system starts suffering.

This is what I intended to use, but I don't know how to bypass SystemD or 
configure such policies via SystemD. 
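
Going straight to the cgroup filesystem would avoid systemd entirely. A
sketch against cgroup v1 (the group name "hoggers" and the 15GB limit are
arbitrary examples; on cgroup v2 the knob is memory.max):

import os

CG = "/sys/fs/cgroup/memory/hoggers"  # hypothetical group name
os.makedirs(CG, exist_ok=True)

# cap the group; allocations beyond this trigger the cgroup's own OOM
# handling instead of dragging the whole system into reclaim
with open(os.path.join(CG, "memory.limit_in_bytes"), "w") as f:
    f.write(str(15 * 1024**3))

# move the current process into the group before launching the risky job
with open(os.path.join(CG, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))

With systemd itself, something like "systemd-run --scope -p MemoryMax=15G 
<command>" should set up an equivalent limit (MemoryLimit= on cgroup v1), 
though I haven't verified the exact property names across versions.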

>PSI is giving you a metric that tells you how much time you
>spend on memory reclaim. So you can start watching the system from
>lower utilization already.

This is fantastic news. Really. I didn't know this is how it works. Two 
potential issues, though:
1. PSI (if possible) should be normalised with respect to the cost of 
reclaiming memory (SSDs have a lower cost than HDDs). If not automatically, 
then perhaps via a user-configurable option. That's somewhat similar to having 
configurable PSI thresholds. 
2. It seems PSI measures the _rate_ at which pages are evicted from memory. 
While this may correlate with the _absolute_ amount of memory left, it is not 
the same. Perhaps weighting PSI by the absolute amount of memory used for 
caches would improve this metric.
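
Roughly what I have in mind for point 2 (an entirely illustrative sketch;
the scaling is made up and would need tuning):

def pressure_score():
    # read the "some" line from /proc/pressure/memory
    with open("/proc/pressure/memory") as f:
        some = f.readline()            # e.g. "some avg10=0.00 avg60=..."
    avg10 = float(some.split()[1].split("=")[1])

    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":")
            meminfo[key] = int(rest.split()[0])   # values are in kB

    # the smaller the remaining cache, the more each unit of stall counts
    cache_fraction = meminfo["Cached"] / meminfo["MemTotal"]
    return avg10 / max(cache_fraction, 0.01)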

Best regards,
ndrw


Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

2019-08-08 Thread ndrw . xf



On 8 August 2019 12:48:26 BST, Michal Hocko wrote:
>> 
>> Per default, the OOM killer will engage after 15 seconds of at least
>> 80% memory pressure. These values are tunable via sysctls
>> vm.thrashing_oom_period and vm.thrashing_oom_level.
>
>As I've said earlier, I would be somewhat more comfortable with kernel
>command line/module parameter based tuning, because it is less of a
>stable API, and a potential future stall detector might be completely
>independent of PSI and the current metric exported. But I can live with
>that because a period and a level sound quite generic.

Would it be possible to reserve a fixed (configurable) amount of RAM for 
caches, and trigger the OOM killer earlier, before most UI code is evicted from 
memory? In my use case, I am happy to sacrifice e.g. 0.5GB and kill runaway 
tasks _before_ the system freezes. Potentially the OOM killer would also work 
better in such conditions. I almost never work close to full memory 
capacity; it's always a single task that goes wrong and brings the system down.

The problem with PSI sensing is that it works after the fact (after the freeze 
has already occurred). It is not very different from issuing SysRq-f manually 
on a frozen system, although it would still be a handy feature for batch 
tasks and remote access. 

Best regards, 
ndrw