Konstantin,

Thanks for your previous information. We have set aside a good deal of time to dig into this issue further, and while we are not kernel developers, I believe we are fairly close to determining the root cause. We will likely need your assistance to fully isolate the remainder of these issues.

So, here is an example of one of our nodes and the problem child that we have found:

Excerpt from /proc/meminfo:
MemTotal:       196485548 kB
MemFree:         4268192 kB
MemAvailable:   122103908 kB
Slab:           126500392 kB
SReclaimable:   113541672 kB

Reviewing this information: we have 192GB of RAM in this system, supposedly 122GB of "MemAvailable", and 113GB of that is stored in reclaimable slab space.

Based on this alone, we should be able to rely on memory pressure to reclaim that 113GB worth of memory, and very little if any memory should go to swap. However, this does not happen, as evidenced by dozens of nodes digging into swap. So let's dig further.
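To illustrate the math, here is a small helper (a sketch of my own, not output from any standard tool; the function name and optional file argument are mine) that pulls the two relevant fields out of /proc/meminfo and shows how much of "MemAvailable" is really just reclaimable slab:

```shell
# Sketch: what fraction of MemAvailable is reclaimable slab?
# Pass a saved copy of /proc/meminfo as $1 for testing; defaults to the live file.
availability_summary() {
    awk '/^(MemAvailable|SReclaimable):/ { v[substr($1, 1, length($1)-1)] = $2 }
         END {
             printf "MemAvailable: %.1f GB\n", v["MemAvailable"] / 1048576
             printf "SReclaimable: %.1f GB (%d%% of MemAvailable)\n",
                    v["SReclaimable"] / 1048576,
                    100 * v["SReclaimable"] / v["MemAvailable"]
         }' "${1:-/proc/meminfo}"
}
[ -r /proc/meminfo ] && availability_summary
```

With the numbers quoted above, it reports SReclaimable as 92% of MemAvailable, i.e. almost all of the "available" memory is slab the kernel is supposed to hand back under pressure.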

Looking at the Slab space in detail with slabtop we are provided with the following:

Active / Total Objects (% used)    : 752665180 / 754941816 (99.7%)
 Active / Total Slabs (% used)      : 31612963 / 31612963 (100.0%)
 Active / Total Caches (% used)     : 152 / 181 (84.0%)
 Active / Total Size (% used)       : 123299873.38K / 123780755.78K (99.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.16K / 15.25K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
570457209 569865854  99%    0.19K 27164629       21 108658516K dentry
166341504 166231690  99%    0.06K 2599086       64  10396344K kmalloc-64
3776724 3380646  89%    0.19K 179844       21    719376K kmalloc-192
3325284 3220909  96%    1.07K 1108428        3   4433712K ext4_inode_cache

So this is where things start to get interesting.

570457209 569865854  99%    0.19K 27164629       21 108658516K dentry

This claims we have 570 million dentry cache entries in the kernel, and a whopping 569 million of those are active?

But how is this possible, when a total inode count across all customers and containers gives us only 34970144 inodes? Somehow the kernel is caching loads of dentries for files that do not exist.
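Just to spell out the mismatch with simple shell arithmetic on the two figures above:

```shell
# Cached dentries vs. real inodes on this node (numbers quoted above)
inodes=34970144
dentries=570457209
echo "cached dentries per real inode: $((dentries / inodes))"
```

That works out to roughly 16 cached dentries for every inode that actually exists, so the bulk of the cache cannot correspond to real files.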

We can partially inspect this via the following file; however, the negative-dentry counter was only recently introduced in the 3.10 branch and does not appear to be entirely accurate.

cat /proc/sys/fs/dentry-state
569791041       569357616       45      0       47217194        0

So this claims roughly 47 million negative entries out of 569 million; it reports some, but not a huge percentage of them. On other servers this number actually reports drastically higher than the total dentry count, so we will ignore the 5th column for now.
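For reference, here is a small helper that labels those columns. The field names (nr_dentry, nr_unused, age_limit, want_pages, nr_negative, plus a dummy) are taken from the mainline kernel's struct dentry_stat_t, and as noted the negative count is not reliable on this branch:

```shell
# Label the six columns of /proc/sys/fs/dentry-state.
dentry_summary() {
    # $1 is one line in dentry-state format; split it on whitespace
    set -- $1
    nr_dentry=$1 nr_unused=$2 nr_negative=$5
    printf 'total: %s  unused: %s  negative: %s (%d%% of total)\n' \
        "$nr_dentry" "$nr_unused" "$nr_negative" \
        $(( 100 * nr_negative / nr_dentry ))
}
dentry_summary "$(cat /proc/sys/fs/dentry-state 2>/dev/null ||
                  echo '569791041 569357616 45 0 47217194 0')"
```

On the snapshot above this shows only 8% of the cache reported as negative, despite almost none of it corresponding to real inodes.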

In testing we have replicated this behavior to a degree. You can create a snapshot of a VPS, mount it in a new location, stat all of the files, and commit millions of additional entries to the dentry cache. Ploop, in this instance, does properly clean these entries up when the image is unmounted, so this part is good.
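The replication test looks roughly like this (a sketch; /mnt/ct-snapshot is a hypothetical mount point for the cloned ploop image):

```shell
# Walk a mounted snapshot, stat everything, and watch nr_dentry
# (first field of dentry-state) grow by roughly one entry per path.
nr_dentry() { awk '{print $1}' "${1:-/proc/sys/fs/dentry-state}"; }

before=$(nr_dentry 2>/dev/null); before=${before:-0}
find /mnt/ct-snapshot -xdev -exec stat {} + > /dev/null 2>&1
after=$(nr_dentry 2>/dev/null); after=${after:-0}
echo "dentries added by the walk: $((after - before))"
```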

However, we kept testing common operations. One thing we run from time to time is a live ploop compact, since ploop images continually grow based on file usage within them and must be resized periodically. Here's an example:

Node has a total user space of 31205700 inodes.

Dentry cache starts out at the following:

cat /proc/sys/fs/dentry-state
56585091 56124555 45 0 54216828 0

We then ran a pcompact across all containers to look for orphaned space.


pcompact completed:

cat /proc/sys/fs/dentry-state
98663884 96136064 45 0 63732825 0

We gained 42 million dentry entries from the pcompact process.
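In raw numbers (first and fifth columns of the two dentry-state snapshots above):

```shell
# Growth across the pcompact run, taken from the dentry-state lines above.
before_total=56585091; after_total=98663884
before_neg=54216828;   after_neg=63732825
echo "dentries added by pcompact: $((after_total - before_total))"
echo "negative dentries added:    $((after_neg - before_neg))"
```

Only ~9.5 million of the ~42 million new entries show up in the negative count, consistent with that column being unreliable.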


Overall, I believe at the beginning of our conversation you alluded to this problem without going into detail when you mentioned the following: "Most often it's some kind of homebrew backup processes - just because it's their job to read files from disk while performing a backup - and thus they generate a lot of pagecache."

Based on my research this is an over-simplification of the core issue. Once you begin to research dentry_cache and the issues surrounding it, you notice a pattern: on Linux servers that deal with large numbers of inodes and very frequent access patterns (i.e. lots of busy virtual machines running as web servers, temporary files, cache files, etc.), the dentry_cache problem presents itself.

Now, in the dcache.c code base you find the sysctl "vfs_cache_pressure", which can be adjusted to make the kernel reclaim dentry_cache more aggressively.

Great, we have a solution... except it does not work, as confirmed here: https://access.redhat.com/solutions/55818

An excerpt from this solution specifically states:
"We can see high dentry_cache usage on the systems those who are running some programs which are opening and closing huge number of files. Sometimes high dentry_cache leads the system to run out of memory in such situations performance gets severly impacted as the system will start using swap space."

As well as:

Diagnostic Steps
Testing attempted using vm.vfs_cache_pressure values >100 has no effect.
Testing using the workaround of echo 2 > /proc/sys/vm/drop_caches immediately reclaims almost all dentry_cache memory.
use slabtop to monitor how much dentry_cache is in use:
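For completeness, here is the drop_caches workaround we ended up using, wrapped with a before/after measurement (a sketch; it needs root, and "echo 2" drops only reclaimable dentry/inode caches, not pagecache):

```shell
# Drop reclaimable dentry/inode caches and report how much slab came back.
sreclaimable_kb() { awk '/^SReclaimable:/ {print $2}' "${1:-/proc/meminfo}"; }

before=$(sreclaimable_kb 2>/dev/null); before=${before:-0}
sync                                      # flush dirty data first
[ -w /proc/sys/vm/drop_caches ] && echo 2 > /proc/sys/vm/drop_caches
after=$(sreclaimable_kb 2>/dev/null); after=${after:-0}
echo "SReclaimable freed: $(( (before - after) / 1024 )) MB"
```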


So, let us continue down the rabbit hole of dentry_cache and we come across this discussion w/ Linus about the topic: https://patchwork.kernel.org/patch/9884869/#20866043

The long and short of it is that Linus suggests the responsibility for cleaning up and preventing the buildup of negative/bad dentry_cache entries falls mostly on user space, with rm given as the example.

This problem uniquely impacts OS-level virtualization systems like VZ/Ploop, Docker, LXC, etc. that share a common kernel and filesystem layer. Because this cache is intentionally not separated per cgroup, you end up with a massive mess. *If* the kernel and memory pressure actually cleaned these entries properly we would not have any issue, but the fact is this does not work. No amount of cache_pressure removes any significant amount of the reclaimable slab space. And if cache pressure doesn't do it, that explains why an application requiring extra memory doesn't either.

As an example, we took a similar node, used the above-referenced /proc/sys/vm/drop_caches to purge the dentry_cache, and out of ~100GB of slab space we regained 80+GB. The dentry_cache is now slowly creeping back up, growing by 10-20 million entries per day, but after 5 days I still have 80GB of free memory for customer applications and only ~40GB of buffers/cache, most of which still seems to be dentry_cache related. Also, after dropping caches and freeing memory, total consumed swap finally begins to drop instead of continually growing. This particular node, which had 100GB in the dentry_cache, also had 20GB worth of swap in use. Yet now it operates beautifully with 80GB of free memory.

This problem seems to be big enough that RedHat has even patched their kernel (unverified by us as of yet) to include limits for at least negative dentries - https://access.redhat.com/solutions/4982351

There is also a separate suggested patch here that has had much discussion as well on the issue - https://lwn.net/Articles/814535/

I would appreciate your review and insight on this. I am not a kernel developer, much less a programmer, but the key things I take away from this are as follows:


1) Systems report a substantial amount of reclaimable memory that cannot actually be reclaimed.

2) After ~100 days of uptime and ~30 million inodes, dentry_cache grows to an insane size, numbering in the hundreds of millions of entries and consuming 100+GB of system memory. We have some nodes with close to 1 billion entries.

3) vfs_cache_pressure has no impact on recovering this space.

4) Real-world active applications are somehow given lower priority than dentry_cache / reclaimable slab space and are forced into swap instead of the dentry_cache / slab being cleared.

5) ploop compact seems to dramatically increase the unrecoverable dentry_cache and may be one of the core applications adding to the bloat, but many applications appear to generate dentry bloat.


We are attempting to dig further into this using systemtap and the information provided here - https://access.redhat.com/articles/2850581 - however the VZ kernels do not play well with this base parser and are preventing us from getting accurate details at this time. The temporary fix is to purge all of the inode/dentry cache, but this is a band-aid for a problem that shouldn't exist; all current mechanisms that are supposed to handle this automatically just, quite frankly, don't work.


Thanks for your time and I look forward to your reply.



On 7/23/20 12:37 PM, Konstantin Khorenko wrote:
On 07/23/2020 06:34 PM, CoolCold wrote:
Hello!

1st - great work guys! Dealing with LXC and even LXD makes me miss my good old OpenVZ box because of tech excellence! Keep going! 2nd - my 2 cents for content - I'm not a native speaker, but still suggest some small fixes.

1. Thank you very much for the feedback!
And you are very welcome back to use OpenVZ instead of LXC again. :)

2. And many thanks for content corrections!
i've just created a wiki page for tcache - decided this info should be saved somewhere publicly available.
i've also added a section how to enabled/disable tcache for Containers.

And you are very welcome to edit the wiki page as well. :)

https://wiki.openvz.org/Tcache

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On Thu, Jul 23, 2020 at 9:52 PM Konstantin Khorenko <[email protected] <mailto:[email protected]>> wrote:

    On 07/22/2020 03:04 PM, Daniel Pearson wrote:

    >> b) you can disable tcache for this Container
    >> memcg::memory.disable_cleancache
    >>       (raise your hand if you wish me to explain what tcache is)
    >
    > I'm all for additional information as it can help to form proper
    > opinions if you don't mind providing it.

    Hope after reading it you'll catch yourself on an idea that now
    you are aware of one more
    small feature which makes VZ is really cool and that there are a
    lot of things which
    just work somewhere in the background simply (and silently)
    making it possible for you
    to utilize the hardware at maximum. :)

    Tcache
    ======

    Brief tech explanation:
    =======================
    Transcendent file cache (tcache) is a driver for cleancache
    https://www.kernel.org/doc/html/v4.18/vm/cleancache.html ,
    which stores reclaimed pages in memory unmodified. Its purpose it to
    adopt pages evicted from a memory cgroup on _local_ pressure
    (inside a Container),
    so that they can be fetched back later without costly disk accesses.

    Detailed explanation:
    =====================
    Tcache is intended increase the overall Hardware Node performance
    only

Intented "to"

    on undercommitted Nodes, i.e. sum of all Containers memory limits
    on the Node

i.e. "where total sum of all Containers memory limit values placed on the Node"

    is less than Hardware Node RAM size.

    Imagine a situation: you have a Node with 1Tb of RAM,
    you run 500 Containers on it limited by 1Gb of memory each (no
    swap for simplicity).
    Let's consider Container to be more or less identical, similar
    load, similar activity inside.
    => normally those Containers must use 500Gb of physical RAM at
    max, right,
    and 500Gb will be just free on the Node.

    You think it's simple situation - ok, the node is underloaded,
    let's put more Containers there,
    but that's not always true - it depends on what is the bottleneck
    on the Node,
    which depends on real workload of Containers running on the Node.
    But most often in real life - the disk becomes the bottleneck
    first, not the RAM, not the CPU.

    Example: let's assume all those Containers run, say, cPanel,
    which by default collect some stats
    every, say, 15 minutes - the stat collection process is run via
    crontab.

    (Side note: randomizing times of crontab jobs - is a good idea,
    but who usually does this
    for Containers? We did it for application templates we shipped in
    Virtuozzo, but lot of
    software is just installed and configured inside Containers, we
    cannot do this. And often
    Hosting Providers are not allowed to touch data in Containers -
    so most often cron jobs are
    not randomized.)

    Ok, it does not matter how, but let's assume we get such a
    workload - every, say, 15 minutes
    (it's important that data access it quite rare), each Container
    accesses many small files,
    let it be just 100 small files to gather stats and save it somewhere.
    In 500 Containers. Simultaneously.
    In parallel with other regular i/o workload.
    On HDDs.

    It's nightmare for disk subsystem, you know,  if an HDD provides
    100 IOPS,
    it will take 50000/100/60 = 8.(3) minutes(!) to handle.
    OK, there could be RAID, let it is able to handle 300 IOPS, it
    results in
    2.(7) minutes, and we forgot about other regular i/o,
    so it means every 15 minutes, the Node became almost unresponsive
    for several minutes
    until it handles all that random i/o generated by stats collection.

    You can ask - but why _every_ 15 minutes? You've read once a file
    and it resides in the
    Container pagecache!
    That's true, but here comes _15 minutes_ period. The larger
    period - the worse.
    If a Container is active enough, it just reads more and more
    files - website data,
    pictures, video clips, files of a fileserver, don't know.
    The thing is in 15 minutes it's quite possible a Container reads
    more than its RAM limit
    (remember - only 1Gb in our case!), and thus all old pagecache is
    dropped, substituted
    with the fresh one.
    And thus in 15 minutes it's quite possible you'll have to read
    all those 100 files in each
    Container from disk.

    And here comes tcache to save us: let's don't completely drop
    pagecache which is
    reclaimed from a Container (on local(!) reclaim), but save this
    pagecache in
    a special cache (tcache) on the Host in case there is free RAM on
    the Host.

    And in 15 minutes when all Containers start to access lot of
    small files again -
    those files data will be get back into Container pagecache
    without reading from
    physical disk - viola, we saves IOPS, no Node stuck anymore.

    Q: can a Container be so active (i.e. read so much from disk)
    that this "useful"
    pagecache is dropped even from tcache.

missing question mark - ?

    A: Yes. But tcache extends the "safe" period.

    Q: mainstream? LXC/Proxmox?
    A: No, it's Virtuozzo/OpenVZ specific.
        "cleancache" - the base for tcache it in mainstream, it's
    used for Xen.
        But we (VZ) wrote a driver for it and use it for Containers
    as well.

    Q: i use SSD, not HDD, does tcache help me?
    A: SSD can provide much more IOPS, thus the Node's performance
    increase caused by tcache
        is less, but still reading from RAM (tcache is in RAM) is
    faster than reading from SSD.

is less "significant"



    >> c) you can limit the max amount of memory which can be used for
    >> pagecache for this Container
    >>       memcg::memory.cache.limit_in_bytes
    >
    > This seems viable to test as well. Currently it seems to be
    utilizing a
    > high number 'unlimited' default. I assume the only way to set
    this is to
    > directly interact with the memory cgroup and not via a standard ve
    > config value?

    Yes, you are right.
    We use this setting for some internal system cgroups running
    processes
    which are known to generate a lot of pagecache which won't be
    used later for sure.

     From my perspective it's not fair to apply such a setting to a
    Container
    globally - well, CT owner pay for an amount of RAM, it should be
    able to use
    this RAM for whatever he wants to - even for pagecache,
    so limiting the pagecache for a Container is not a tweak we is
    advised to be used
    against a Container => no standard config parameter.

    Note: disabling tcache for a Container is completely fair,
    you disable just an optimization for the whole Hardware Node
    performance,
    but all RAM configured for a Container - is still available to
    the Container.
    (but also no official config value for that - most often it
    helps, not hurts)


    > I assume regardless if we utilized vSwap or not, we would
    likely still
    > experience these additional swapping issues, presumably from
    pagecache
    > applications, or would the usage of vSwap intercept some of
    these items
    > thus preventing them from being swapped to disk?

    vSwap - is the optimization for swapping process _local to a
    Container_,
    it can prevent some Container anonymous pages to be written to
    the physical swap,
    if _local_ Container reclaim decides to swapout something.

    At the moment you experience swapping on the Node level.
    Even if some Container's processes are put to the physical swap,
    it's a decision of the global reclaim mechanism,
    so it's completely unrelated to vSwap =>
    even if you assign some swappages to Containers and thus enable
    vSwap for those Containers,
    i should not influence anyhow on global Node level memory
    pressure and
    will not result in any difference in the swapping rate into
    physical swap.

    Hope that helps.

    --
    Best regards,

    Konstantin Khorenko,
    Virtuozzo Linux Kernel Team
    _______________________________________________
    Users mailing list
    [email protected] <mailto:[email protected]>
    https://lists.openvz.org/mailman/listinfo/users



--
Best regards,
[COOLCOLD-RIPN]




--
Sincerely
Daniel C Pearson
COO KnownHost, LLC
https://www.knownhost.com

