Konstantin,

Thanks for your previous information. We have set aside a good deal of time to dig into this issue further, and while we are not kernel developers, I believe we are fairly close to determining the root cause. We will likely need your assistance to fully isolate the remainder of these issues.

So, here is an example of one of our nodes and the problem child that we have found:

Excerpt from /proc/meminfo:
MemTotal:       196485548 kB
MemFree:         4268192 kB
MemAvailable:   122103908 kB
Slab:           126500392 kB
SReclaimable:   113541672 kB

Reviewing this information: we have 192GB of RAM in this system, supposedly 122GB of "MemAvailable", and 113GB of that is stored in reclaimable slab space.

Based on this alone, we should be able to rely on memory pressure to reclaim that 113GB worth of memory, and very little if any memory should go to swap. However, this does not happen, as evidenced by dozens of nodes digging into swap. So let's dig further.
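To illustrate the math, here is a small helper (a sketch of my own, not output from any standard tool; the function name and optional file argument are mine) that pulls the two relevant fields out of /proc/meminfo and shows how much of "MemAvailable" is really just reclaimable slab:

```shell
# Sketch: what fraction of MemAvailable is reclaimable slab?
# Pass a saved copy of /proc/meminfo as $1 for testing; defaults to the live file.
availability_summary() {
    awk '/^(MemAvailable|SReclaimable):/ { v[substr($1, 1, length($1)-1)] = $2 }
         END {
             printf "MemAvailable: %.1f GB\n", v["MemAvailable"] / 1048576
             printf "SReclaimable: %.1f GB (%d%% of MemAvailable)\n",
                    v["SReclaimable"] / 1048576,
                    100 * v["SReclaimable"] / v["MemAvailable"]
         }' "${1:-/proc/meminfo}"
}
[ -r /proc/meminfo ] && availability_summary
```

With the numbers quoted above, it reports SReclaimable as 92% of MemAvailable, i.e. almost all of the "available" memory is slab the kernel is supposed to hand back under pressure.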

Looking at the Slab space in detail with slabtop we are provided with the following:

Active / Total Objects (% used)    : 752665180 / 754941816 (99.7%)
 Active / Total Slabs (% used)      : 31612963 / 31612963 (100.0%)
 Active / Total Caches (% used)     : 152 / 181 (84.0%)
 Active / Total Size (% used)       : 123299873.38K / 123780755.78K (99.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.16K / 15.25K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
570457209 569865854  99%    0.19K 27164629       21 108658516K dentry
166341504 166231690  99%    0.06K 2599086       64  10396344K kmalloc-64
3776724 3380646  89%    0.19K 179844       21    719376K kmalloc-192
3325284 3220909  96%    1.07K 1108428        3   4433712K ext4_inode_cache

So this is where things start to get interesting.

570457209 569865854  99%    0.19K 27164629       21 108658516K dentry

This claims we have 570 million dentry cache entries in the kernel, and a whopping 569 million of those are active?

But how is this possible, when a total inode count across all customers and containers gives us only 34970144 inodes? Somehow the kernel is caching loads of dentries for files that do not exist.
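Just to spell out the mismatch with simple shell arithmetic on the two figures above:

```shell
# Cached dentries vs. real inodes on this node (numbers quoted above)
inodes=34970144
dentries=570457209
echo "cached dentries per real inode: $((dentries / inodes))"
```

That works out to roughly 16 cached dentries for every inode that actually exists, so the bulk of the cache cannot correspond to real files.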

We can partially inspect this via the following file; however, the negative-dentry counter was only recently introduced in the 3.10 branch and does not appear to be entirely accurate.

cat /proc/sys/fs/dentry-state
569791041       569357616       45      0       47217194        0

So this claims roughly 47 million negative entries out of 569 million; it reports some, but not a huge percentage of them. On other servers this number actually reports drastically higher than the total dentry count, so we will ignore the 5th column for now.
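For reference, here is a small helper that labels those columns. The field names (nr_dentry, nr_unused, age_limit, want_pages, nr_negative, plus a dummy) are taken from the mainline kernel's struct dentry_stat_t, and as noted the negative count is not reliable on this branch:

```shell
# Label the six columns of /proc/sys/fs/dentry-state.
dentry_summary() {
    # $1 is one line in dentry-state format; split it on whitespace
    set -- $1
    nr_dentry=$1 nr_unused=$2 nr_negative=$5
    printf 'total: %s  unused: %s  negative: %s (%d%% of total)\n' \
        "$nr_dentry" "$nr_unused" "$nr_negative" \
        $(( 100 * nr_negative / nr_dentry ))
}
dentry_summary "$(cat /proc/sys/fs/dentry-state 2>/dev/null ||
                  echo '569791041 569357616 45 0 47217194 0')"
```

On the snapshot above this shows only 8% of the cache reported as negative, despite almost none of it corresponding to real inodes.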

In testing we have replicated this behavior to a degree. You can create a snapshot of a VPS, mount it in a new location, stat all of the files, and commit millions of additional entries to the dentry cache. Ploop, in this instance, does properly clean these entries up when the image is unmounted, so this part is good.
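The replication test looks roughly like this (a sketch; /mnt/ct-snapshot is a hypothetical mount point for the cloned ploop image):

```shell
# Walk a mounted snapshot, stat everything, and watch nr_dentry
# (first field of dentry-state) grow by roughly one entry per path.
nr_dentry() { awk '{print $1}' "${1:-/proc/sys/fs/dentry-state}"; }

before=$(nr_dentry 2>/dev/null); before=${before:-0}
find /mnt/ct-snapshot -xdev -exec stat {} + > /dev/null 2>&1
after=$(nr_dentry 2>/dev/null); after=${after:-0}
echo "dentries added by the walk: $((after - before))"
```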

However, we kept testing common operations. One thing we run from time to time is a live ploop compact, since ploop images continually grow based on file usage within them and must be resized periodically. Here's an example:

Node has a total user space of 31205700 inodes.

Dentry cache starts out at the following:

cat /proc/sys/fs/dentry-state
56585091 56124555 45 0 54216828 0

We then ran a pcompact across all containers to look for orphaned space.


pcompact completed:

cat /proc/sys/fs/dentry-state
98663884 96136064 45 0 63732825 0

We gained 42 million dentry entries from the pcompact process.
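In raw numbers (first and fifth columns of the two dentry-state snapshots above):

```shell
# Growth across the pcompact run, taken from the dentry-state lines above.
before_total=56585091; after_total=98663884
before_neg=54216828;   after_neg=63732825
echo "dentries added by pcompact: $((after_total - before_total))"
echo "negative dentries added:    $((after_neg - before_neg))"
```

Only ~9.5 million of the ~42 million new entries show up in the negative count, consistent with that column being unreliable.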


Overall, I believe at the beginning of our conversation you alluded to this problem without going into detail when you mentioned the following: "Most often it's some kind of homebrew backup processes - just because it's their job to read files from disk while performing a backup - and thus they generate a lot of pagecache."

Based on my research this is an over-simplification of the core issue. Once you begin to research dentry_cache and the issues surrounding it, you notice a pattern: on Linux servers that deal with large numbers of inodes and very frequent access patterns (i.e. lots of busy virtual machines running as web servers, temporary files, cache files, etc.), the dentry_cache problem presents itself.

Now, in the dcache.c code base you find the sysctl "vfs_cache_pressure", which can be adjusted to make the kernel reclaim dentry_cache more aggressively.

Great, we have a solution... except it does not work, as confirmed here: https://access.redhat.com/solutions/55818

An excerpt from this solution specifically states:
"We can see high dentry_cache usage on the systems those who are running some programs which are opening and closing huge number of files. Sometimes high dentry_cache leads the system to run out of memory in such situations performance gets severly impacted as the system will start using swap space."

As well as:

Diagnostic Steps
Testing attempted using vm.vfs_cache_pressure values >100 has no effect.
Testing using the workaround of echo 2 > /proc/sys/vm/drop_caches immediately reclaims almost all dentry_cache memory.
use slabtop to monitor how much dentry_cache is in use:
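For completeness, here is the drop_caches workaround we ended up using, wrapped with a before/after measurement (a sketch; it needs root, and "echo 2" drops only reclaimable dentry/inode caches, not pagecache):

```shell
# Drop reclaimable dentry/inode caches and report how much slab came back.
sreclaimable_kb() { awk '/^SReclaimable:/ {print $2}' "${1:-/proc/meminfo}"; }

before=$(sreclaimable_kb 2>/dev/null); before=${before:-0}
sync                                      # flush dirty data first
[ -w /proc/sys/vm/drop_caches ] && echo 2 > /proc/sys/vm/drop_caches
after=$(sreclaimable_kb 2>/dev/null); after=${after:-0}
echo "SReclaimable freed: $(( (before - after) / 1024 )) MB"
```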


So, let us continue down the rabbit hole of dentry_cache and we come across this discussion w/ Linus about the topic: https://patchwork.kernel.org/patch/9884869/#20866043

The long and short of it is that Linus suggests the responsibility for cleaning up and preventing the buildup of negative/bad dentry_cache entries falls mostly on user space, with rm given as the example.

This problem uniquely impacts OS-level virtualization systems like VZ/Ploop, Docker, LXC, etc. that share a common kernel and filesystem layer. Because this cache is intentionally not separated per cgroup, you end up with a massive mess. *If* the kernel and memory pressure actually cleaned these entries properly we would not have any issue, but the fact is this does not work. No amount of cache_pressure removes any significant amount of the reclaimable slab space. And if cache pressure doesn't do it, that explains why an application requiring extra memory doesn't either.

As an example, we took a similar node, used the above-referenced /proc/sys/vm/drop_caches to purge the dentry_cache, and out of ~100GB of slab space we regained 80+GB. The dentry_cache is now slowly creeping back up, growing by 10-20 million entries per day, but after 5 days I still have 80GB of free memory for customer applications and only ~40GB of buffers/cache, most of which still seems to be dentry_cache related. Also, after dropping caches and freeing memory, total consumed swap finally begins to drop instead of continually growing. This particular node, which had 100GB in the dentry_cache, also had 20GB worth of swap in use. Yet now it operates beautifully with 80GB of free memory.

This problem seems to be big enough that RedHat has even patched their kernel (unverified by us as of yet) to include limits for at least negative dentries - https://access.redhat.com/solutions/4982351

There is also a separate suggested patch here that has had much discussion as well on the issue - https://lwn.net/Articles/814535/

I would appreciate your review and insight on this. I am not a kernel developer, much less a programmer, but the key things I take away from this are as follows:


1) Systems report a substantial amount of reclaimable memory that cannot actually be reclaimed.

2) After ~100 days of uptime and ~30 million inodes, dentry_cache grows to an insane size, numbering in the hundreds of millions of entries and consuming 100+GB of system memory. We have some nodes with close to 1 billion entries.

3) vfs_cache_pressure has no impact on recovering this space.

4) Real-world active applications are somehow given lower priority than dentry_cache / reclaimable slab space and are forced into swap instead of the dentry_cache / slab being cleared.

5) ploop compact seems to dramatically increase the unrecoverable dentry_cache and may be one of the core applications adding to the bloat, but many applications appear to generate dentry bloat.


We are attempting to dig further into this using systemtap and the information provided here - https://access.redhat.com/articles/2850581 - however the VZ kernels do not play well with this base parser and are preventing us from getting accurate details at this time. The temporary fix is to purge all of the inode/dentry cache, but this is a band-aid for a problem that shouldn't exist; all current mechanisms that are supposed to handle this automatically just, quite frankly, don't work.


Thanks for your time and I look forward to your reply.



On 7/23/20 12:37 PM, Konstantin Khorenko wrote:
On 07/23/2020 06:34 PM, CoolCold wrote:
Hello!

1st - great work guys! Dealing with LXC and even LXD makes me miss my good old OpenVZ box because of tech excellence! Keep going! 2nd - my 2 cents for content - I'm not a native speaker, but still suggest some small fixes.

1. Thank you very much for the feedback!
And you are very welcome back to use OpenVZ instead of LXC again. :)

2. And many thanks for content corrections!
i've just created a wiki page for tcache - decided this info should be saved somewhere publicly available.
i've also added a section how to enabled/disable tcache for Containers.

And you are very welcome to edit the wiki page as well. :)

https://wiki.openvz.org/Tcache

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On Thu, Jul 23, 2020 at 9:52 PM Konstantin Khorenko <[email protected] <mailto:[email protected]>> wrote:

    On 07/22/2020 03:04 PM, Daniel Pearson wrote:

    >> b) you can disable tcache for this Container
    >> memcg::memory.disable_cleancache
    >>       (raise your hand if you wish me to explain what tcache is)
    >
    > I'm all for additional information as it can help to form proper
    > opinions if you don't mind providing it.

    Hope after reading it you'll catch yourself on an idea that now
    you are aware of one more
    small feature which makes VZ is really cool and that there are a
    lot of things which
    just work somewhere in the background simply (and silently)
    making it possible for you
    to utilize the hardware at maximum. :)

    Tcache
    ======

    Brief tech explanation:
    =======================
    Transcendent file cache (tcache) is a driver for cleancache
    https://www.kernel.org/doc/html/v4.18/vm/cleancache.html ,
    which stores reclaimed pages in memory unmodified. Its purpose it to
    adopt pages evicted from a memory cgroup on _local_ pressure
    (inside a Container),
    so that they can be fetched back later without costly disk accesses.

    Detailed explanation:
    =====================
    Tcache is intended increase the overall Hardware Node performance
    only

Intented "to"

    on undercommitted Nodes, i.e. sum of all Containers memory limits
    on the Node

i.e. "where total sum of all Containers memory limit values placed on the Node"

    is less than Hardware Node RAM size.

    Imagine a situation: you have a Node with 1Tb of RAM,
    you run 500 Containers on it limited by 1Gb of memory each (no
    swap for simplicity).
    Let's consider Container to be more or less identical, similar
    load, similar activity inside.
    => normally those Containers must use 500Gb of physical RAM at
    max, right,
    and 500Gb will be just free on the Node.

    You think it's simple situation - ok, the node is underloaded,
    let's put more Containers there,
    but that's not always true - it depends on what is the bottleneck
    on the Node,
    which depends on real workload of Containers running on the Node.
    But most often in real life - the disk becomes the bottleneck
    first, not the RAM, not the CPU.

    Example: let's assume all those Containers run, say, cPanel,
    which by default collect some stats
    every, say, 15 minutes - the stat collection process is run via
    crontab.

    (Side note: randomizing times of crontab jobs - is a good idea,
    but who usually does this
    for Containers? We did it for application templates we shipped in
    Virtuozzo, but lot of
    software is just installed and configured inside Containers, we
    cannot do this. And often
    Hosting Providers are not allowed to touch data in Containers -
    so most often cron jobs are
    not randomized.)

    Ok, it does not matter how, but let's assume we get such a
    workload - every, say, 15 minutes
    (it's important that data access it quite rare), each Container
    accesses many small files,
    let it be just 100 small files to gather stats and save it somewhere.
    In 500 Containers. Simultaneously.
    In parallel with other regular i/o workload.
    On HDDs.

    It's nightmare for disk subsystem, you know,  if an HDD provides
    100 IOPS,
    it will take 50000/100/60 = 8.(3) minutes(!) to handle.
    OK, there could be RAID, let it is able to handle 300 IOPS, it
    results in
    2.(7) minutes, and we forgot about other regular i/o,
    so it means every 15 minutes, the Node became almost unresponsive
    for several minutes
    until it handles all that random i/o generated by stats collection.

    You can ask - but why _every_ 15 minutes? You've read once a file
    and it resides in the
    Container pagecache!
    That's true, but here comes _15 minutes_ period. The larger
    period - the worse.
    If a Container is active enough, it just reads more and more
    files - website data,
    pictures, video clips, files of a fileserver, don't know.
    The thing is in 15 minutes it's quite possible a Container reads
    more than its RAM limit
    (remember - only 1Gb in our case!), and thus all old pagecache is
    dropped, substituted
    with the fresh one.
    And thus in 15 minutes it's quite possible you'll have to read
    all those 100 files in each
    Container from disk.

    And here comes tcache to save us: let's don't completely drop
    pagecache which is
    reclaimed from a Container (on local(!) reclaim), but save this
    pagecache in
    a special cache (tcache) on the Host in case there is free RAM on
    the Host.

    And in 15 minutes when all Containers start to access lot of
    small files again -
    those files data will be get back into Container pagecache
    without reading from
    physical disk - viola, we saves IOPS, no Node stuck anymore.

    Q: can a Container be so active (i.e. read so much from disk)
    that this "useful"
    pagecache is dropped even from tcache.

missing question mark - ?

    A: Yes. But tcache extends the "safe" period.

    Q: mainstream? LXC/Proxmox?
    A: No, it's Virtuozzo/OpenVZ specific.
        "cleancache" - the base for tcache it in mainstream, it's
    used for Xen.
        But we (VZ) wrote a driver for it and use it for Containers
    as well.

    Q: i use SSD, not HDD, does tcache help me?
    A: SSD can provide much more IOPS, thus the Node's performance
    increase caused by tcache
        is less, but still reading from RAM (tcache is in RAM) is
    faster than reading from SSD.

is less "significant"



    >> c) you can limit the max amount of memory which can be used for
    >> pagecache for this Container
    >>       memcg::memory.cache.limit_in_bytes
    >
    > This seems viable to test as well. Currently it seems to be
    utilizing a
    > high number 'unlimited' default. I assume the only way to set
    this is to
    > directly interact with the memory cgroup and not via a standard ve
    > config value?

    Yes, you are right.
    We use this setting for some internal system cgroups running
    processes
    which are known to generate a lot of pagecache which won't be
    used later for sure.

     From my perspective it's not fair to apply such a setting to a
    Container
    globally - well, CT owner pay for an amount of RAM, it should be
    able to use
    this RAM for whatever he wants to - even for pagecache,
    so limiting the pagecache for a Container is not a tweak we is
    advised to be used
    against a Container => no standard config parameter.

    Note: disabling tcache for a Container is completely fair,
    you disable just an optimization for the whole Hardware Node
    performance,
    but all RAM configured for a Container - is still available to
    the Container.
    (but also no official config value for that - most often it
    helps, not hurts)


    > I assume regardless if we utilized vSwap or not, we would
    likely still
    > experience these additional swapping issues, presumably from
    pagecache
    > applications, or would the usage of vSwap intercept some of
    these items
    > thus preventing them from being swapped to disk?

    vSwap - is the optimization for swapping process _local to a
    Container_,
    it can prevent some Container anonymous pages to be written to
    the physical swap,
    if _local_ Container reclaim decides to swapout something.

    At the moment you experience swapping on the Node level.
    Even if some Container's processes are put to the physical swap,
    it's a decision of the global reclaim mechanism,
    so it's completely unrelated to vSwap =>
    even if you assign some swappages to Containers and thus enable
    vSwap for those Containers,
    i should not influence anyhow on global Node level memory
    pressure and
    will not result in any difference in the swapping rate into
    physical swap.

    Hope that helps.

    --
    Best regards,

    Konstantin Khorenko,
    Virtuozzo Linux Kernel Team
    _______________________________________________
    Users mailing list
    [email protected] <mailto:[email protected]>
    https://lists.openvz.org/mailman/listinfo/users



--
Best regards,
[COOLCOLD-RIPN]




--
Sincerely
Daniel C Pearson
COO KnownHost, LLC
https://www.knownhost.com

