On 07/23/2020 06:34 PM, CoolCold wrote:
Hello!

1st - great work guys! Dealing with LXC and even LXD makes me miss my good old
OpenVZ box and its technical excellence! Keep going!
2nd - my 2 cents on the content - I'm not a native speaker, but I still suggest
some small fixes.

1. Thank you very much for the feedback!
And you are very welcome to come back from LXC to OpenVZ. :)

2. And many thanks for the content corrections!
I've just created a wiki page for tcache - I decided this info should be saved
somewhere publicly available.
I've also added a section on how to enable/disable tcache for Containers.

And you are very welcome to edit the wiki page as well. :)

https://wiki.openvz.org/Tcache
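
For reference, enabling/disabling tcache per Container boils down to writing
the memcg::memory.disable_cleancache knob. A minimal Python sketch of the idea
(the machine.slice cgroup path and the CT UUID below are my assumptions - see
the wiki section for the exact steps):

from pathlib import Path

def set_tcache(ct_uuid: str, enabled: bool) -> None:
    # Virtuozzo-specific memcg knob: writing "1" disables cleancache
    # (and therefore tcache) for this Container, "0" enables it back.
    knob = (Path("/sys/fs/cgroup/memory/machine.slice") / ct_uuid
            / "memory.disable_cleancache")
    knob.write_text("0" if enabled else "1")

set_tcache("d35adf39-example-uuid", enabled=False)  # hypothetical CT UUID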

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team


On Thu, Jul 23, 2020 at 9:52 PM Konstantin Khorenko <[email protected]> wrote:

    On 07/22/2020 03:04 PM, Daniel Pearson wrote:

    >> b) you can disable tcache for this Container
    >> memcg::memory.disable_cleancache
    >>       (raise your hand if you wish me to explain what tcache is)
    >
    > I'm all for additional information as it can help to form proper
    > opinions if you don't mind providing it.

    Hope that after reading it you'll catch yourself thinking that now you are
    aware of one more small feature which makes VZ really cool, and that there
    are a lot of things which just work somewhere in the background, simply (and
    silently) making it possible for you to utilize the hardware at maximum. :)

    Tcache
    ======

    Brief tech explanation:
    =======================
    Transcendent file cache (tcache) is a driver for cleancache
    https://www.kernel.org/doc/html/v4.18/vm/cleancache.html,
    which stores reclaimed pages in memory unmodified. Its purpose is to
    adopt pages evicted from a memory cgroup on _local_ pressure (inside a Container),
    so that they can be fetched back later without costly disk accesses.

    Detailed explanation:
    =====================
    Tcache is intended to increase the overall Hardware Node performance only

Intended "to"

    on undercommitted Nodes, i.e. where the total sum of all Containers' memory
    limit values placed on the Node

i.e. "where total sum of all Containers memory limit values placed on the Node"

    is less than the Hardware Node RAM size.

    Imagine a situation: you have a Node with 1Tb of RAM,
    and you run 500 Containers on it, limited to 1Gb of memory each (no swap for simplicity).
    Let's consider the Containers to be more or less identical: similar load, similar activity inside.
    => normally those Containers will use 500Gb of physical RAM at max,
    and 500Gb will be just free on the Node.

    You might think it's a simple situation - OK, the Node is underloaded, let's put more Containers there -
    but that's not always true: it depends on what the bottleneck on the Node is,
    which depends on the real workload of the Containers running on the Node.
    And most often in real life the disk becomes the bottleneck first, not the RAM, not the CPU.

    Example: let's assume all those Containers run, say, cPanel, which by default collects some stats
    every, say, 15 minutes - the stats collection process is run via crontab.

    (Side note: randomizing the times of crontab jobs is a good idea, but who usually does this
    for Containers? We did it for the application templates we shipped in Virtuozzo, but a lot of
    software is just installed and configured inside Containers, and we cannot do it there. And often
    Hosting Providers are not allowed to touch data in Containers - so most often cron jobs are
    not randomized.)

    OK, it does not matter how, but let's assume we get such a workload: every, say, 15 minutes
    (it's important that the data access is quite rare), each Container accesses many small files -
    let it be just 100 small files - to gather stats and save them somewhere.
    In 500 Containers. Simultaneously.
    In parallel with other regular i/o workload.
    On HDDs.

    It's a nightmare for the disk subsystem, you know: 500 Containers * 100 files = 50,000 random reads;
    if an HDD provides 100 IOPS, it will take 50000/100/60 = 8.(3) minutes(!) to handle them.
    OK, there could be a RAID; let it be able to handle 300 IOPS, which results in
    2.(7) minutes, and we forgot about the other regular i/o.
    So it means that every 15 minutes the Node becomes almost unresponsive for several minutes
    until it handles all that random i/o generated by the stats collection.
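
    Just to double-check the arithmetic, a few lines of Python:

    # Back-of-the-envelope check of the numbers above.
    containers, files_each = 500, 100
    reads = containers * files_each            # 50,000 random reads
    for iops in (100, 300):                    # a single HDD vs. a small RAID
        print(f"{iops} IOPS -> {reads / iops / 60:.1f} minutes")
    # prints: 100 IOPS -> 8.3 minutes
    #         300 IOPS -> 2.8 minutes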

    You may ask - but why _every_ 15 minutes? You've read a file once and it resides in the
    Container pagecache!
    That's true, but here the _15 minutes_ period comes into play. The larger the period - the worse.
    If a Container is active enough, it just reads more and more files - website data,
    pictures, video clips, fileserver files, whatever.
    The thing is, in 15 minutes it's quite possible that a Container reads more than its RAM limit
    (remember - only 1Gb in our case!), and thus all the old pagecache is dropped, replaced
    with the fresh one.
    And so in 15 minutes it's quite possible you'll have to read all those 100 files in each
    Container from disk again.

    And here tcache comes to save us: let's not completely drop the pagecache which is
    reclaimed from a Container (on local(!) reclaim), but save it in
    a special cache (tcache) on the Host, as long as there is free RAM on the Host.
    And in 15 minutes, when all the Containers start to access their lots of small files again,
    that file data will be fetched back into the Container pagecache without reading from the
    physical disk - voila, we save the IOPS, and the Node doesn't get stuck anymore.
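
    If it helps, here is a toy model of that logic in Python - just an
    illustration, of course, not the kernel implementation:

    # Toy model: pages evicted from a Container's pagecache on *local*
    # reclaim are adopted by a host-wide cache ("tcache") and can be
    # fetched back later without touching the disk.
    from collections import OrderedDict

    HOST_TCACHE = OrderedDict()     # (ct_id, page_key) -> page data
    TCACHE_CAPACITY = 100_000       # "free host RAM", in pages

    class ContainerPagecache:
        def __init__(self, ct_id, limit_pages):
            self.ct_id, self.limit = ct_id, limit_pages
            self.pages = OrderedDict()          # local LRU pagecache

        def read(self, key, read_from_disk):
            if key in self.pages:               # local pagecache hit
                self.pages.move_to_end(key)
                return self.pages[key]
            # local miss: try tcache first, fall back to the (slow) disk
            data = HOST_TCACHE.pop((self.ct_id, key), None)
            if data is None:
                data = read_from_disk(key)
            self.pages[key] = data
            if len(self.pages) > self.limit:    # local reclaim: evict LRU page
                old_key, old_page = self.pages.popitem(last=False)
                HOST_TCACHE[(self.ct_id, old_key)] = old_page  # adopt into tcache
                while len(HOST_TCACHE) > TCACHE_CAPACITY:      # tcache is best-effort
                    HOST_TCACHE.popitem(last=False)
            return data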

    Q: can a Container be so active (i.e. read so much from disk) that this "useful"
    pagecache is dropped even from tcache?

missing question mark - ?

    A: Yes. But tcache extends the "safe" period.

    Q: mainstream? LXC/Proxmox?
    A: No, it's Virtuozzo/OpenVZ specific.
        "cleancache" - the base for tcache - is in mainstream; it's used for Xen.
        But we (VZ) wrote a driver for it and use it for Containers as well.

    Q: I use an SSD, not an HDD - does tcache help me?
    A: An SSD can provide many more IOPS, thus the Node's performance increase caused by tcache
        is less significant, but still reading from RAM (tcache is in RAM) is faster than
        reading from an SSD.
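
    A quick Python comparison, with rough, assumed order-of-magnitude latencies
    (illustrative numbers only, not measurements from a real Node):

    # assumed typical latencies per random 4K read, in microseconds
    ram_us, ssd_us, hdd_us = 0.1, 100.0, 10_000.0
    print(f"RAM (tcache) vs SSD: ~{ssd_us / ram_us:,.0f}x faster")
    print(f"RAM (tcache) vs HDD: ~{hdd_us / ram_us:,.0f}x faster")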

is less "significant"



    >> c) you can limit the max amount of memory which can be used for
    >> pagecache for this Container
    >>       memcg::memory.cache.limit_in_bytes
    >
    > This seems viable to test as well. Currently it seems to be utilizing a
    > high number 'unlimited' default. I assume the only way to set this is to
    > directly interact with the memory cgroup and not via a standard ve
    > config value?

    Yes, you are right.
    We use this setting for some internal system cgroups running processes
    which are known to generate a lot of pagecache which won't be used later 
for sure.
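
    If you want to experiment with it, a minimal Python sketch (the cgroup
    mount point and the cgroup name below are my assumptions - adjust for your Node):

    from pathlib import Path

    def limit_pagecache(memcg: str, limit_bytes: int) -> None:
        # Virtuozzo-specific memcg knob capping the pagecache for this cgroup.
        knob = Path("/sys/fs/cgroup/memory") / memcg / "memory.cache.limit_in_bytes"
        knob.write_text(str(limit_bytes))

    # hypothetical system cgroup, capped at 256Mb of pagecache
    limit_pagecache("system.slice/backup.service", 256 * 1024 * 1024)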

    From my perspective it's not fair to apply such a setting to a Container
    globally - well, the CT owner pays for an amount of RAM, so he should be able to use
    this RAM for whatever he wants - even for pagecache.
    So limiting the pagecache is not a tweak we advise to be used
    against a Container => no standard config parameter.

    Note: disabling tcache for a Container is completely fair -
    you disable just an optimization for the overall Hardware Node performance,
    but all the RAM configured for the Container is still available to the Container.
    (But there is also no official config value for that - most often tcache helps, not hurts.)


    > I assume regardless if we utilized vSwap or not, we would likely still
    > experience these additional swapping issues, presumably from pagecache
    > applications, or would the usage of vSwap intercept some of these items
    > thus preventing them from being swapped to disk?

    vSwap is an optimization of the swapping process _local to a Container_:
    it can prevent some of a Container's anonymous pages from being written to the physical swap,
    if the _local_ Container reclaim decides to swap something out.

    At the moment you experience swapping on the Node level.
    Even if some Container's processes are put into the physical swap,
    it's a decision of the global reclaim mechanism,
    so it's completely unrelated to vSwap =>
    even if you assign some swappages to Containers and thus enable vSwap for those Containers,
    it should not influence the global Node-level memory pressure in any way and
    will not result in any difference in the swapping rate into the physical swap.

    Hope that helps.

    --
    Best regards,

    Konstantin Khorenko,
    Virtuozzo Linux Kernel Team



--
Best regards,
[COOLCOLD-RIPN]


_______________________________________________
Users mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/users
