On Tue, Mar 23, 2021 at 07:52:22AM -0700, Eric Ernst wrote:
> On Tue, Mar 23, 2021 at 6:47 AM Vivek Goyal <[email protected]> wrote:
> > On Tue, Mar 23, 2021 at 12:55:26PM +0100, Sergio Lopez wrote:
> > > On Mon, Mar 22, 2021 at 12:47:04PM -0400, Vivek Goyal wrote:
> > > > On Mon, Mar 22, 2021 at 05:09:32PM +0100, Miklos Szeredi wrote:
> > > > > On Mon, Mar 22, 2021 at 6:52 AM Eric Ernst <[email protected]> wrote:
> > > > > >
> > > > > > Hey ya’ll,
> > > > > >
> > > > > > One challenge I’ve been looking at is how to set up an appropriate
> > > > > > memory cgroup limit for workloads that are leveraging virtiofs
> > > > > > (i.e., running pods with Kata Containers). I noticed that the memory
> > > > > > usage of the daemon itself can grow considerably depending on the
> > > > > > workload; much more than I’d expect.
> > > > > >
> > > > > > I’m running a workload that simply builds the kernel sources with
> > > > > > -j3. The Linux kernel sources are shared via virtiofs (no DAX), so
> > > > > > as the build goes on, a lot of files are opened, closed, and
> > > > > > created. The RSS of virtiofsd grows into several hundreds of MBs.
> > > > > >
> > > > > > Taking a look, I suspect that virtiofsd is carrying out the opens
> > > > > > but never actually closing the fds. In the guest, I’m seeing fds on
> > > > > > the order of 10-40 for all the container processes as the build
> > > > > > runs, whereas the number of fds held by virtiofsd continually
> > > > > > increases, reaching over 80,000. I’m guessing this isn’t expected?
> > > > >
> > > > > The reason could be that the guest is keeping a ref on the inodes
> > > > > (dcache->dentry->inode) and the current implementation of the server
> > > > > keeps an O_PATH fd open for each inode referenced by the client.
> > > > >
> > > > > One way to avoid this is to use the "cache=none" option, which forces
> > > > > the client to drop dentries immediately from the cache if not in use.
> > > > > This is not desirable if the cache is actually in use.
> > > > >
> > > > > The memory use of the server should still be limited by the memory
> > > > > use of the guest: if there's memory pressure in the guest kernel, it
> > > > > will clean out caches, which results in the memory use decreasing in
> > > > > the server as well. If the server memory use looks unbounded, that
> > > > > might be indicative of too much memory used for the dcache in the
> > > > > guest (cat /proc/slabinfo | grep ^dentry). Can you verify?
> > > >
> > > > Hi Miklos,
> > > >
> > > > Apart from the above, we identified one more issue on IRC. I asked Eric
> > > > to drop caches manually in the guest
> > > > (echo 3 > /proc/sys/vm/drop_caches), and while that reduced the number
> > > > of open fds, it did not seem to free up a significant amount of memory.
> > > >
> > > > So the question remains: where is that memory? One possibility is the
> > > > memory allocated for the mapping arrays (inode and fd). These arrays
> > > > only grow and never shrink, so they can lock down some memory.
> > > >
> > > > But still, a lot of lo_inode memory should have been freed when
> > > > echo 3 > /proc/sys/vm/drop_caches was done. Why all of that did not
> > > > show up in virtiofsd's RSS usage is a little confusing.
> > >
> > > Are you including "RssShmem" in "RSS usage"? If so, that could be
> > > misleading. When virtiofsd[-rs] touches pages that reside in the
> > > memory mapping that's shared with QEMU, those pages are accounted
> > > in the virtiofsd[-rs] process's RssShmem too.
> > >
> > > In other words, the RSS value of the virtiofsd[-rs] process may be
> > > overinflated, because it includes pages that are actually shared
> > > with the QEMU process (there's no second copy of them).
> > >
> > > This can be observed using a tool like "smem". Here's an example:
> > >
> > > - This virtiofsd-rs process appears to have an RSS of ~633 MiB:
> > >
> > >   root 13879 46.1  7.9 8467492  649132 pts/1 Sl+ 11:33 0:52 ./target/debug/virtiofsd-rs
> > >   root 13947 69.3 13.4 5638580 1093876 pts/0 Sl+ 11:33 1:14 qemu-system-x86_64
> > >
> > > - In /proc/13879/status we can observe that most of that memory is
> > >   actually RssShmem:
> > >
> > >   RssAnon:     9624 kB
> > >   RssFile:     5136 kB
> > >   RssShmem:  634372 kB
> >
> > Hi Sergio,
> >
> > Thanks for this observation about RssShmem. I also ran virtiofsd and
> > observed its memory usage just now, and indeed it looks like only the
> > RssShmem usage is very high.
> >
> >   RssAnon:     4884 kB
> >   RssFile:     1900 kB
> >   RssShmem: 1050244 kB
> >
> > And, as you point out, this memory is being shared with QEMU. So it
> > looks like, from a cgroup point of view, we should put virtiofsd and
> > QEMU in the same cgroup with a combined memory limit, so that the
> > accounting for this shared memory comes out right.
> >
> > Eric, does this sound reasonable?
>
> Sergio, Vivek --
>
> Today QEMU/virtiofsd do live within the same memory cgroup, and are
> bound by that same limit I need to introduce. Good to know regarding
> the sharing (this restores some sanity to my observations, thank
> you!), but the real crux of the problem is two items:
> 1) the fds are held long after the application in the guest is done
>    with them, because of the dentry cache in the guest (when
>    cache=auto for virtiofsd).
> 2) virtiofsd/QEMU is holding on to the memory after the fds are
>    released.
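Regarding 1), that's inherent in how the C virtiofsd currently works:
each lookup the guest sends pins an O_PATH fd on the host, and that fd
stays open for as long as the guest keeps the inode referenced (i.e.,
for as long as the dentry stays cached). Roughly, as a simplified
sketch of the passthrough_ll.c lookup path (refcounting and error
handling elided):

  #define _GNU_SOURCE
  #include <fcntl.h>

  /*
   * Simplified sketch: on each FUSE_LOOKUP the server opens an O_PATH
   * fd for the looked-up inode and keeps it open until the guest sends
   * FUSE_FORGET for that inode. With cache=auto the guest's dentry
   * cache can keep inodes referenced long after the application has
   * closed them, so these fds pile up.
   */
  static int pin_inode_fd(int parent_fd, const char *name)
  {
          return openat(parent_fd, name, O_PATH | O_NOFOLLOW);
  }

That's why echo 3 > /proc/sys/vm/drop_caches in the guest brought the
fd count down: the forgets finally arrived and the O_PATH fds got
closed.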
As for 2): which memory are you specifically concerned about? I am
assuming you are referring to "RssAnon". I think this number stays high
even after unmounting virtiofs because glibc has not released the
memory back to the system. It probably keeps it around so that future
allocations can be served faster.

To confirm this, I patched virtiofsd to call malloc_trim(0) in
lo_destroy() (a sketch of the change is at the end of this mail). This
should force glibc to return heap memory to the system (whatever is
possible), and it did lead to reduced RssAnon.

After compilation finished:

  RssAnon:    11936 kB
  RssFile:     1964 kB
  RssShmem: 1036232 kB

After unmounting virtiofs (with malloc_trim(0)):

  RssAnon:     3428 kB
  RssFile:     1968 kB
  RssShmem: 1037036 kB

So RssAnon usage dropped by roughly 75% with a call to malloc_trim().
This could be improved further if our mapping arrays could be shrunk
too. But the point is that these numbers look small enough not to worry
about: for a kernel compilation, peak RssAnon was 12 MB in this run,
and that does not sound too bad to me.

Thanks
Vivek

> --Eric
>
> > Thanks
> > Vivek
> >
> > > - In "smem", we can see a similar amount of RSS, but the PSS is
> > >   roughly half the size, because "smem" is splitting it up between
> > >   virtiofsd-rs and QEMU:
> > >
> > >   [root@localhost ~]# smem -P virtiofsd-rs -P qemu
> > >     PID User     Command                      Swap     USS     PSS     RSS
> > >   13879 root     ./target/debug/virtiofsd-rs     0   13412  337019  662392
> > >   13947 root     qemu-system-x86_64 -enable-     0  434224  760096 1094392
> > >
> > > - If we terminate the virtiofsd-rs process, the output of "smem"
> > >   now shows that QEMU's PSS has grown to account for the PSS that
> > >   was previously assigned to virtiofsd-rs, so we can confirm that
> > >   was memory shared between both processes.
> > >
> > >     PID User     Command                      Swap     USS     PSS     RSS
> > >   13947 root     qemu-system-x86_64 -enable-     0 1082656 1084966 1095692
> > >
> > > Just to be 100% sure, I've also run "heaptrack" on a virtiofsd-rs
> > > instance, and can confirm that the actual heap usage of the process
> > > was around 5-6 MiB.
> > >
> > > Sergio.
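P.S. For reference, the malloc_trim() experiment mentioned above
amounts to roughly this (just a sketch; the real lo_destroy() in
passthrough_ll.c also tears down the inode and mapping state):

  #include <malloc.h>

  /*
   * Sketch of the experiment: when the filesystem is torn down, ask
   * glibc to give free heap pages back to the kernel instead of
   * caching them for future allocations. The argument is the amount of
   * free space to leave at the top of the heap; 0 releases as much as
   * possible.
   */
  static void lo_destroy(void *userdata)
  {
          (void)userdata;  /* cleanup of inodes and mappings elided */
          malloc_trim(0);
  }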
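P.P.S. On why these pages show up as RssShmem in the first place: the
guest's RAM is handed to virtiofsd as a file descriptor over the
vhost-user socket, and the daemon maps it shared. Schematically
(guest_mem_fd and region_size are made-up names for illustration):

  #include <sys/mman.h>

  /*
   * Sketch only: guest RAM arrives as an fd over vhost-user and is
   * mapped MAP_SHARED. Any page the daemon touches in this mapping is
   * counted in its RssShmem, even though there is only one copy,
   * shared with QEMU, hence the inflated-looking RSS.
   */
  static void *map_guest_ram(int guest_mem_fd, size_t region_size)
  {
          return mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, guest_mem_fd, 0);
  }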
