On Tue, Mar 23, 2021 at 6:47 AM Vivek Goyal <[email protected]> wrote:
> On Tue, Mar 23, 2021 at 12:55:26PM +0100, Sergio Lopez wrote:
> > On Mon, Mar 22, 2021 at 12:47:04PM -0400, Vivek Goyal wrote:
> > > On Mon, Mar 22, 2021 at 05:09:32PM +0100, Miklos Szeredi wrote:
> > > > On Mon, Mar 22, 2021 at 6:52 AM Eric Ernst <[email protected]> wrote:
> > > > >
> > > > > Hey ya’ll,
> > > > >
> > > > > One challenge I’ve been looking at is how to set up an appropriate memory cgroup limit for workloads that are leveraging virtiofs (i.e., running pods with Kata Containers). I noticed that memory usage of the daemon itself can grow considerably depending on the workload, and by much more than I’d expect.
> > > > >
> > > > > I’m running a workload that simply runs a build of the kernel sources with -j3. In doing this, the sources of the Linux kernel are shared via virtiofs (no DAX), so as the build goes on, a lot of files are opened, closed and created. The rss memory of virtiofsd grows into several hundreds of MBs.
> > > > >
> > > > > When taking a look, I suspect that virtiofsd is carrying out the opens but never actually closing the fds. In the guest, I’m seeing fds on the order of 10-40 for all the container processes as the build runs, whereas the number of fds held by virtiofsd keeps increasing, reaching over 80,000 fds. I’m guessing this isn’t expected?
> > > >
> > > > The reason could be that the guest is keeping a ref on the inodes (dcache->dentry->inode) and the current implementation of the server keeps an O_PATH fd open for each inode referenced by the client.
> > > >
> > > > One way to avoid this is to use the "cache=none" option, which forces the client to drop dentries immediately from the cache if not in use. This is not desirable if the cache is actually in use.
> > > >
> > > > The memory use of the server should still be limited by the memory use of the guest: if there's memory pressure in the guest kernel, then it will clean out caches, which results in the memory use decreasing in the server as well. If the server memory use looks unbounded, that might be indicative of too much memory used for dcache in the guest (cat /proc/slabinfo | grep ^dentry). Can you verify?
> > >
> > > Hi Miklos,
> > >
> > > Apart from the above, we identified one more issue on IRC. I asked Eric to drop caches manually in the guest (echo 3 > /proc/sys/vm/drop_caches) and while it reduced the number of open fds, it did not seem to free up a significant amount of memory.
> > >
> > > So the question remains: where is that memory? One possibility is that we have memory allocated for the mapping arrays (inode and fd). These arrays only grow and never shrink, so they can lock down some memory.
> > >
> > > But still, a lot of lo_inode memory should have been freed when echo 3 > /proc/sys/vm/drop_caches was done. Why all of that did not show up in virtiofsd RSS usage is a little confusing.
> >
> > Are you including "RssShmem" in "RSS usage"? If so, that could be misleading. When virtiofsd[-rs] touches pages that reside in the memory mapping that's shared with QEMU, those pages are accounted in the virtiofsd[-rs] process's RssShmem too.
> >
> > In other words, the RSS value of the virtiofsd[-rs] process may be inflated because it includes pages that are actually shared with the QEMU process (there is no second copy of them).
> >
> > This can be observed using a tool like "smem". Here's an example:
> >
> > - This virtiofsd-rs process appears to have an RSS of ~633 MiB:
> >
> >   root 13879 46.1  7.9 8467492  649132 pts/1 Sl+ 11:33 0:52 ./target/debug/virtiofsd-rs
> >   root 13947 69.3 13.4 5638580 1093876 pts/0 Sl+ 11:33 1:14 qemu-system-x86_64
> >
> > - In /proc/13879/status we can observe that most of that memory is actually RssShmem:
> >
> >   RssAnon:     9624 kB
> >   RssFile:     5136 kB
> >   RssShmem:  634372 kB
>
> Hi Sergio,
>
> Thanks for this observation about RssShmem. I also ran virtiofsd and observed its memory usage just now, and it indeed looks like only the RssShmem usage is very high:
>
>   RssAnon:     4884 kB
>   RssFile:     1900 kB
>   RssShmem: 1050244 kB
>
> And as you point out, this memory is being shared with QEMU. So it looks like, from the cgroup point of view, we should put virtiofsd and qemu in the same cgroup and give them a combined memory limit, so that the accounting of this shared memory looks proper.
>
> Eric, does this sound reasonable?

Sergio, Vivek -- Today QEMU/virtiofsd do live within the same memory cgroup, and are bound by that same overhead I need to introduce. Good to know regarding the sharing (this restores some sanity to my observations, thank you!), but the real crux of the problem is two items:

1) the FDs are held long after the application in the guest is done with them, because of the dentry cache in the guest (when cache=auto for virtiofsd).
2) virtiofsd/QEMU is holding on to the memory after the fds are released.

--Eric

> Thanks
> Vivek
>
> > - In "smem", we can see a similar amount of RSS, but the PSS is roughly half that size because "smem" splits it up between virtiofsd-rs and QEMU:
> >
> >   [root@localhost ~]# smem -P virtiofsd-rs -P qemu
> >     PID User  Command                      Swap     USS     PSS     RSS
> >   13879 root  ./target/debug/virtiofsd-rs     0   13412  337019  662392
> >   13947 root  qemu-system-x86_64 -enable-     0  434224  760096 1094392
> >
> > - If we terminate the virtiofsd-rs process, the output of "smem" now shows that QEMU's PSS has grown to account for the PSS that was previously assigned to virtiofsd-rs too, so we can confirm that it was memory shared between both processes:
> >
> >     PID User  Command                      Swap     USS     PSS     RSS
> >   13947 root  qemu-system-x86_64 -enable-     0 1082656 1084966 1095692
> >
> > Just to be 100% sure, I've also run "heaptrack" on a virtiofsd-rs instance, and can confirm that the actual heap usage of the process was around 5-6 MiB.
> >
> > Sergio.
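[Editor's note: for anyone who wants to try the combined-cgroup idea Vivek raises above, the following is a minimal sketch only, assuming cgroup v2 mounted at /sys/fs/cgroup and that nothing else (Kata runtime, systemd) already manages the placement. The cgroup name, the $QEMU_PID/$VIRTIOFSD_PID variables and the 2G value are hypothetical placeholders, not anything from the thread.]

  # Enable the memory controller for children of the root cgroup, if needed.
  echo +memory > /sys/fs/cgroup/cgroup.subtree_control

  # Create one cgroup holding both the VMM and its virtiofsd instance.
  mkdir /sys/fs/cgroup/vm-with-virtiofsd
  echo "$QEMU_PID"      > /sys/fs/cgroup/vm-with-virtiofsd/cgroup.procs
  echo "$VIRTIOFSD_PID" > /sys/fs/cgroup/vm-with-virtiofsd/cgroup.procs

  # A single combined limit: pages of the shared memory backend are charged
  # to this one cgroup no matter which of the two processes touches them
  # first, so one limit covers the shared guest RAM.
  echo 2G > /sys/fs/cgroup/vm-with-virtiofsd/memory.max

With both processes in one cgroup, memory.current should count each shared page once, which lines up better with the PSS view that smem gives above than the per-process RSS numbers do.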
_______________________________________________
Virtio-fs mailing list
[email protected]
https://listman.redhat.com/mailman/listinfo/virtio-fs
