On Mon, Feb 2, 2015 at 11:46 AM, Tejun Heo <t...@kernel.org> wrote: > Hey, > > On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote: > >> Keeping shared inodes in common ancestor is reasonable. >> We could schedule asynchronous moving when somebody opens or mmaps >> inode from outside of its current cgroup. But it's not clear when >> inode should be moved into opposite direction: when inode should >> become private and how detect if it's no longer shared. >> >> For example each inode could keep yet another pointer to memcg where >> it will track subtree of cgroups where it was accessed in past 5 >> minutes or so. And sometimes that informations goes into moving thread. >> >> Actually I don't see other options except that time-based estimation: >> tracking all cgroups for each inode is too expensive, moving pages >> from one lru to another is expensive too. So, moving inodes back and >> forth at each access from the outside world is not an option. >> That should be rare operation which runs in background or in reclaimer. > > Right, what strategy to use for migration is up for debate, even for > moving to the common ancestor. e.g. should we do that on the first > access? In the other direction, it get more interesting. Let's say > if we decide to move back an inode to a descendant, what if that > triggers OOM condition? Do we still go through it and cause OOM in > the target? Do we even want automatic moving in this direction? > > For explicit cases, userland can do FADV_DONTNEED, I suppose. > > Thanks. > > -- > tejun
I don't have any killer objections, most of my worries are isolation concerns. If a machine has several top level memcg trying to get some form of isolation (using low, min, soft limit) then a shared libc will be moved to the root memcg where it's not protected from global memory pressure. At least with the current per page accounting such shared pages often land into some protected memcg. If two cgroups collude they can use more memory than their limit and oom the entire machine. Admittedly the current per-page system isn't perfect because deleting a memcg which contains mlocked memory (referenced by a remote memcg) moves the mlocked memory to root resulting in the same issue. But I'd argue this is more likely with the RFC because it doesn't involve the cgroup deletion/reparenting. A possible tweak to shore up the current system is to move such mlocked pages to the memcg of the surviving locker. When the machine is oom it's often nice to examine memcg state to determine which container is using the memory. Tracking down who's contributing to a shared container is non-trivial. I actually have a set of patches which add a memcg=M mount option to memory backed file systems. I was planning on proposing them, regardless of this RFC, and this discussion makes them even more appealing. If we go in this direction, then we'd need a similar notion for disk based filesystems. As Konstantin suggested, it'd be really nice to specify charge policy on a per file, or directory, or bind mount basis. This allows shared files to be deterministically charged to a known container. We'd need to flesh out the policies: e.g. if two bind mound each specify different charge targets for the same inode, I guess we just pick one. Though the nature of this catch-all shared container is strange. Presumably a machine manager would need to create it as an unlimited container (or at least as big as the sum of all shared files) so that any app which decided it wants to mlock all shared files has a way to without ooming the shared container. In the current per-page approach it's possible to lock shared libs. But the machine manager would need to decide how much system ram to set aside for this catch-all shared container. When there's large incidental sharing, then things get sticky. A periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in a small container would pull all pages to the root memcg where they are exposed to root pressure which breaks isolation. This is concerning. Perhaps the such accesses could be decorated with (O_NO_MOVEMEM). So this RFC change will introduce significant change to user space machine managers and perturb isolation. Is the resulting system better? It's not clear, it's the devil know vs devil unknown. Maybe it'd be easier if the memcg's I'm talking about were not allowed to share page cache (aka copy-on-read) even for files which are jointly visible. That would provide today's interface while avoiding the problematic sharing. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/