On Mon, Apr 15 2019, Jacek Tomaka wrote:

> Thanks Patrick for getting the ball rolling!
>
>> 1/ w.r.t drop_caches, "2" is *not* "inode and dentry".  The '2' bit
>> causes all registered shrinkers to be run, until they report there is
>> nothing left that can be discarded.  If this is taking 10 minutes,
>> then it seems likely that some shrinker is either very inefficient, or
>> is reporting that there is more work to be done, when really there
>> isn't.
>
> This is a pretty common problem on this hardware. KNL's CPU runs at
> ~1.3GHz, so anything that is not multi-threaded can take several times
> longer than on a "normal" Xeon. While it would be nice to improve this
> (by running it in multiple threads), this is not the problem here.
> However, I can provide you with a kernel call stack the next time I
> see it, if you are interested.
That would be interesting.
About a dozen copies of "cat /proc/$PID/stack" taken in quick
succession would be best, where $PID is the pid of the shell process
which wrote to drop_caches.

>> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
>> reclaims anything that can be reclaimed immediately.
>
> Awesome. I would just like to know how much easily available memory
> there is on the system without actually reclaiming it, ideally using
> normal kernel mechanisms; but if Lustre provides a procfs entry where
> I can get it, that will solve my immediate problem.
>
>> 4/ Patrick is right that accounting is best-effort.  But we do want
>> it to improve.
>
> Accounting looks better when Lustre is not involved ;) Seriously, how
> can I help? Should I raise a bug? Try to provide a patch?
>
>> Just last week there was a report
>>   https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
>> about making slab-allocated objects movable.  If/when that gets off
>> the ground, it should help the fragmentation problem, so more of the
>> pages listed as reclaimable should actually be so.
>
> This is a very interesting article. While memory fragmentation makes
> it more difficult to use huge pages, it is not directly related to the
> problem of Lustre kernel memory allocation accounting. It will be good
> to see movable slabs, though.
>
> Also, I am not sure how the high signal_cache can be explained, and
> whether anything can be done at the Lustre level?

signal_cache should have one entry for each process (or thread-group).
It holds the signal_struct structure that is shared among the threads
in a group.  So 3.7 million signal_structs suggests there are 3.7
million processes on the system.  I don't think Linux supports more
than 4 million, so that is one very busy system.

Unless... the final "put" of a task_struct happens via call_rcu - so
it can be delayed a while, normally tens of milliseconds, but it can
take seconds to clear a large backlog.  So if you have lots of
processes being created and destroyed very quickly, then you might get
a backlog of task_structs, and the associated signal_structs, waiting
to be destroyed.

However, if the task_struct slab were particularly big, I suspect you
would have included it in the list of large slabs - but you didn't.
If signal_cache has more active entries than task_struct, then
something has gone seriously wrong somewhere.

I doubt this problem is related to Lustre.

NeilBrown
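For reference, the stack snapshots requested above can be gathered with
a small loop along these lines (a sketch, to be run as root; running
the write in the background, the 12-sample count, and the 0.5s spacing
are illustrative choices, not prescribed in this thread):

  # Run the drop_caches write in the background so the writing shell's
  # kernel stack can be sampled while the shrinkers are running.
  sh -c 'echo 2 > /proc/sys/vm/drop_caches' &
  PID=$!

  # Take about a dozen snapshots in quick succession.
  for i in $(seq 1 12); do
      [ -d /proc/$PID ] || break   # stop once the write has completed
      cat /proc/$PID/stack
      echo '---'
      sleep 0.5
  done > drop_caches_stacks.txt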
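On the question of how much memory is easily available without actually
reclaiming it: on reasonably recent kernels, the stock mechanism is the
MemAvailable line in /proc/meminfo, an estimate computed without
triggering any reclaim (and subject to the same accounting caveats
discussed above), e.g.:

  # MemAvailable estimates the memory available to new workloads
  # without swapping; reading it performs no reclaim.
  grep -E '^(MemFree|MemAvailable|SReclaimable|SUnreclaim):' /proc/meminfo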
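And comparing the active counts mentioned at the end is a one-liner
against /proc/slabinfo (readable by root; fields per its header line
are name, active_objs, num_objs, ...):

  # If signal_cache shows far more active objects than task_struct,
  # something has gone wrong, per the reasoning above.
  grep -E '^(signal_cache|task_struct) ' /proc/slabinfo |
      awk '{printf "%-16s active=%s total=%s\n", $1, $2, $3}'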