Jacek,

"Accounting looks better when Lustre is not involved ;) Seriously, how can I help? Should I raise a bug? Try to provide a patch?"

A patch is always welcome.
We require a bug in our JIRA (jira.whamcloud.com) to submit a patch against. (See here for instructions for our Gerrit: https://wiki.whamcloud.com/plugins/servlet/mobile?contentId=7111125#content/view/7111125)

For this one, if we're going in Neil's suggested direction, I'd love a little bit of convincing that other file systems mark their shrinker-associated caches as reclaimable. (Though if Neil insists they *should* do so, then that counts for a lot.)

No idea about the signal cache (I believe it's for timers, and Lustre shouldn't be using an unusual number of those); I would be interested if Neil has anything to add there.

- Patrick

________________________________
From: Jacek Tomaka <jac...@dug.com>
Sent: Sunday, April 14, 2019 9:10:32 PM
To: Patrick Farrell
Cc: NeilBrown; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre client memory and MemoryAvailable

Thanks Patrick for getting the ball rolling!

> 1/ w.r.t drop_caches, "2" is *not* "inode and dentry". The '2' bit
>    causes all registered shrinkers to be run, until they report there is
>    nothing left that can be discarded. If this is taking 10 minutes,
>    then it seems likely that some shrinker is either very inefficient, or
>    is reporting that there is more work to be done, when really there
>    isn't.

This is a pretty common problem on this hardware. KNL's CPU runs at ~1.3GHz, so anything that is not multi-threaded can take several times longer than on a "normal" Xeon. While it would be nice to improve this (by running it in multiple threads), it is not the problem here. However, I can provide you with a kernel call stack next time I see it, if you are interested.

> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
>     reclaims anything that can be reclaimed immediately.

Awesome. I would just like to know how much easily reclaimable memory there is on the system, without actually reclaiming it to find out. Ideally this would use normal kernel mechanisms, but if Lustre provides a procfs entry where I can get it, that will solve my immediate problem.

> 4/ Patrick is right that accounting is best-effort. But we do want it
>    to improve.

Accounting looks better when Lustre is not involved ;) Seriously, how can I help? Should I raise a bug? Try to provide a patch?

> Just last week there was a report
>   https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
> about making slab-allocated objects movable. If/when that gets off
> the ground, it should help the fragmentation problem, so more of the
> pages listed as reclaimable should actually be so.

This is a very interesting article. While memory fragmentation makes it more difficult to use huge pages, it is not directly related to the problem of Lustre kernel memory allocation accounting. It will be good to see movable slabs, though.

Also, I am not sure how the high signal_cache usage can be explained, and whether anything can be done at the Lustre level?

Regards.
Jacek Tomaka

On Mon, Apr 15, 2019 at 8:55 AM Patrick Farrell <pfarr...@whamcloud.com> wrote:

1. Good to know, thank you. I hadn't looked at the code; I was unaware it runs through all the shrinkers.

2. Right, I know - the article was about GFP_TEMPORARY when it was an alias for reclaimable, and hence describes some of the behavior of reclaimable.

3. Interesting, that's good to know. I would note that it doesn't seem to be standard practice in other file systems, though I didn't look at how many shrinkers they're registering. Perhaps having special shrinkers is what's unusual.
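
For concreteness, the change being debated is a single flag at cache-creation time. Here's a minimal sketch - emphatically not Lustre's actual code, the object type and cache name are invented for illustration - of what marking a cache reclaimable looks like:
"
#include <linux/slab.h>

/* Invented example object; stands in for some client-side cache entry. */
struct example_object {
        struct list_head eo_linkage;
        unsigned long    eo_flags;
};

static struct kmem_cache *example_cachep;

static int example_cache_init(void)
{
        /* SLAB_RECLAIM_ACCOUNT is the creation-time counterpart of
         * __GFP_RECLAIMABLE: pages backing this cache are grouped with
         * other reclaimable allocations and counted in SReclaimable
         * rather than SUnreclaim. */
        example_cachep = kmem_cache_create("example_object_kmem",
                                           sizeof(struct example_object),
                                           0, SLAB_RECLAIM_ACCOUNT, NULL);
        return example_cachep != NULL ? 0 : -ENOMEM;
}
"
The flag by itself frees nothing; it changes accounting and page grouping, and it only makes sense where a registered shrinker can actually release the objects under pressure.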
BTW, re: mailing list, this is the first devel-appropriate thing I've seen on discuss in a long while. I should instead have encouraged Jacek to use lustre-devel :)

- Patrick

________________________________
From: NeilBrown <ne...@suse.com>
Sent: Sunday, April 14, 2019 6:38:47 PM
To: Patrick Farrell; Jacek Tomaka; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre client memory and MemoryAvailable

(thanks for the Cc, Patrick - maybe I should subscribe to lustre-discuss...)

1/ w.r.t drop_caches, "2" is *not* "inode and dentry". The '2' bit
   causes all registered shrinkers to be run, until they report there is
   nothing left that can be discarded. If this is taking 10 minutes,
   then it seems likely that some shrinker is either very inefficient, or
   is reporting that there is more work to be done, when really there
   isn't.

   lustre registers 5 shrinkers. Any memcache which is not affected by
   those shrinkers should *not* be marked SLAB_RECLAIM_ACCOUNT (unless
   they are indirectly shrunk by a system shrinker - e.g. if they are
   slaves to the icache or dcache). Any which are, probably can be.

1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
    reclaims anything that can be reclaimed immediately. It doesn't
    trigger write-back, and it doesn't start the oom-killer, but all
    caches are flushed of everything that is not currently in use and
    does not need to be written out first. If you run "sync" first,
    there should be nothing to write out, so it should drop a lot more.

2/ GFP_TEMPORARY is gone; it was never really well defined. Best to
   ignore it.

3/ I don't *think* __GFP_RECLAIMABLE has a very big effect. It
   primarily tries to keep non-reclaimable allocations together so they
   don't cause too much fragmentation. To do this, it groups them
   separately from reclaimable allocations. So if a shrinker is
   expected to do anything useful, then it makes sense to tag the
   related slabs as RECLAIMABLE.

4/ Patrick is right that accounting is best-effort. But we do want it
   to improve. Just last week there was a report
     https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
   about making slab-allocated objects movable. If/when that gets off
   the ground, it should help the fragmentation problem, so more of the
   pages listed as reclaimable should actually be so.

NeilBrown

On Sun, Apr 14 2019, Patrick Farrell wrote:

> echo 1 > drop_caches does not generate memory pressure - it requests
> that the page cache be cleared. It would not be expected to affect
> slab caches much.
>
> You could try 3 (1+2 in this case, where 2 is inode and dentry). That
> might do a bit more, because some (maybe many?) of those objects
> you're looking at would go away if the associated inodes or dentries
> were removed. But fundamentally, drop_caches does not generate memory
> pressure, and does not force reclaim. It drops specific, identified
> caches.
>
> The only way to force *reclaim* is memory pressure.
>
> Your note that a lot more memory than expected was freed under
> pressure does tell us something, though.
>
> It's conceivable Lustre needs to set SLAB_RECLAIM_ACCOUNT on more of
> its slab caches, so this piqued my curiosity.
> My conclusion is no, and here's why:
>
> The one quality reference I was quickly able to find suggests setting
> SLAB_RECLAIM_ACCOUNT wouldn't be so simple:
> https://lwn.net/Articles/713076/
>
> GFP_TEMPORARY is - in practice - just another name for
> __GFP_RECLAIMABLE, and setting SLAB_RECLAIM_ACCOUNT is equivalent to
> setting __GFP_RECLAIMABLE. That article suggests caution is needed:
> the flag should only be used for memory that is certain to be easily
> available, because it changes allocation behavior on the assumption
> that the memory can be quickly freed at need. That is often not true
> of these Lustre objects.
>
> An easy way to learn more about this sort of question is to compare to
> other actively developed file systems in the kernel...
>
> Comparing to other file systems, we see that in general, only the
> inode cache is allocated with SLAB_RECLAIM_ACCOUNT (it varies a bit).
>
> XFS, for example, has only one use of KM_ZONE_RECLAIM, its name for
> this flag - the inode cache:
> "
> xfs_inode_zone = kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode",
>                 KM_ZONE_HWALIGN | KM_ZONE_RECLAIM | KM_ZONE_SPREAD,
>                 xfs_fs_inode_init_once);
> "
>
> btrfs is the same: just the inode cache. EXT4 has a *few* more caches
> marked this way, but not everything.
>
> So, no - I don't think so. It would be atypical for Lustre to set
> SLAB_RECLAIM_ACCOUNT on its slab caches for internal objects.
> Presumably this sort of thing is not considered reclaimable enough for
> this purpose.
>
> I believe if you tried similar tests with other complex file systems
> (XFS might be a good start), you'd see broadly similar behavior.
> (Lustre is probably a bit worse because it has a more complex internal
> object model, so more slab caches.)
>
> VM accounting is distinctly imperfect. The design is such that it's
> often impossible to know how much memory could be made available
> without actually going and trying to free it. There are good,
> intrinsic reasons for some of that, and some of it is design
> artifacts...
>
> I've copied in Neil Brown, who I think only reads lustre-devel, just
> in case he has some particular input on this.
>
> Regards,
> - Patrick
> ________________________________
> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org>
> on behalf of Jacek Tomaka <jac...@dug.com>
> Sent: Sunday, April 14, 2019 3:12:51 AM
> To: lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] Lustre client memory and MemoryAvailable
>
> Actually, I think it is just a bug in the way the slab caches are
> created. Some of them should be passed a flag marking them as
> reclaimable, i.e. something like:
> https://patchwork.kernel.org/patch/9360819/
>
> Regards.
> Jacek Tomaka
>
> On Sun, Apr 14, 2019 at 3:27 PM Jacek Tomaka <jac...@dug.com> wrote:
>
> Hello,
>
> TL;DR;
> Is there a way to figure out how much memory Lustre will make
> available under memory pressure?
>
> Details:
> We are running the Lustre client on Intel Phi KNL machines with 128GB
> of memory (CentOS 7), and in certain situations we see ~10GB+ of
> memory allocated on the kernel side, i.e.:
>
> vvp_object_kmem   3535336 3536986    176   46    2 : tunables 0 0 0 : slabdata  76891  76891 0
> ll_thread_kmem      33511   33511    344   47    4 : tunables 0 0 0 : slabdata    713    713 0
> lov_session_kmem    34760   34760    592   55    8 : tunables 0 0 0 : slabdata    632    632 0
> osc_extent_kmem   3549831 3551232    168   48    2 : tunables 0 0 0 : slabdata  73984  73984 0
> osc_thread_kmem     14012   14116   2832   11    8 : tunables 0 0 0 : slabdata   1286   1286 0
> osc_object_kmem   3546640 3548350    304   53    4 : tunables 0 0 0 : slabdata  66950  66950 0
> signal_cache      3702537 3707144   1152   28    8 : tunables 0 0 0 : slabdata 132398 132398 0
>
> /proc/meminfo:
> MemAvailable:   114196044 kB
> Slab:            11641808 kB
> SReclaimable:     1410732 kB
> SUnreclaim:      10231076 kB
>
> After executing
>
> echo 1 > /proc/sys/vm/drop_caches
>
> the slabinfo values don't change, but when I actually generate memory
> pressure with:
>
> java -Xmx117G -Xms117G -XX:+AlwaysPreTouch -version
>
> lots of memory gets freed:
>
> vvp_object_kmem    127650  127880    176   46    2 : tunables 0 0 0 : slabdata   2780   2780 0
> ll_thread_kmem      33558   33558    344   47    4 : tunables 0 0 0 : slabdata    714    714 0
> lov_session_kmem    34815   34815    592   55    8 : tunables 0 0 0 : slabdata    633    633 0
> osc_extent_kmem    128640  128880    168   48    2 : tunables 0 0 0 : slabdata   2685   2685 0
> osc_thread_kmem     14038   14116   2832   11    8 : tunables 0 0 0 : slabdata   1286   1286 0
> osc_object_kmem     82998   83263    304   53    4 : tunables 0 0 0 : slabdata   1571   1571 0
> signal_cache        38734   44268   1152   28    8 : tunables 0 0 0 : slabdata   1581   1581 0
>
> /proc/meminfo:
> MemAvailable:   123146076 kB
> Slab:             1959160 kB
> SReclaimable:      334276 kB
> SUnreclaim:       1624884 kB
>
> We see a similar effect to generating memory pressure when executing:
>
> echo 3 > /proc/sys/vm/drop_caches
>
> but this can take a very long time (10 minutes).
>
> So essentially, on a machine using the Lustre client, MemAvailable is
> no longer a good predictor of the amount of memory that can be
> allocated. Is there a way to query Lustre and compensate for Lustre
> cache memory that will be made available under memory pressure?
>
> Regards.
> --
> Jacek Tomaka
> Geophysical Software Developer
>
> DownUnder GeoSolutions
> 76 Kings Park Road
> West Perth 6005 WA, Australia
> tel +61 8 9287 4143
> jac...@dug.com
> www.dug.com
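
In the meantime, a rough view of how much slab memory the Lustre caches are holding can be had from userspace by summing the per-cache numbers in /proc/slabinfo (the same numbers shown above). A minimal sketch; the name prefixes are illustrative rather than exhaustive, and num_objs * objsize ignores per-slab overhead, so treat the result as an estimate of what memory pressure might recover, not a guarantee:
"
/* rough-lustre-slab.c: sum num_objs * objsize over slab caches whose
 * names match a few Lustre-looking prefixes.  Reading /proc/slabinfo
 * requires root.  Build with: cc -o rough-lustre-slab rough-lustre-slab.c */
#include <stdio.h>
#include <string.h>

int main(void)
{
        /* Illustrative prefixes only, taken from the cache names in the
         * slabinfo output above.  signal_cache is a core kernel cache,
         * not a Lustre one, so it is deliberately not listed. */
        static const char *prefixes[] = { "vvp_", "ll_", "lov_", "osc_", NULL };
        char line[512], name[64];
        unsigned long active, num, objsize, total = 0;
        FILE *f = fopen("/proc/slabinfo", "r");

        if (!f) {
                perror("/proc/slabinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                /* Data lines start: <name> <active_objs> <num_objs> <objsize>;
                 * header lines fail the numeric conversions and are skipped. */
                if (sscanf(line, "%63s %lu %lu %lu",
                           name, &active, &num, &objsize) != 4)
                        continue;
                for (const char **p = prefixes; *p; p++)
                        if (strncmp(name, *p, strlen(*p)) == 0)
                                total += num * objsize;
        }
        fclose(f);
        printf("approx. Lustre slab usage: %lu kB\n", total / 1024);
        return 0;
}
"
This only covers the slab side, of course; cached file data lives in the ordinary page cache, which MemAvailable already tries to account for.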