Hi Andreas,

Thanks for your response. I will try to run the leak-finder script and hopefully it will point us in the right direction. This only seems to be happening on some of my clients:
--
client112: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client108: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client110: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client107: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client111: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client109: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client102: ll_obdo_cache 5 38 208 19 1 : tunables 120 60 8 : slabdata 2 2 1
client114: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client105: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client103: ll_obdo_cache 0 0 208 19 1 : tunables 120 60 8 : slabdata 0 0 0
client104: ll_obdo_cache 0 433506280 208 19 1 : tunables 120 60 8 : slabdata 0 22816120 0
client116: ll_obdo_cache 0 457366746 208 19 1 : tunables 120 60 8 : slabdata 0 24071934 0
client113: ll_obdo_cache 0 456778867 208 19 1 : tunables 120 60 8 : slabdata 0 24040993 0
client106: ll_obdo_cache 0 456372267 208 19 1 : tunables 120 60 8 : slabdata 0 24019593 0
client115: ll_obdo_cache 0 449929310 208 19 1 : tunables 120 60 8 : slabdata 0 23680490 0
client101: ll_obdo_cache 0 454318101 208 19 1 : tunables 120 60 8 : slabdata 0 23911479 0
--

Hopefully this helps. I am not sure which application might be causing the leaks; currently R is the only application that users seem to be running heavily on these clients. I will let you know what I find.

Thanks again,
-J

On Mon, Apr 19, 2010 at 9:04 PM, Andreas Dilger <[email protected]> wrote:

> On 2010-04-19, at 11:16, Jagga Soorma wrote:
>
>> What is the known problem with the DLM LRU size?
>
> It is mostly a problem on the server, actually.
>
>> Here is what my slabinfo/meminfo look like on one of the clients. I don't
>> see anything out of the ordinary:
>> (then again there are no jobs currently running on this system)
>>
>> slabinfo - version: 2.1
>> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
>>
>> ll_async_page 326589 328572 320 12 1 : tunables 54 27 8 : slabdata 27381 27381 0
>
> This shows you have 326589 pages in the Lustre filesystem cache, or about
> 1275MB of data. That shouldn't be too much for a system with 192GB of RAM...
>
>> lustre_inode_cache 769 772 896 4 1 : tunables 54 27 8 : slabdata 193 193 0
>> ldlm_locks 2624 3688 512 8 1 : tunables 54 27 8 : slabdata 461 461 0
>> ldlm_resources 2002 3340 384 10 1 : tunables 54 27 8 : slabdata 334 334 0
>
> Only about 2600 locks on 770 files is fine (this is what the DLM LRU size
> would affect, if it were out of control, which it isn't).
>
>> ll_obdo_cache 0 452282156 208 19 1 : tunables 120 60 8 : slabdata 0 23804324 0
>
> This is really out of whack. The "obdo" struct should normally only be
> allocated for a short time and then freed again, but here you have 452M of
> them using over 90GB of RAM. It looks like a leak of some kind, which is a
> bit surprising since we have fairly tight checking for memory leaks in the
> Lustre code.
>
> Are you running some unusual workload that is maybe walking an unusual
> code path?
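(As a quick sanity check of those figures, and purely as an illustration rather than part of Andreas's procedure: multiplying <num_objs> by <objsize> from /proc/slabinfo approximates the memory pinned by a slab cache. It ignores partial-slab slack; the exact page count is <num_slabs> * <pagesperslab> * PAGE_SIZE. The field positions below assume the slabinfo 2.1 layout shown in the quote.)

   client# awk '$1 == "ll_obdo_cache" { printf "%s: %d objs x %d B = %.1f GB\n", $1, $3, $4, $3*$4/1e9 }' /proc/slabinfo

With the numbers above that works out to 452282156 x 208 bytes, roughly 94 GB, on the worst client, which matches the "over 90GB" figure, while the 326589 cached pages in ll_async_page at 4 KB each are the roughly 1275 MB mentioned earlier.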
> What you can do to track down memory leaks is enable Lustre memory tracing,
> increase the size of the debug buffer so it catches enough tracing to be
> useful, run your job, dump the kernel debug log, and then run leak-finder.pl
> (attached, and also in the Lustre sources) against it:
>
> client# lctl set_param debug=+malloc
> client# lctl set_param debug_mb=256
> client$ {run job}
> client# sync
> client# lctl dk /tmp/debug
> client# perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa"
> client# lctl set_param debug=-malloc
> client# lctl set_param debug_mb=32
>
> Since this is a running system, it will report spurious leaks for some kinds
> of allocations that remain in memory for some time (e.g. cached pages,
> inodes, etc.), but with the exception of uncommitted RPCs (of which there
> should be none after the sync) there should not be any leaked obdo.
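(For what it is worth, the capture steps above can be strung together in one small script. This is only a sketch of the same commands, run as root for the lctl calls; run_r_job.sh stands in for whatever workload reproduces the growth, and /tmp/obdo-leaks.txt is an arbitrary output file.)

   #!/bin/sh
   # Sketch: wrap one job run with Lustre malloc tracing, then scan for leaked obdo.
   lctl set_param debug=+malloc    # enable allocation tracing
   lctl set_param debug_mb=256     # enlarge the kernel debug buffer
   ./run_r_job.sh                  # placeholder for the actual job
   sync                            # flush, so uncommitted RPCs are not reported as leaks
   lctl dk /tmp/debug              # dump the kernel debug log
   perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa" > /tmp/obdo-leaks.txt
   lctl set_param debug=-malloc    # restore the defaults
   lctl set_param debug_mb=32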
>> On 2010-04-19, at 10:43, Jagga Soorma <[email protected]> wrote:
>>
>>> My users are reporting some issues with memory on our lustre 1.8.1
>>> clients. It looks like when they submit a single job at a time the run
>>> time is about 4.5 minutes. However, when they ran multiple jobs (10 or
>>> fewer) on a client with 192GB of memory on a single node, the run time
>>> for each job exceeded 3-4X the run time of the single process. They also
>>> noticed that the swap usage kept climbing even though there was plenty
>>> of free memory on the system. Could this possibly be related to the
>>> lustre client? Does it reserve any memory that is not accessible by any
>>> other process even though it might not be in use?
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Engineer, Lustre Group
> Oracle Corporation Canada Inc.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
