Hello there

This mail is a follow-up to a same-named thread on the SAGE Members'
mailing list.

THE SHORT VERSION:

I've probably found a kernel memory leak in OpenSolaris 2009.06. I ran
the ::findleaks command in mdb on the crash dump, which yielded:

> > bronto at brabham:/var/crash/brabham# echo '::findleaks' | mdb unix.0 vmcore.0 | tee findleaks.out
> > CACHE             LEAKED           BUFCTL CALLER
> > ffffff0142828860       1 ffffff014c022a38 AcpiOsAllocate+0x1c
> > ffffff01428265a0       2 ffffff014c1f5148 AcpiOsAllocate+0x1c
> > ffffff0142828860       1 ffffff014c022be8 AcpiOsAllocate+0x1c
> > ffffff0149b85020       1 ffffff015b00fe98 rootnex_coredma_allochdl+0x84
> > ffffff0149b85020       1 ffffff015b00e038 rootnex_coredma_allochdl+0x84
> > ffffff0149b85020       1 ffffff015b00e110 rootnex_coredma_allochdl+0x84
> > ffffff0149b85020       1 ffffff015b0055e8 rootnex_coredma_allochdl+0x84
> > ffffff0149b85020       1 ffffff015b00e2c0 rootnex_coredma_allochdl+0x84
> > ffffff0149b85020       1 ffffff015b00fdc0 rootnex_coredma_allochdl+0x84
> > ------------------------------------------------------------------------
> >            Total      10 buffers, 15424 bytes
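
To dig further, each leaked bufctl reported above can be inspected for
its recorded allocation stack with something like the following (the
address is taken from the first line of the output above, and a verbose
stack is only available if kmem auditing was enabled):

```shell
# Print the verbose bufctl record (including the allocation stack,
# when kmem_flags auditing is on) for one leaked buffer
echo 'ffffff014c022a38::bufctl -v' | mdb unix.0 vmcore.0
```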




Any hint?


THE LONG STORY


I am using OpenSolaris 2009.06; my workstation is an HP-Compaq dc7800,
with 2GB of RAM and a SATA disk of about 200GB; the video card is an
ATI (I know, I know...); the swap space is 2GB.

When the problem first showed up months ago, that was my main
workstation. Now I am using another one, mounting my home filesystem
from the OpenSolaris machine. When it was used as a workstation, it
also ran an Ubuntu Linux virtual machine on VirtualBox 3.0.4 for all
those applications that I couldn't find on OpenSolaris (e.g.: Skype).

When the system is freshly booted, it works like a charm: it's quick,
responsive, even enjoyable to use. 24 hours later, it is already
slower, and swaps a lot. 24 more hours, and it's barely usable. It was
quite clear to me that the machine was running low on memory.

The first step was then to close the greedier applications (e.g.:
Thunderbird, Firefox, the virtual machine...) when I left in the
evening, and restart them the next morning, but that didn't change
anything. Along the way, I spotted an interrupt rate that was a bit
too high, even while the system was doing almost nothing. That problem
improved when I disabled the VT-x/AMD-V setting for the Linux virtual
machine.

I also tried restarting my X session to get the X server restarted,
and even disabled GDM altogether, so that I would surely get a fresh X
server every time by running "startx" from the command line.

Nonetheless, the memory problem was still there.

At that point I saved the output of "ps -o pid,ppid,vsz,args -e | sort
-nr -k3", restarted the system, re-ran the ps and compared the two
outputs: I found no evidence of ever-growing processes, just slight
changes in size. I also tried with prstat -s size: no luck.
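
The comparison above can be sketched like this (the file names are
just an illustration):

```shell
# Snapshot the process list sorted by virtual size, largest first
ps -o pid,ppid,vsz,args -e | sort -nr -k3 > ps.before

# ... wait until the machine has degraded, then take a second snapshot ...
ps -o pid,ppid,vsz,args -e | sort -nr -k3 > ps.after

# Show only the lines that changed between the two snapshots,
# i.e. processes whose size moved
diff ps.before ps.after | grep '^[<>]' | head -20
```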

Then I kept a vmstat running. When I left, I closed all applications
but the terminal window where vmstat was running: I had 2080888 kB of
swap and 426932 kB of RAM free. The morning after, the numbers were
1709776 kB of swap and 55460 kB of RAM free.
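
In case anyone wants to correlate the decline with the time of day, a
timestamped vmstat log can be kept with a loop like this (interval and
file name are arbitrary):

```shell
# Prefix every vmstat sample with a timestamp and append it to a log
vmstat 60 | while read line; do
    echo "$(date '+%F %T') $line"
done >> vmstat.log
```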

At that point, I thought that the problem might be in what ps doesn't
show. Maybe the drivers. Unfortunately, modinfo didn't shed any light:
the output of "modinfo | perl -alne '$size = $F[2] =~ m{[0-9a-f]}i ?
hex($F[2]) : qq{>>>} ;  print qq{$size\t$F[2]\t$F[0]\t$F[5]}' | sort
-nr" showed identical size values for almost all the entries.

Following the suggestions of my colleagues at SAGE, I tried mdb's
::memstat. On a freshly booted system it said:


> > Page Summary                Pages                MB  %Tot
> > ------------     ----------------  ----------------  ----
> > Kernel                     123331               481   24%
> > ZFS File Data               65351               255   13%
> > Anon                       264183              1031   51%
> > Exec and libs                4623                18    1%
> > Page cache                  38954               152    8%
> > Free (cachelist)             8410                32    2%
> > Free (freelist)              8249                32    2%
> >
> > Total                      513101              2004
> > Physical                   513100              2004




Later:


> > Page Summary                Pages                MB  %Tot
> > ------------     ----------------  ----------------  ----
> > Kernel                     205125               801   40%
> > ZFS File Data                1536                 6    0%
> > Anon                       281519              1099   55%
> > Exec and libs                1714                 6    0%
> > Page cache                  11927                46    2%
> > Free (cachelist)             6212                24    1%
> > Free (freelist)              5068                19    1%
> >
> > Total                      513101              2004
> > Physical                   513100              2004


and that didn't change much when I closed VirtualBox:


> > Page Summary                Pages                MB  %Tot
> > ------------     ----------------  ----------------  ----
> > Kernel                     201160               785   39%
> > ZFS File Data               35228               137    7%
> > Anon                       143764               561   28%
> > Exec and libs                1860                 7    0%
> > Page cache                  12347                48    2%
> > Free (cachelist)            20978                81    4%
> > Free (freelist)             97764               381   19%
> >
> > Total                      513101              2004
> > Physical                   513100              2004
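
For the record, the figures above come from the ::memstat dcmd, run
against the live kernel roughly like this (using pfexec for the
privileges is my habit, not a requirement):

```shell
# Summarize physical memory usage by consumer on the live kernel
echo '::memstat' | pfexec mdb -k
```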




Following Jason King's suggestion I also ran some dtrace scripts, but
that still didn't take us much further. This one:

> > bronto at brabham:~$ pfexec dtrace -n 'fbt::kmem_alloc:entry { @a[execname, stack()] = sum(args[0]); } END { trunc(@a, 20) }'



showed a lot of ZFS-related information, of which Jason said:


> > That definately doesn't look right to me.. One other thing just to
> > eliminate.. I think the ARC (zfs's cache) should show up in the ZFS
> > File Data, but to be sure, http://cuddletech.com/arc_summary/ is a
> > handy perl script you can run which will (among other things) tell you
> > how much cache ZFS is using (though it should be releasing that as
> > other apps require it, which wouldn't explain the behavior you're
> > seeing, but just to cross it off the list, it might be worthwhile to
> > try)

and

> > I'd still go for generating the crash dump (reboot -d) and shoot an
> > email to mdb-discuss (maybe with the ::findleaks output attached -- it
> > can be a bit long) to see if anyone can give you some pointers on
> > tracking down the source.

arc_summary yielded these results:

> > bronto at brabham:~/bin$ ./arc_summary.pl
> > System Memory:
> >         Physical RAM:  2004 MB
> >         Free Memory :  54 MB
> >         LotsFree:      31 MB
> >
> > ZFS Tunables (/etc/system):
> >
> > ARC Size:
> >         Current Size:             351 MB (arcsize)
> >         Target Size (Adaptive):   336 MB (c)
> >         Min Size (Hard Limit):    187 MB (zfs_arc_min)
> >         Max Size (Hard Limit):    1503 MB (zfs_arc_max)
> >
> > ARC Size Breakdown:
> >         Most Recently Used Cache Size:           6%    21 MB (p)
> >         Most Frequently Used Cache Size:        93%    315 MB (c-p)
> >
> > ARC Efficency:
> >         Cache Access Total:             51330312
> >         Cache Hit Ratio:      99%       51130451       [Defined State
> > for buffer]
> >         Cache Miss Ratio:      0%       199861         [Undefined
> > State for Buffer]
> >         REAL Hit Ratio:       99%       50874745       [MRU/MFU Hits Only]
> >
> >         Data Demand   Efficiency:    94%
> >         Data Prefetch Efficiency:    33%
> >
> >        CACHE HITS BY CACHE LIST:
> >          Anon:                        0%        209685                 [ 
> > New Customer, First Cache Hit ]
> >          Most Recently Used:          1%        579174 (mru) [ Return 
> > Customer ]
> >          Most Frequently Used:       98%        50295571 (mfu) [ Frequent 
> > Customer ]
> >          Most Recently Used Ghost:    0%        23712 (mru_ghost)      [ 
> > Return Customer Evicted, Now Back ]
> >          Most Frequently Used Ghost:  0%        22309 (mfu_ghost) [ 
> > Frequent Customer Evicted, Now Back ]
> >        CACHE HITS BY DATA TYPE:
> >          Demand Data:                 1%        923119
> >          Prefetch Data:               0%        6709
> >          Demand Metadata:            96%        49210017
> >          Prefetch Metadata:           1%        990606
> >        CACHE MISSES BY DATA TYPE:
> >          Demand Data:                28%        56627
> >          Prefetch Data:               6%        13460
> >          Demand Metadata:            51%        102980
> >          Prefetch Metadata:          13%        26794
> > ---------------------------------------------


In a moment of low load at $WORK I finally generated the crash dump
and ran ::findleaks on it; the result is at the top of this email.
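
For completeness, that went roughly along the lines Jason suggested
(the dumpadm check beforehand is just my own precaution, not something
he mentioned):

```shell
# Check where crash dumps will be written and what they will contain
pfexec dumpadm

# Force a crash dump without an actual panic, and reboot
pfexec reboot -d

# After reboot, savecore leaves the dump under /var/crash/<hostname>;
# run ::findleaks against it and keep a copy of the output
cd /var/crash/brabham
echo '::findleaks' | pfexec mdb unix.0 vmcore.0 | tee findleaks.out
```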

Any hint?

Ciao
--bronto
