Hello there,

This mail is a follow-up to a same-named thread on the SAGE members' mailing list.
THE SHORT VERSION

I've probably found a kernel memory leak in OpenSolaris 2009.06. I ran the findleaks command in mdb on the crash dumps, which yielded:

> > bronto@brabham:/var/crash/brabham# echo '::findleaks' | mdb unix.0 vmcore.0 | tee findleaks.out
> > CACHE             LEAKED            BUFCTL  CALLER
> > ffffff0142828860       1  ffffff014c022a38  AcpiOsAllocate+0x1c
> > ffffff01428265a0       2  ffffff014c1f5148  AcpiOsAllocate+0x1c
> > ffffff0142828860       1  ffffff014c022be8  AcpiOsAllocate+0x1c
> > ffffff0149b85020       1  ffffff015b00fe98  rootnex_coredma_allochdl+0x84
> > ffffff0149b85020       1  ffffff015b00e038  rootnex_coredma_allochdl+0x84
> > ffffff0149b85020       1  ffffff015b00e110  rootnex_coredma_allochdl+0x84
> > ffffff0149b85020       1  ffffff015b0055e8  rootnex_coredma_allochdl+0x84
> > ffffff0149b85020       1  ffffff015b00e2c0  rootnex_coredma_allochdl+0x84
> > ffffff0149b85020       1  ffffff015b00fdc0  rootnex_coredma_allochdl+0x84
> > ------------------------------------------------------------------------
> >            Total       10 buffers, 15424 bytes

Any hint?

THE LONG STORY

I am using OpenSolaris 2009.06; my workstation is an HP-Compaq dc7800 with 2 GB of RAM and a SATA disk of about 200 GB; the video card is an ATI (I know, I know...); the swap space is 2 GB. When the problem first showed up months ago, that was my main workstation. Now I am using another one, mounting my home filesystem from the OpenSolaris machine. When used as a workstation, it also ran an Ubuntu Linux virtual machine on VirtualBox 3.0.4 for all those applications that I couldn't find on OpenSolaris (e.g. Skype).

When the system is freshly booted, it works like a charm: it's quick, responsive, even enjoyable to use. 24 hours later, it is already slower and swaps a lot. 24 more hours, and it's barely usable. It was quite clear to me that the machine was running low on memory.

The first step was to close the greedier applications (e.g. Thunderbird, Firefox, the virtual machine...) when I went out and restart them the next morning, but that didn't change anything. Along the way I spotted an interrupt rate that was a bit too high, even while the system was doing almost nothing; that improved when I disabled the VT-x/AMD-V setting for the Linux virtual machine. I also tried restarting my X session so that the X server would be restarted, and then disabled GDM altogether so as to be sure of getting a fresh X server every time with "startx" from the command line. Nonetheless, the memory problem was still there.

At that point I saved the output of "ps -o pid,ppid,vsz,args -e | sort -nr -k3", restarted the system, re-ran the same ps and compared the two outputs: I found no evidence of ever-growing processes, just slight changes in size. I also tried "prstat -s size": no luck.

Then I kept a vmstat running. When I left, I closed all applications except the terminal window where vmstat was running: I had 2080888 kB of swap and 426932 kB of RAM free. The morning after, the numbers were 1709776 kB of swap and 55460 kB of RAM free.

At that point I thought that the problem might lie in what ps doesn't show. Maybe the drivers. Unfortunately, modinfo didn't shed any light: the output of

  modinfo | perl -alne '$size = $F[2] =~ m{[0-9a-f]}i ? hex($F[2]) : qq{>>>}; print qq{$size\t$F[2]\t$F[0]\t$F[5]}' | sort -nr

showed identical size values for almost all the entries.

Following the suggestions of my colleagues at SAGE, I tried mdb's memstat.
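(For reference, the figures below come from mdb's ::memstat dcmd; on the live kernel that is roughly

  echo '::memstat' | pfexec mdb -k

and the same dcmd can also be run against a crash dump, like ::findleaks above. The invocation is only a sketch of what I did; the output is what matters.)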
With a fresh system it said:

> > Page Summary                Pages                MB  %Tot
> > ------------     ----------------  ----------------  ----
> > Kernel                     123331               481   24%
> > ZFS File Data               65351               255   13%
> > Anon                       264183              1031   51%
> > Exec and libs                4623                18    1%
> > Page cache                  38954               152    8%
> > Free (cachelist)             8410                32    2%
> > Free (freelist)              8249                32    2%
> >
> > Total                      513101              2004
> > Physical                   513100              2004

Later:

> > Page Summary                Pages                MB  %Tot
> > ------------     ----------------  ----------------  ----
> > Kernel                     205125               801   40%
> > ZFS File Data                1536                 6    0%
> > Anon                       281519              1099   55%
> > Exec and libs                1714                 6    0%
> > Page cache                  11927                46    2%
> > Free (cachelist)             6212                24    1%
> > Free (freelist)              5068                19    1%
> >
> > Total                      513101              2004
> > Physical                   513100              2004

and that didn't change a lot when I closed VirtualBox:

> > Page Summary                Pages                MB  %Tot
> > ------------     ----------------  ----------------  ----
> > Kernel                     201160               785   39%
> > ZFS File Data               35228               137    7%
> > Anon                       143764               561   28%
> > Exec and libs                1860                 7    0%
> > Page cache                  12347                48    2%
> > Free (cachelist)            20978                81    4%
> > Free (freelist)             97764               381   19%
> >
> > Total                      513101              2004
> > Physical                   513100              2004

Following Jason King's suggestion I also ran some dtrace scripts, but that still didn't take us much further. This one:

> > bronto@brabham:~$ pfexec dtrace -n 'fbt::kmem_alloc:entry { @a[execname, stack()] = sum(args[0]); } END { trunc(@a, 20) }'

showed a lot of ZFS-related information, of which Jason said:

> > That definately doesn't look right to me.. One other thing just to
> > eliminate.. I think the ARC (zfs's cache) should show up in the ZFS
> > File Data, but to be sure, http://cuddletech.com/arc_summary/ is a
> > handy perl script you can run which will (among other things) tell you
> > how much cache ZFS is using (though it should be releasing that as
> > other apps require it, which wouldn't explain the behavior you're
> > seeing, but just to cross it off the list, it might be worthwhile to
> > try)

and

> > I'd still go for generating the crash dump (reboot -d) and shoot an
> > email to mdb-discuss (maybe with the ::findleaks output attached -- it
> > can be a bit long) to see if anyone can give you some pointers on
> > tracking down the source.
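(For completeness, the crash-dump route from Jason's second suggestion boils down to something like this, as far as I understand it:

  pfexec reboot -d        # force a crash dump, then reboot
  # at the next boot savecore drops unix.N and vmcore.N into the
  # directory configured with dumpadm, /var/crash/brabham in my case
  cd /var/crash/brabham
  echo '::findleaks' | pfexec mdb unix.0 vmcore.0 | tee findleaks.out

and, if I read the MDB docs right, '::findleaks -d' should also print the allocation stack of each leaked buffer when kmem auditing (kmem_flags) is enabled.)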
arc_summary yielded these results:

> > bronto@brabham:~/bin$ ./arc_summary.pl
> > System Memory:
> >          Physical RAM:  2004 MB
> >          Free Memory :  54 MB
> >          LotsFree:      31 MB
> >
> > ZFS Tunables (/etc/system):
> >
> > ARC Size:
> >          Current Size:             351 MB (arcsize)
> >          Target Size (Adaptive):   336 MB (c)
> >          Min Size (Hard Limit):    187 MB (zfs_arc_min)
> >          Max Size (Hard Limit):    1503 MB (zfs_arc_max)
> >
> > ARC Size Breakdown:
> >          Most Recently Used Cache Size:    6%   21 MB (p)
> >          Most Frequently Used Cache Size: 93%   315 MB (c-p)
> >
> > ARC Efficency:
> >          Cache Access Total:        51330312
> >          Cache Hit Ratio:      99%  51130451  [Defined State for buffer]
> >          Cache Miss Ratio:      0%  199861    [Undefined State for Buffer]
> >          REAL Hit Ratio:       99%  50874745  [MRU/MFU Hits Only]
> >
> >          Data Demand Efficiency:    94%
> >          Data Prefetch Efficiency:  33%
> >
> >          CACHE HITS BY CACHE LIST:
> >            Anon:                        0%  209685              [ New Customer, First Cache Hit ]
> >            Most Recently Used:          1%  579174 (mru)        [ Return Customer ]
> >            Most Frequently Used:       98%  50295571 (mfu)      [ Frequent Customer ]
> >            Most Recently Used Ghost:    0%  23712 (mru_ghost)   [ Return Customer Evicted, Now Back ]
> >            Most Frequently Used Ghost:  0%  22309 (mfu_ghost)   [ Frequent Customer Evicted, Now Back ]
> >          CACHE HITS BY DATA TYPE:
> >            Demand Data:                 1%  923119
> >            Prefetch Data:               0%  6709
> >            Demand Metadata:            96%  49210017
> >            Prefetch Metadata:           1%  990606
> >          CACHE MISSES BY DATA TYPE:
> >            Demand Data:                28%  56627
> >            Prefetch Data:               6%  13460
> >            Demand Metadata:            51%  102980
> >            Prefetch Metadata:          13%  26794
> > ---------------------------------------------

In a moment of low load at $WORK I finally generated the crash dumps and ran findleaks on them; the result is at the top of this email.

Any hint?

Ciao
--bronto
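P.S. For whoever wants the raw ARC number without the perl script, the size should also be readable straight from kstat, something along the lines of

  kstat -p zfs:0:arcstats:size

which prints it in bytes.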