I am having a very odd problem, and so far the folks at Oracle Support have not provided a working solution, so I am asking the crowd here while still pursuing it via Oracle Support.
The system is a T2000 running 10U9 with CPU-2010-01 and two J4400s loaded with 1 TB SATA drives. There is one zpool on the J4400s (3 x 15-disk vdevs + 3 hot spares). This system is the target for zfs send / recv replication from our production server. The OS is on UFS on local disk.

While I was on vacation this T2000 hung with "out of resource" errors. Other staff tried rebooting, which hung the box. Then they rebooted off of an old BE (10U9 without CPU-2010-01). Oracle Support had them apply a couple of patches and an IDR to address zfs "stability and reliability problems" as well as set the following in /etc/system:

    set zfs:zfs_arc_max = 0x700000000 (which is 28 GB)
    set zfs:arc_meta_limit = 0x700000000 (which is 28 GB)

The system has 32 GB RAM and 32 (virtual) CPUs. They then tried importing the zpool and the system hung (after many hours) with the same "out of resource" error. At this point they left the problem for me :-(

I removed the zfs.cache from the 10U9 + CPU 2010-10 BE and booted from that. I then applied the IDR (IDR146118-12) and the zfs patch it depended on (145788-03). I did not include the zfs arc and zfs arc meta limits as I did not think they were relevant. A zpool import shows the pool is OK, and a sampling of the drives with zdb -l shows good labels. I started importing the zpool and after many hours it hung the system with "out of resource" errors.

I had a number of tools running to see what was going on. The only thing this system is doing is importing the zpool. ARC had climbed to about 8 GB and then declined to 3 GB by the time the system hung. This tells me that something else is consuming RAM and the ARC is releasing memory in response. The hung TOP screen showed that the largest user process had only 148 MB allocated (and much less resident). VMSTAT showed a scan rate of over 900,000 (NOT a typo) and almost 8 GB of free swap (so whatever is using memory cannot be paged out).
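As a quick sanity check on the sizes quoted above, the hex values can be converted with shell arithmetic. This is only a sketch: `to_gib` is a hypothetical helper name, and it assumes bash or another shell with 64-bit `$(( ))` arithmetic (the stock Solaris 10 /bin/sh does not have it).

```shell
#!/usr/bin/env bash
# Convert a byte count (decimal or 0x-prefixed hex) to whole GiB,
# truncating any fraction. Assumes 64-bit $(( )) arithmetic.
to_gib() {
    echo $(( $1 / 1024 / 1024 / 1024 ))
}

to_gib 0x700000000   # zfs_arc_max / arc_meta_limit from /etc/system; prints 28
to_gib 4065882656    # ARC size logged at the hang; prints 3 (fraction truncated)
```

Both limits therefore sit at 28 GB on a 32 GB box, leaving about 4 GB for everything else.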
So my guess is that there is a kernel module that is consuming all (and more) of the RAM in the box. I am looking for a way to query how much RAM each kernel module is using and script that in a loop (which will hang when the box runs out of RAM next). I am very open to suggestions here.

Since this is the recv end of replication, I assume there was a zfs recv going on at the time the system initially hung. I know there was a 3+ TB snapshot replicating (via a 100 Mbps WAN link) when I left for vacation, and that may have still been running. I also assume that any partial snapshots (% instead of @) are being removed when the pool is imported. But what could be causing a partial snapshot removal, even of a very large snapshot, to run the system out of RAM? What caused the initial hang of the system (I assume due to running out of RAM)? I did not think there was a limit to the size of either a snapshot or a zfs recv.

Hung TOP screen:

load averages: 91.43, 33.48, 18.989    xxx-xxx1    18:45:34
84 processes: 69 sleeping, 12 running, 1 zombie, 2 on cpu
CPU states: 95.2% idle, 0.5% user, 4.4% kernel, 0.0% iowait, 0.0% swap
Memory: 31.9G real, 199M free, 267M swap in use, 7.7G swap free

  PID USERNAME THR PR NCE  SIZE   RES STATE    TIME FLTS   CPU COMMAND
  533 root      51 59   0  148M 30.6M run    520:21    0 9.77% java
 1210 yyyyyy     1  0   0 5248K 1048K cpu25    2:08    0 2.23% xload
14720 yyyyyy     1 59   0 3248K 1256K cpu24    1:56    0 0.03% top
  154 root       1 59   0 4024K 1328K sleep    1:17    0 0.02% vmstat
 1268 yyyyyy     1 59   0 4248K 1568K sleep    1:26    0 0.01% iostat
...
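For the kernel-memory question above, one possible approach is a loop around `mdb -k` and `kstat`, sketched below. It rests on several assumptions: the log path, interval, and function names are hypothetical; it must run as root; and `::kmastat` reports usage per kmem cache rather than per kernel module, so it is only the closest approximation I know of. `::memstat` walks every page and can itself be slow on a box under memory pressure.

```shell
#!/bin/sh
# Sketch only: periodically log kernel memory statistics so the last
# sample before a hang survives on disk. LOG and INTERVAL are
# hypothetical defaults, not values from the original post.
LOG=${LOG:-/var/tmp/kmem-watch.log}
INTERVAL=${INTERVAL:-60}

# Take one sample. Needs root, since mdb -k reads the live kernel.
snapshot() {
    echo "=== sample ==="
    date
    # Page-level summary of where physical memory is going.
    echo "::memstat" | mdb -k
    # Per-kmem-cache usage (per cache, not per module).
    echo "::kmastat" | mdb -k
    # Current ARC size in bytes.
    kstat -p zfs:0:arcstats:size
}

# Append samples forever; sync so the log reaches disk before a hang.
watch_kmem() {
    while :; do
        snapshot >> "$LOG" 2>&1
        sync
        sleep "$INTERVAL"
    done
}

# Start the loop only when invoked with the argument "run".
if [ "${1:-}" = "run" ]; then
    watch_kmem
fi
```

Saved under a hypothetical name such as kmem-watch.sh, it would be started with `sh kmem-watch.sh run` before kicking off the zpool import. Comparing the last few `::kmastat` samples in the log should show which caches grew while the ARC shrank.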
VMSTAT:

 kthr         memory              page               disk       faults     cpu
 r b   w    swap   free re mf pi po  fr de     sr m0 m1 m2 m3  in  sy  cs us sy id
 0 0 112 8117096 211888 55 46  0  0 425  0 912684  0  0  0  0 976 166 836  0  2 98
 0 0 112 8117096 211936 53 51  6  0 394  0 926702  0  0  0  0 976 167 833  0  2 98

ARC size (B): 4065882656

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Designer: Frankenstein, A New Musical (http://www.facebook.com/event.php?eid=123170297765140)
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players