Hi Tovarishch Jim,

I had a similar issue with my box, and it turned out to be related to NFS locks. I assume you are using NFS for the Linux backups. The solution was posted by Chip on the mailing list; a copy of it is below:

"I've seen issues like this when you run out of NFS locks. NFSv3 in Illumos is really slow at releasing locks.

On all my NFS servers I do:

sharectl set -p lockd_listen_backlog=256 nfs
sharectl set -p lockd_servers=2048 nfs

Everywhere I can, I use NFSv4 instead of v3. It handles lock much better."

All the Best
Yavor

On Thu, Oct 22, 2015 at 11:59 AM, Jim Klimov <j...@cos.ru> wrote:

> Hello all,
>
> I have this HP-Z400 workstation with 16Gb ECC(should be) RAM running
> OmniOS bloody, which acts as a backup server for our production systems
> (regularly rsync'ing large files off Linux boxes, and rotating ZFS
> auto-snapshots to keep its space free). Sometimes it also runs replicas of
> infrastructure (DHCP, DNS) and was set up as a VirtualBox + phpVirtualBox
> host to test that out, but no VMs running.
>
> So the essential loads are ZFS snapshots and ZFS scrubs :)
>
> And it freezes roughly every week. Stops responding to ping, attempts to
> log in via SSH or physical console - it processes keypresses on the latter,
> but does not present a login prompt. It used to be stable, and such regular
> hangs began around summertime.
>
> My primary guess would be for flaky disks, maybe timing out under load or
> going to sleep or whatever... But I have yet to prove it, or any other
> theory. Maybe just CPU is overheating due to regular near-100% load with
> disk I/O... At least I want to rule out OS errors and rule out (or point
> out) operator/box errors as much as possible - which is something I can
> change to try and fix ;)
>
> Before I proceed to TL;DR screenshots, I'd overview what I see:
>
> * In the "top" output, processes owned by zfssnap lead most of the time...
> But even the SSH shell is noticeably slow to respond (1 sec per line when
> just pressing enter to clear the screen to prepare nice screenshots).
>
> * SMART was not enabled on 3TB mirrored "pool" SATA disks (is now, long
> tests initiated), but was in place on the "rpool" SAS disk where it logged
> some corrected ECC errors - but none uncorrected.
>
> Maybe the cabling should be reseated.
>
> * iostat shows disks are generally not busy (they don't audibly rattle nor
> visibly blink all the time, either)
>
> * zpool scrubs return clean
>
> * there are partitions of the system rpool disk (10K RPM SAS) used as log
> and cache devices for the main data pool on 3TB SATA disks. The system disk
> is fast and underutilized, so what the heck ;) And it was not a problem for
> the first year of this system's honest and stable workouts. These devices
> are pretty empty at the moment.
>
> I have enabled deadman panics according to Wiki, but none have happened so
> far:
>
> # cat /etc/system | egrep -v '(^\*|^$)'
> set snooping=1
> set pcplusmp:apic_panic_on_nmi=1
> set apix:apic_panic_on_nmi = 1
>
> In the "top" output, processes owned by zfssnap lead most of the time:
>
> last pid: 22599;  load avg: 12.9, 12.2, 11.2;  up 0+09:52:11   18:34:41
> 140 processes: 125 sleeping, 13 running, 2 on cpu
> CPU states: 0.0% idle, 22.9% user, 77.1% kernel, 0.0% iowait, 0.0% swap
> Memory: 16G phys mem, 1765M free mem, 2048M total swap, 2048M free swap
> Seconds to delay:
>    PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
>  21389 zfssnap    1  43    2  863M  860M run      5:04 35.61% zfs
>  22360 zfssnap    1  52    2  118M  115M run      0:37 16.50% zfs
>  21778 zfssnap    1  52    2  563M  560M run      3:15 13.17% zfs
>  21278 zfssnap    1  52    2  947M  944M run      5:32  6.91% zfs
>  21881 zfssnap    1  43    2  433M  431M run      2:31  5.41% zfs
>  21852 zfssnap    1  52    2  459M  456M run      2:39  5.16% zfs
>  21266 zfssnap    1  43    2  906M  903M run      5:18  3.95% zfs
>  21757 zfssnap    1  43    2  597M  594M run      3:26  2.91% zfs
>  21274 zfssnap    1  52    2  930M  927M cpu/0    5:27  2.78% zfs
>  22588 zfssnap    1  43    2   30M   27M run      0:08  2.48% zfs
>  22580 zfssnap    1  52    2   49M   46M run      0:14  0.71% zfs
>  22038 root       1  59    0 5312K 3816K cpu/1    0:01  0.10% top
>  22014 root       1  59    0 8020K 4988K sleep    0:00  0.02% sshd
>
> Average "iostats" are not that busy:
>
> # zpool iostat -Td 5
> Thu Oct 22 18:24:59 CEST 2015
>                capacity     operations    bandwidth
> pool        alloc   free   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> pool        2.52T   207G    802    116  28.3M   840K
> rpool       33.0G   118G      0      4  4.52K  58.7K
> ----------  -----  -----  -----  -----  -----  -----
>
> Thu Oct 22 18:25:04 CEST 2015
> pool        2.52T   207G      0      0      0      0
> rpool       33.0G   118G      0     10      0  97.9K
> ----------  -----  -----  -----  -----  -----  -----
> Thu Oct 22 18:25:09 CEST 2015
> pool        2.52T   207G      0      0      0      0
> rpool       33.0G   118G      0      0      0      0
> ----------  -----  -----  -----  -----  -----  -----
> Thu Oct 22 18:25:14 CEST 2015
> pool        2.52T   207G      0      0      0      0
> rpool       33.0G   118G      0      9      0  93.5K
> ----------  -----  -----  -----  -----  -----  -----
> Thu Oct 22 18:25:19 CEST 2015
> pool        2.52T   207G      0      0      0      0
> rpool       33.0G   118G      0      0      0      0
> ----------  -----  -----  -----  -----  -----  -----
> Thu Oct 22 18:25:24 CEST 2015
> pool        2.52T   207G      0      0      0      0
> rpool       33.0G   118G      0      0      0      0
> ----------  -----  -----  -----  -----  -----  -----
> Thu Oct 22 18:25:29 CEST 2015
> pool        2.52T   207G      0      0      0      0
> rpool       33.0G   118G      0      0      0      0
> ----------  -----  -----  -----  -----  -----  -----
> Thu Oct 22 18:25:34 CEST 2015
> pool        2.52T   207G      0      0      0      0
> rpool       33.0G   118G      0      0      0      0
> ----------  -----  -----  -----  -----  -----  -----
> Thu Oct 22 18:25:39 CEST 2015
> pool        2.52T   207G      0      0      0      0
> rpool       33.0G   118G      0     16      0   374K
> ----------  -----  -----  -----  -----  -----  -----
> ...
>
> Thu Oct 22 18:33:49 CEST 2015
> pool        2.52T   207G      0      0      0      0
> rpool       33.0G   118G      0     11      0  94.5K
> ----------  -----  -----  -----  -----  -----  -----
> Thu Oct 22 18:33:54 CEST 2015
> pool        2.52T   207G      0     13    819  80.0K
> rpool       33.0G   118G      0      0      0      0
> ----------  -----  -----  -----  -----  -----  -----
> Thu Oct 22 18:33:59 CEST 2015
> pool        2.52T   207G      0    129      0  1.06M
> rpool       33.0G   118G      0      0      0      0
> ----------  -----  -----  -----  -----  -----  -----
> Thu Oct 22 18:34:04 CEST 2015
> pool        2.52T   207G      0     55      0   503K
> rpool       33.0G   118G      0     11      0  97.9K
> ----------  -----  -----  -----  -----  -----  -----
> ...
>
> just occasional bursts of work.
>
> I've now enabled SMART on the disks (2*3Tb mirror "pool" and 1*300Gb
> "rpool") and ran some short tests and triggered long tests (hopefully
> they'd succeed by tomorrow); current results are:
>
> # for D in /dev/rdsk/c0*s0; do echo "===== $D :"; smartctl -d sat,12 -a $D ; done ; for D in /dev/rdsk/c4*s0 ; do echo "===== $D :"; smartctl -d scsi -a $D ; done
> ===== /dev/rdsk/c0t3d0s0 :
> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
> Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Device Model:     WDC WD3003FZEX-00Z4SA0
> Serial Number:    WD-WCC5D1KKU0PA
> LU WWN Device Id: 5 0014ee 2610716b7
> Firmware Version: 01.01A01
> User Capacity:    3,000,592,982,016 bytes [3.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    7200 rpm
> Device is:        Not in smartctl database [for details use: -P showall]
> ATA Version is:   ACS-2 (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is:    Thu Oct 22 18:45:28 2015 CEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      ( 249) Self-test routine in progress...
>                                         90% of test remaining.
> Total time to complete Offline
> data collection:                (32880) seconds.
> Offline data collection
> capabilities:                    (0x7b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        ( 357) minutes.
> Conveyance self-test routine
> recommended polling time:        (   5) minutes.
> SCT capabilities:              (0x7035) SCT Status supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   246   154   021    Pre-fail  Always       -       6691
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       14
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4869
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
>  16 Unknown_Attribute       0x0022   130   070   000    Old_age   Always       -       2289651870502
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
> 194 Temperature_Celsius     0x0022   117   111   000    Old_age   Always       -       35
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%      4869         -
>
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> ===== /dev/rdsk/c0t5d0s0 :
> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
> Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate SV35
> Device Model:     ST3000VX000-1ES166
> Serial Number:    Z500S3L8
> LU WWN Device Id: 5 000c50 079e3757b
> Firmware Version: CV26
> User Capacity:    3,000,592,982,016 bytes [3.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    7200 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
> SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is:    Thu Oct 22 18:45:28 2015 CEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x00) Offline data collection activity
>                                         was never started.
>                                         Auto Offline Data Collection: Disabled.
> Self-test execution status:      ( 249) Self-test routine in progress...
>                                         90% of test remaining.
> Total time to complete Offline
> data collection:                (   80) seconds.
> Offline data collection
> capabilities:                    (0x73) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         No Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        ( 325) minutes.
> Conveyance self-test routine
> recommended polling time:        (   2) minutes.
> SCT capabilities:              (0x10b9) SCT Status supported.
>                                         SCT Error Recovery Control supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       8600880
>   3 Spin_Up_Time            0x0003   096   094   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       19
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   085   060   030    Pre-fail  Always       -       342685681
>   9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       4214
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       19
> 184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
> 189 High_Fly_Writes         0x003a   028   028   000    Old_age   Always       -       72
> 190 Airflow_Temperature_Cel 0x0022   069   065   045    Old_age   Always       -       31 (Min/Max 29/32)
> 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
> 192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       19
> 193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       28
> 194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 20 0 0 0)
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                         Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Self-test routine in progress        90%      4214         -
> # 2  Short offline       Completed without error              00%      4214         -
>
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> ===== /dev/rdsk/c4t5000CCA02A1292DDd0s0 :
> smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
> Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
>
> Vendor:               HITACHI
> Product:              HUS156030VLS600
> Revision:             HPH1
> User Capacity:        300,000,000,000 bytes [300 GB]
> Logical block size:   512 bytes
> Logical Unit id:      0x5000cca02a1292dc
> Serial number:        LVVA6NHS
> Device type:          disk
> Transport protocol:   SAS
> Local Time is:        Thu Oct 22 18:45:29 2015 CEST
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature:     45 C
> Drive Trip Temperature:        70 C
> Manufactured in week 14 of year 2012
> Specified cycle count over device lifetime:  50000
> Accumulated start-stop cycles:  80
> Elements in grown defect list: 0
> Vendor (Seagate) cache information
>   Blocks sent to initiator = 2340336504406016
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:          0   888890         0    888890          0      29326.957           0
> write:         0   961315         0    961315          0       6277.560           0
>
> Non-medium error count:      283
>
> SMART Self-test log
> Num  Test              Status                      segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                                   number   (hours)
> # 1  Background long   Self test in progress ...      -       NOW          - [-   -    -]
> # 2  Background long   Aborted (device reset ?)       -     14354          - [-   -    -]
> # 3  Background short  Completed                      -     14354          - [-   -    -]
> # 4  Background long   Aborted (device reset ?)       -     14354          - [-   -    -]
> # 5  Background long   Aborted (device reset ?)       -     14354          - [-   -    -]
>
> Long (extended) Self Test duration: 2506 seconds [41.8 minutes]
>
> The zpool scrub results and general layout:
>
> # zpool status -v
>   pool: pool
>  state: ONLINE
>   scan: scrub repaired 0 in 164h13m with 0 errors on Thu Oct 22 18:13:33 2015
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         pool                       ONLINE       0     0     0
>           mirror-0                 ONLINE       0     0     0
>             c0t3d0                 ONLINE       0     0     0
>             c0t5d0                 ONLINE       0     0     0
>         logs
>           c4t5000CCA02A1292DDd0p2  ONLINE       0     0     0
>         cache
>           c4t5000CCA02A1292DDd0p3  ONLINE       0     0     0
>
> errors: No known data errors
>
>   pool: rpool
>  state: ONLINE
> status: Some supported features are not enabled on the pool. The pool can
>         still be used, but some features are unavailable.
> action: Enable all features using 'zpool upgrade'. Once this is done,
>         the pool may no longer be accessible by software that does not support
>         the features. See zpool-features(5) for details.
>   scan: scrub repaired 0 in 3h3m with 0 errors on Thu Oct 8 04:12:35 2015
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         rpool                      ONLINE       0     0     0
>           c4t5000CCA02A1292DDd0s0  ONLINE       0     0     0
>
> errors: No known data errors
>
> # zpool list -v
> NAME                        SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
> pool                       2.72T  2.52T   207G         -    68%    92%  1.36x  ONLINE  /
>   mirror                   2.72T  2.52T   207G         -    68%    92%
>     c0t3d0                     -      -      -         -      -      -
>     c0t5d0                     -      -      -         -      -      -
> log                            -      -      -         -      -      -
>   c4t5000CCA02A1292DDd0p2     8G   148K  8.00G         -     0%     0%
> cache                          -      -      -         -      -      -
>   c4t5000CCA02A1292DDd0p3   120G  1.80G   118G         -     0%     1%
> rpool                       151G  33.0G   118G         -    76%    21%  1.00x  ONLINE  -
>   c4t5000CCA02A1292DDd0s0   151G  33.0G   118G         -    76%    21%
>
> Note the long scrub time may have included the downtime while the system
> was frozen until it was rebooted.
>
> Thanks in advance for the fresh pairs of eyeballs,
> Jim Klimov
>
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss@lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss
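
In case it helps, here is a rough sketch of how the two settings from Chip's note could be previewed before applying them on the NFS server. The function name nfs_lock_tuning and the APPLY switch are made up for this example; only the two "sharectl set" commands come from his note, and the values are his, not tuned for your box.

```shell
#!/bin/sh
# Sketch only: preview or apply the two lockd settings quoted from Chip.
# nfs_lock_tuning and APPLY are hypothetical names used for illustration.

nfs_lock_tuning() {
    for setting in lockd_listen_backlog=256 lockd_servers=2048; do
        if [ "${APPLY:-0}" = "1" ]; then
            # Real run: needs root on the illumos/OmniOS NFS server.
            sharectl set -p "$setting" nfs || return 1
        else
            # Dry run (default): only print the command that would be executed.
            echo "sharectl set -p $setting nfs"
        fi
    done
}
```

Calling nfs_lock_tuning on its own only prints the two commands. Running it as root with APPLY=1 executes them, and "sharectl get nfs" afterwards should list the new values.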