Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
Hi,

periodically, roughly once per minute, we see I/O hang for about 10 seconds. Each time this happens, the I/O rate simply drops to zero and all disk access hangs; this is also very noticeable on the shell, for NFS clients, etc. Everything else (networking, kernel, …) seems to continue normally.

Environment: FreeBSD 9.1R GENERIC on amd64, using ZFS, on an ARC1320 PCIe with 24x Seagate ST33000650SS (3rd party arcsas.ko driver).

It's easy to observe these hangs under write load, e.g. with 'zpool iostat 1':

    void  22.4T  42.6T     34  2.73K  1.07M   293M
    void  22.4T  42.6T     20  2.74K   623K   289M
    void  22.4T  42.6T    144  2.62K  4.83M   279M
    void  22.4T  42.6T     13  2.60K   437K   283M
    void  22.4T  42.6T      0      0      0      0   -- hang starts
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0    296  4.00K  34.2M   -- hang ends
    void  22.4T  42.6T      2  2.64K  73.8K   288M
    void  22.4T  42.6T      8  3.12K   278K   329M

Each time this happens, there is a completely unexplained spike of interrupts on uhci0: 'systat -vm' then displays numbers around 270k.

    # vmstat -i | grep -E '(arcsas|uhci0|Total)'
    irq16: uhci0      1227020890  67708
    irq24: arcsas0    12045211664
    Total             1266417827  69882

Things to note:
- Booting a USB-less kernel or disabling all USB in the BIOS doesn't change a thing (no interrupt spikes to be seen, but the hangs remain)
- The hangs / interrupt spikes happen just as often when the system is idle
- Board is a Supermicro x8dth
- There are two igb cards
- Root is ZFS as well (separate pool though)
- BIOS, Areca FW and driver are already the latest versions
- Moving the controller to a different slot doesn't change the behaviour
- We have two identical systems and both show the exact same symptoms, so flaky hardware is probably not the issue

Any ideas would be appreciated.

Thanks,
D.
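To log the hang windows instead of watching the screen, the 'zpool iostat 1' output can be piped through a small filter. This is only a sketch, not something from the original report: the function name find_hangs is made up for illustration, and it assumes the usual seven-column layout (pool, alloc, free, read ops, write ops, read bandwidth, write bandwidth) with one sample per line.

```shell
#!/bin/sh
# find_hangs (hypothetical helper): read `zpool iostat <pool> 1` lines on stdin
# and report every run of consecutive samples whose four I/O columns are all zero.
# Assumes 7 columns: pool alloc free read-ops write-ops read-bw write-bw.
find_hangs() {
    awk '
        # A stalled sample: all four I/O columns are exactly "0".
        NF == 7 && $4 == "0" && $5 == "0" && $6 == "0" && $7 == "0" {
            if (run == 0) start = NR
            run++
            next
        }
        # Any other line (busy sample or header) ends the current run.
        {
            if (run > 0) printf "hang: %d samples starting at line %d\n", run, start
            run = 0
        }
        END {
            if (run > 0) printf "hang: %d samples starting at line %d\n", run, start
        }
    '
}
```

Usage would be something like `zpool iostat void 1 | find_hangs`. With a 1-second interval, the sample count is roughly the hang duration in seconds. A truly idle pool also prints zero rows, so the filter is only meaningful while some steady load is running.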
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
Hi,

On 19.06.2013 at 15:28, Ronald Klop wrote:
> First send more information about the system:
> - The content of /var/run/dmesg.boot.
> - Install /usr/ports/sysutils/zfs-stats and send the output of zfs-stats -a.
> - Send the output of zpool status + zpool list.

I wasn't sure whether I should put them all in this mail, so I've put them here:
http://pub.neveragain.de/arcsas/sysinfo.txt

> - Did you configure compression or dedup on the pool?
> - Do you keep a lot of snapshots?
> - Do you run a cronjob every minute which does something with the pool? Gathers statistics or something like that.

There's only a handful of datasets (three on one machine, six on the other), and currently no snapshots. No deduplication. Some datasets on one machine have compression; the other machine doesn't have compression turned on for any dataset. No minutely cronjobs, automated logons, or anything like that.

Thanks!
D.
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
On 19.06.2013 at 16:28, Steven Hartland wrote:
> Any timeouts show in /var/log/messages or in the areca event log?

System logs don't show anything suspicious. The Areca CLI utility's 'event info' output is empty as well.
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
On 19.06.2013 at 16:47, Steven Hartland wrote:
> I'm not familiar with that model of the areca but have you tried with the standard OS driver or does it not support that card?

The ARC1320 (non-RAID) unfortunately isn't supported by the in-tree driver.

> Also when you see hangs can you access the disk directly or not e.g.
> dd if=/dev/da0 of=/dev/null bs=1m count=10 ?

Interesting idea. The dd then hangs right until everything else resumes as well. ^T during hang says:

    load: 12.39 cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0% 1632k
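The same dd test can be repeated in a loop to catch the stalls as they happen. The following is a rough sketch rather than anything posted in the thread: the function names probe_read and watch_reads and the 2-second default threshold are assumptions chosen for illustration, and the block size is written as 1048576 instead of 1m so the script also runs with non-BSD dd.

```shell
#!/bin/sh
# probe_read (hypothetical helper): time one raw read and print elapsed seconds.
# $1 = device or file to read (e.g. /dev/da0), $2 = number of 1 MiB blocks (default 10).
probe_read() {
    start=$(date +%s)
    dd if="$1" of=/dev/null bs=1048576 count="${2:-10}" 2>/dev/null || return 1
    end=$(date +%s)
    echo $((end - start))
}

# watch_reads (hypothetical helper): probe once per second, log every slow read.
# $1 = device, $2 = threshold in whole seconds (default 2, an arbitrary choice).
watch_reads() {
    while :; do
        secs=$(probe_read "$1") || { echo "read failed on $1" >&2; return 1; }
        if [ "$secs" -ge "${2:-2}" ]; then
            echo "$(date '+%H:%M:%S') $1 read took ${secs}s"
        fi
        sleep 1
    done
}
```

For example, `watch_reads /dev/da0` in one terminal and `watch_reads /dev/ada1` in another would show whether only the arcsas-attached disk stalls. Timing has one-second resolution, and reads of the same blocks may be answered from a drive or controller cache, so a missing log line does not prove that no stall occurred.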
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
On 19.06.2013 at 17:16, Jeremy Chadwick j...@koitsu.org wrote:
> Which model of the ARC1320 are you using (there are 2).

It has four internal connectors, so it should be the ARC-1320ix-16. No port multipliers.

>>> Also when you see hangs can you access the disk directly or not e.g.
>>> dd if=/dev/da0 of=/dev/null bs=1m count=10 ?
>> Interesting idea. The dd then hangs right until everything else resumes as well. ^T during hang says:
>> load: 12.39 cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0% 1632k
> Is this *while* you have immense amounts of ZFS write I/O going to those drives (your zpool iostat was showing ~250-300MB/sec to the pool)? [...]

It's important to note that the interrupt spikes (and the I/O hangs) happen just as frequently on an idle system. Having a bunch of dd processes writing plus iostat just visualizes it better.

So, with or without actual write load: dd with if=/dev/daX (arcsas device) hangs while the interrupt counters for uhci0 soar during these ~10-second phases, as shown above.

Noteworthy: dd'ing from if=/dev/ada1 (onboard controller) during such a hang phase returns immediately, i.e. works fine. (ada1 is part of ZFS -- the other 'zroot' pool -- but is not an arcsas device, so a driver issue sounds more likely).

> Can you please try putting this in /boot/loader.conf + reboot and see if the behaviour for you changes?
> vfs.zfs.no_write_throttle=1

This produces quite interesting burst numbers, but does not affect the problem behaviour at all.

On 19.06.2013 at 17:10, Steven Hartland kill...@multiplay.co.uk wrote:
> You might want to try adding a separate disk (different type) to the controller which isn't used and perform the same test to try and eliminate disks as the source of the issue.

That's currently not an option, as the zpool already contains data; but I tried against a disk on another controller, see above.

> Also see what gstat -d shows during this? Do you see a big spike of activity either side?

The picture is pretty much the same as with zpool iostat: healthy values, all disks 70-100% busy; during a hang phase, every column just drops to zero -- except for L(q), which remains frozen at some low value for the duration of the hang (e.g. 4 or 10).

Sample outputs here: http://pub.neveragain.de/arcsas/gstat.txt

Thanks,
D.
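Since 'vmstat -i' reports totals since boot, the uhci0 spike is easier to quantify by diffing two snapshots taken around a hang. This is a minimal sketch under stated assumptions: the helper name irq_delta and the snapshot file paths are made up, and it assumes each 'vmstat -i' line ends with a total column followed by a rate column, with the interrupt name (which may contain spaces) in front.

```shell
#!/bin/sh
# irq_delta (hypothetical helper): print how much each interrupt counter grew
# between two saved `vmstat -i` snapshots. $1 = earlier file, $2 = later file.
irq_delta() {
    awk '
        # The total is the second-to-last field; the name is everything before it.
        NF >= 3 && $(NF-1) ~ /^[0-9]+$/ {
            name = $1
            for (i = 2; i <= NF - 2; i++) name = name " " $i
            if (FNR == NR) before[name] = $(NF-1)          # first snapshot
            else if (name in before) print name, $(NF-1) - before[name]
        }
    ' "$1" "$2"
}
```

A possible way to use it: `vmstat -i > /tmp/irq.before; sleep 10; vmstat -i > /tmp/irq.after; irq_delta /tmp/irq.before /tmp/irq.after`. Comparing the uhci0 and arcsas0 deltas for a window that contains a hang against a window that does not would show whether the two really move together.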