Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
Hi,

very regularly, we see I/O hangs for about 10 seconds, roughly once per minute. Each time this happens, the I/O rate simply drops to zero, and all disk access hangs; this is also very noticeable on the shell, for NFS clients etc. Everything else (networking, kernel, ...) seems to continue normally.

Environment: FreeBSD 9.1R GENERIC on amd64, using ZFS, on an ARC1320 PCIe with 24x Seagate ST33000650SS (3rd party arcsas.ko driver).

It's easy to observe these hangs under write load, e.g. with 'zpool iostat 1':

    void  22.4T  42.6T     34  2.73K  1.07M   293M
    void  22.4T  42.6T     20  2.74K   623K   289M
    void  22.4T  42.6T    144  2.62K  4.83M   279M
    void  22.4T  42.6T     13  2.60K   437K   283M
    void  22.4T  42.6T      0      0      0      0   -- hang starts
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0      0      0      0
    void  22.4T  42.6T      0    296  4.00K  34.2M   -- hang ends
    void  22.4T  42.6T      2  2.64K  73.8K   288M
    void  22.4T  42.6T      8  3.12K   278K   329M

Each time this happens, there is a completely unexplained spike of interrupts on uhci0: 'systat -vm' then displays numbers around 270k.

    # vmstat -i | grep -E '(arcsas|uhci0|Total)'
    irq16: uhci0     1227020890  67708
    irq24: arcsas0   12045211664
    Total            1266417827  69882

Things to note:
- Booting a USB-less kernel or disabling all USB in the BIOS doesn't change a thing (no interrupt spikes to be seen, but the hangs remain)
- The hangs / interrupt spikes happen just as often when the system is idle
- Board is a Supermicro x8dth
- There are two igb cards
- Root is ZFS as well (separate pool though)
- BIOS, Areca FW and driver already are latest versions
- Putting the controller into a different slot doesn't change the behaviour
- We have two identical systems and both show the exact same symptoms, so flaky hardware is probably not the issue

Any ideas would be appreciated.

Thanks,
D.
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
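The stall windows in the 'zpool iostat 1' output above can also be picked out mechanically. What follows is only a minimal sketch, assuming the usual seven-column per-pool layout (name, alloc, free, read ops, write ops, read bandwidth, write bandwidth) and one sample per second. The function name find_stalls and the default threshold of 3 samples are hypothetical and not from the thread.

```shell
#!/bin/sh
# find_stalls POOL [MIN_SAMPLES]
# Reads `zpool iostat POOL 1`-style lines on stdin and reports every run of
# at least MIN_SAMPLES consecutive samples with zero read and write operations.
# (find_stalls and the default of 3 are illustrative assumptions.)
find_stalls() {
    awk -v pool="$1" -v min="${2:-3}" '
        function report() {
            if (run >= min)
                printf "stall: %d samples starting at sample %d\n", run, start
        }
        $1 != pool { next }            # skip headers and separator lines
        { n++ }                        # count samples for this pool
        $4 == "0" && $5 == "0" {       # no read ops and no write ops
            if (run == 0) start = n
            run++
            next
        }
        { report(); run = 0 }          # activity resumed: close the run
        END { report() }               # input ended while still stalled
    '
}
```

Something like `zpool iostat void 1 | find_stalls void` would then print one line per hang, which makes it easier to check whether the interval between hangs really is constant.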
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
On Wed, 19 Jun 2013 15:01:14 +0200, Dennis Kögel d...@neveragain.de wrote: [...]
First send more information about the system:
- The content of /var/run/dmesg.boot.
- Install /usr/ports/sysutils/zfs-stats and send the output of zfs-stats -a.
- Send the output of zpool status + zpool list.
- Did you configure compression or dedup on the pool?
- Do you keep a lot of snapshots?
- Do you run a cronjob every minute which does something with the pool? Gathers statistics or something like that.

Ronald.
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
Hi,

On 19.06.2013 at 15:28, Ronald Klop wrote:
> First send more information about the system:
> - The content of /var/run/dmesg.boot.
> - Install /usr/ports/sysutils/zfs-stats and send the output of zfs-stats -a.
> - Send the output of zpool status + zpool list.

Not sure if I should put them all in this mail, so I've put them here:
http://pub.neveragain.de/arcsas/sysinfo.txt

> - Did you configure compression or dedup on the pool?
> - Do you keep a lot of snapshots?
> - Do you run a cronjob every minute which does something with the pool? Gathers statistics or something like that.

There's only a handful of datasets (three on one machine, six on the other), and currently no snapshots. No deduplication. Some datasets on one machine have compression; the other machine doesn't have compression turned on for any dataset.

No minutely cronjobs, no automated logons, nothing like that.

Thanks!
D.
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
Any timeouts show in /var/log/messages or in the areca event log?

This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmas...@multiplay.co.uk.
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
On 19.06.2013 at 16:28, Steven Hartland wrote:
> Any timeouts show in /var/log/messages or in the areca event log?

System logs don't show anything suspicious. The event info in the Areca CLI utility is empty as well.
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
----- Original Message -----
From: Dennis Kögel d...@neveragain.de
> On 19.06.2013 at 16:28, Steven Hartland wrote:
>> Any timeouts show in /var/log/messages or in the areca event log?
> System logs don't show anything suspicious. Areca CLI utility - event info is empty as well.

I'm not familiar with that model of the Areca, but have you tried the standard OS driver, or does it not support that card?

Also, when you see hangs, can you still access the disk directly? E.g.:

    dd if=/dev/da0 of=/dev/null bs=1m count=10

Regards
Steve
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
On 19.06.2013 at 16:47, Steven Hartland wrote:
> I'm not familiar with that model of the Areca, but have you tried the standard OS driver, or does it not support that card?

The ARC1320 (non-RAID) unfortunately isn't supported by the in-tree driver.

> Also, when you see hangs, can you still access the disk directly? E.g.:
> dd if=/dev/da0 of=/dev/null bs=1m count=10

Interesting idea. The dd then hangs right until everything else resumes as well. ^T during the hang says:

    load: 12.39 cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0% 1632k
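The dd test above can be left running in a loop so that every stall gets a timestamp. This is a rough sketch only, assuming a POSIX sh and `date +%s`, which limits the resolution to whole seconds. The function name timed_probe and the threshold argument are hypothetical; /dev/da0 and the dd arguments are the ones from the thread.

```shell
#!/bin/sh
# timed_probe THRESHOLD_SECONDS COMMAND [ARGS...]
# Runs COMMAND once, measures its wall-clock time with one-second
# resolution, and prints a timestamped line only if the command took
# at least THRESHOLD_SECONDS. (timed_probe is an illustrative name.)
# POSIX sh has no local variables, so the names below are global.
timed_probe() {
    threshold=$1
    shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1               # output of the probe is not needed
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -ge "$threshold" ]; then
        echo "$(date '+%H:%M:%S') slow: ${elapsed}s: $*"
    fi
}
```

A loop such as `while :; do timed_probe 2 dd if=/dev/da0 of=/dev/null bs=1m count=10; sleep 1; done` should then stay silent while the device answers and log one line per hang. Running a second loop against a disk on another controller gives a direct comparison.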
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
----- Original Message -----
From: Dennis Kögel d...@neveragain.de
>> I'm not familiar with that model of the Areca, but have you tried the standard OS driver, or does it not support that card?
> The ARC1320 (non-RAID) unfortunately isn't supported by the in-tree driver.
>> Also, when you see hangs, can you still access the disk directly? E.g.:
>> dd if=/dev/da0 of=/dev/null bs=1m count=10
> Interesting idea. The dd then hangs right until everything else resumes as well. ^T during the hang says:
> load: 12.39 cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0% 1632k

So it sounds like you're seeing device-level hangs, which indicates either a driver, HW, controller FW or disk-level issue.

You might want to try adding a separate disk (of a different type) to the controller which isn't used, and perform the same test to try to eliminate the disks as the source of the issue.

Also, see what gstat -d shows during this. Do you see a big spike of activity on either side of the hang?

Regards
Steve
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
On Wed, Jun 19, 2013 at 05:02:20PM +0200, Dennis Kögel wrote:
> On 19.06.2013 at 16:47, Steven Hartland wrote:
>> I'm not familiar with that model of the Areca, but have you tried the standard OS driver, or does it not support that card?
> The ARC1320 (non-RAID) unfortunately isn't supported by the in-tree driver.

Which model of the ARC1320 are you using? (There are 2.) I'm having trouble understanding their chart too:

http://www.areca.us/products/sasnoneraid6g.htm

The controllers claim to support up to 128 disks via break-out cables, but I'm not sure. You aren't using any port multipliers, are you?

>> Also, when you see hangs, can you still access the disk directly? E.g.:
>> dd if=/dev/da0 of=/dev/null bs=1m count=10
> Interesting idea. The dd then hangs right until everything else resumes as well. ^T during the hang says:
> load: 12.39 cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0% 1632k

Is this *while* you have immense amounts of ZFS write I/O going to those drives (your zpool iostat was showing ~250-300MB/sec to the pool)? It's very important to note that the stats you showed were during writes.

What we're trying to figure out here is where the blocking (waiting) is happening:

a) the ZFS layer
b) the storage driver layer ('arcsas', the 3rd-party unofficial driver)
c) the CAM layer
d) the GEOM layer
e) something with the disk(s)
f) something with memory I/O going on (say between the storage driver and ZFS, for lack of a better way to phrase it)

I have a very big Email written for you, but I wanted to let certain answers to Ronald's questions come out first.

    -rw---1 jdc users 5576 Jun 19 06:49 dennis_kgel_response.txt

I need to re-word this and take into consideration some of the new stuff said up to now, but I don't know if I'll have the time for this (you should see my desktop right now, I have literally 4 IM messages to answer and my Email box is non-stop).
The one I want to get out of the way right now is this: can you please try putting this in /boot/loader.conf + reboot and see if the behaviour changes for you?

    vfs.zfs.no_write_throttle=1

Warning: this may actually exacerbate the problem, depending on what the nature/root cause is. Right now I'm of the opinion ZFS is actually doing the Right Thing(tm) and that the issue may be in Areca's driver, but that's speculation until I have proof. But the write throttling stuff added semi-recently (by the Illumos folks, this is not a FreeBSD feature) has had some reports of problems where disabling it helped immensely.

Important: 24 disks off a single controller is a lot of bandwidth. That controller may be overwhelmed, in which case you would see exactly this kind of behaviour as the controller is screaming GOD HELP ME, I'M TRYING TO DO ALL THIS STUFF AND YOU KEEP THROWING I/O AT ME. :-) This is also why I ask about port multiplier usage.

--
| Jeremy Chadwick                                j...@koitsu.org |
| UNIX Systems Administrator              http://jdc.koitsu.org/ |
| Making life hard for others since 1977.           PGP 4BD6C0CB |
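Adding the tunable suggested above by hand works fine. For repeated experiments it can help to set it idempotently, so that toggling the value never leaves duplicate lines behind. The following is a sketch under assumptions: the helper name set_loader_tunable is hypothetical, it takes the file path as an argument so it can be tried on a copy before touching /boot/loader.conf, and it writes the value in the quoted name="value" form that loader.conf entries conventionally use.

```shell
#!/bin/sh
# set_loader_tunable FILE NAME VALUE
# Ensures FILE contains exactly one NAME="VALUE" line, replacing any
# existing assignment of NAME and leaving all other lines untouched.
# (set_loader_tunable is an illustrative helper, not a FreeBSD tool.)
set_loader_tunable() {
    file=$1 name=$2 value=$3
    tmp=$(mktemp "${TMPDIR:-/tmp}/loader.XXXXXX") || return 1
    # Copy every line that does not start with "NAME=" (fixed-string match).
    if [ -f "$file" ]; then
        awk -v key="$name=" 'index($0, key) != 1' "$file" > "$tmp"
    fi
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    # Overwrite in place so the ownership and mode of FILE are kept.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

For example, `set_loader_tunable /boot/loader.conf vfs.zfs.no_write_throttle 1` followed by a reboot. Afterwards `sysctl vfs.zfs.no_write_throttle` should report the new value, provided the running kernel exports that name.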
Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
On 19.06.2013 at 17:16, Jeremy Chadwick j...@koitsu.org wrote:
> Which model of the ARC1320 are you using? (There are 2.)

It has four internal connectors, so it should be the ARC-1320ix-16. No port multipliers.

>>> Also, when you see hangs, can you still access the disk directly? E.g.:
>>> dd if=/dev/da0 of=/dev/null bs=1m count=10
>> Interesting idea. The dd then hangs right until everything else resumes as well. ^T during the hang says:
>> load: 12.39 cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0% 1632k
> Is this *while* you have immense amounts of ZFS write I/O going to those drives (your zpool iostat was showing ~250-300MB/sec to the pool)? [...]

It's important to note that the interrupt spikes (and the I/O hangs) happen just as frequently on an idle system. Having a bunch of dd processes writing + iostat just visualizes it better.

So, with or without actual write load: dd with if=/dev/daX (arcsas device) hangs when the interrupt counters for uhci0 soar for these ~10-second phases, as shown above.

Noteworthy: dd'ing from if=/dev/ada1 (onboard controller) during such a hang phase returns immediately, i.e. works fine. (ada1 is part of ZFS -- the other 'zroot' pool -- but is not an arcsas device, so a driver issue sounds more likely.)

> Can you please try putting this in /boot/loader.conf + reboot and see if the behaviour changes for you?
> vfs.zfs.no_write_throttle=1

This produces quite interesting burst numbers, but does not affect the problem behaviour at all.

On 19.06.2013 at 17:10, Steven Hartland kill...@multiplay.co.uk wrote:
> You might want to try adding a separate disk (of a different type) to the controller which isn't used, and perform the same test to try to eliminate the disks as the source of the issue.

That's currently not an option, as the zpool already contains data; but I tried against a disk on another controller, see above.

> Also, see what gstat -d shows during this. Do you see a big spike of activity on either side of the hang?
The picture is pretty much the same as with zpool iostat: healthy values, all disks from 70-100% busy; during a hang phase, every column just drops to zero -- except for L(q), which remains frozen at some low value for the duration of the hang (e.g. 4 or 10).

Sample outputs here: http://pub.neveragain.de/arcsas/gstat.txt

Thanks,
D.
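To see which interrupt counters actually move during one of these hang phases, two 'vmstat -i' snapshots taken a few seconds apart can be compared. This is a minimal sketch, assuming the usual layout where the last two columns are total and rate and everything before them is the source name. The function name irq_delta and the snapshot file names in the usage note are hypothetical.

```shell
#!/bin/sh
# irq_delta BEFORE AFTER
# BEFORE and AFTER are files holding `vmstat -i` output. For each interrupt
# source present in both, prints its name and how much its total grew.
# (irq_delta is an illustrative name, not an existing tool.)
irq_delta() {
    awk '
        # Source name = all fields except the last two (total, rate).
        function src(   i, s) {
            s = $1
            for (i = 2; i <= NF - 2; i++) s = s " " $i
            return s
        }
        NF < 3 || $(NF-1) !~ /^[0-9]+$/ { next }    # skip the header line
        FNR == NR { before[src()] = $(NF-1); next } # first file: remember totals
        (src() in before) {                         # second file: print growth
            printf "%s %d\n", src(), $(NF-1) - before[src()]
        }
    ' "$1" "$2"
}
```

Usage could look like `vmstat -i > /tmp/before; sleep 10; vmstat -i > /tmp/after; irq_delta /tmp/before /tmp/after`, once across a hang and once during normal operation, so the two sets of deltas can be compared side by side.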