Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Dennis Kögel
Hi,

We regularly see I/O hangs lasting about 10 seconds, roughly once per 
minute.

Each time this happens, the I/O rate simply drops to zero, and all disk access 
hangs; this is also very noticeable on the shell, for NFS clients etc. 
Everything else (networking, kernel, …) seems to continue normally.

Environment: FreeBSD 9.1R GENERIC on amd64, using ZFS, on an ARC1320 PCIe with 
24x Seagate ST33000650SS (3rd party arcsas.ko driver).

It's easy to observe these hangs under write load, e.g. with 'zpool iostat 1':

void  22.4T  42.6T     34  2.73K  1.07M   293M
void  22.4T  42.6T     20  2.74K   623K   289M
void  22.4T  42.6T    144  2.62K  4.83M   279M
void  22.4T  42.6T     13  2.60K   437K   283M
void  22.4T  42.6T      0      0      0      0  -- hang starts
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0    296  4.00K  34.2M  -- hang ends
void  22.4T  42.6T      2  2.64K  73.8K   288M
void  22.4T  42.6T      8  3.12K   278K   329M
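
A minimal sketch for spotting such stalls without watching the terminal, assuming the default seven-column 'zpool iostat' layout (pool, alloc, free, read ops, write ops, read bandwidth, write bandwidth). The function name find_stalls is made up for illustration; it only counts consecutive samples with zero read and zero write operations.

```sh
#!/bin/sh
# find_stalls POOL: read 'zpool iostat POOL 1'-style lines on stdin and
# report each run of consecutive samples with 0 read ops and 0 write ops.
# Assumed column order: pool alloc free read_ops write_ops read_bw write_bw
find_stalls() {
    awk -v pool="$1" '
        $1 != pool { next }            # skip header and separator lines
        $4 == "0" && $5 == "0" {       # idle sample: extend the current run
            run++
            next
        }
        {                              # active sample: close any open run
            if (run > 0)
                printf "stall: %d consecutive idle samples\n", run
            run = 0
        }
        END {
            if (run > 0)
                printf "stall: %d consecutive idle samples (still ongoing)\n", run
        }
    '
}
```

Usage would be something like 'zpool iostat void 1 | find_stalls void'. On an idle pool all-zero samples are normal, so the output is only meaningful under steady write load.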

Each time this happens, there is a completely unexplained spike of interrupts 
on uhci0: 'systat -vm' then displays numbers around 270k.

# vmstat -i | grep -E '(arcsas|uhci0|Total)'
irq16: uhci0     1227020890  67708
irq24: arcsas0     12045211    664
Total            1266417827  69882
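
The rate column of 'vmstat -i' is an average since boot, so a 10-second burst barely moves it. A rough sketch that prints how much one source's counter grew between consecutive snapshots, assuming the total is the second-to-last field of each line (irq_delta is a hypothetical helper name):

```sh
#!/bin/sh
# irq_delta NAME: read repeated 'vmstat -i' snapshots on stdin and print,
# for every snapshot after the first, how much NAME's total increased.
# NAME is matched as a substring of the line (e.g. uhci0).
irq_delta() {
    awk -v name="$1" '
        index($0, name) && NF >= 3 {
            total = $(NF - 1)          # assumed: total is second-to-last field
            if (seen)
                printf "%d\n", total - prev
            prev = total
            seen = 1
        }
    '
}
```

One possible way to feed it: 'while :; do vmstat -i; sleep 1; done | irq_delta uhci0'. With one snapshot per second, each printed number approximates the interrupts per second during that interval.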

Things to note:

- Booting a USB-less kernel or disabling all USB in the BIOS doesn't change 
anything (the interrupt spikes disappear, but the hangs remain)
- The hangs / interrupt spikes happen just as often when the system is idle
- Board is a Supermicro x8dth
- There are two igb cards
- Root is ZFS as well (separate pool though)
- BIOS, Areca FW and driver already are latest versions
- Moving the controller to a different slot doesn't change the behaviour
- We have two identical systems and both show the exact same symptoms, so flaky 
hardware is probably not the issue

Any ideas would be appreciated.

Thanks,
D.
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Dennis Kögel
Hi,

On 19.06.2013 at 15:28, Ronald Klop wrote:
> First send more information about the system:
> - The content of /var/run/dmesg.boot.
> - Install /usr/ports/sysutils/zfs-stats and send the output of zfs-stats -a.
> - Send the output of zpool status + zpool list.

I wasn't sure whether to put them all in this mail, so I've put them here:

http://pub.neveragain.de/arcsas/sysinfo.txt

> - Did you configure compression or dedup on the pool?
> - Do you keep a lot of snapshots?
> - Do you run a cronjob every minute which does something with the pool?
>   Gathers statistics or something like that.

There's only a handful of datasets (three on one machine, six on the other), 
and currently no snapshots. No deduplication.
Some datasets on one machine have compression, the other machine doesn't have 
compression turned on for any dataset.

No per-minute cron jobs, no automated logins, nothing of the sort.

Thanks!
D.


Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Dennis Kögel
On 19.06.2013 at 16:28, Steven Hartland wrote:
> Any timeouts show in /var/log/messages or in the areca event log?

System logs don't show anything suspicious.

Areca CLI utility - event info is empty as well.



Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Dennis Kögel
On 19.06.2013 at 16:47, Steven Hartland wrote:
> I'm not familiar with that model of the areca but have you tried
> with the standard OS driver or does it not support that card?

The ARC1320 (non-raid) unfortunately isn't supported by the in-tree driver.

> Also when you see hangs can you access the disk directly or not
> e.g. dd if=/dev/da0 of=/dev/null bs=1m count=10 ?

Interesting idea. The dd then hangs right until everything else resumes as well.

^T during hang says: load: 12.39  cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0% 1632k



Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Dennis Kögel
On 19.06.2013 at 17:16, Jeremy Chadwick j...@koitsu.org wrote:
> Which model of the ARC1320 are you using (there are 2)?

It has four internal connectors, so it should be the ARC-1320ix-16.

No port multipliers.

> > > Also when you see hangs can you access the disk directly or not
> > > e.g. dd if=/dev/da0 of=/dev/null bs=1m count=10 ?
> >
> > Interesting idea. The dd then hangs right until everything else resumes as
> > well.
> >
> > ^T during hang says: load: 12.39  cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0% 1632k
>
> Is this **while** you have immense amounts of ZFS write I/O going to
> those drives (your zpool iostat was showing ~250-300MB/sec to the pool)?
> [...]

It's important to note that the interrupt spikes (and the I/O hangs) happen 
just as frequently on an idle system.
Having a bunch of dd processes writing + iostat just visualizes it better.

So, with or without actual write load: dd with if=/dev/daX (arcsas device) 
hangs when the interrupt counters for uhci0 soar for these ~10 seconds phases, 
as shown above.

Noteworthy: dd'ing from if=/dev/ada1 (onboard controller) during such a hang 
phase returns immediately, i.e. works fine. (ada1 is part of ZFS -- the other 
'zroot' pool -- but is not an arcsas device, so a driver issue sounds more 
likely).
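
The comparison between the two controllers could be repeated in a loop along these lines. This is only a sketch: probe_read is a hypothetical helper, the block size is given in bytes for portability, and timing has one-second resolution.

```sh
#!/bin/sh
# probe_read PATH: time a single 1 MiB read from PATH and print
# "PATH <elapsed whole seconds>".
probe_read() {
    start=$(date +%s)
    dd if="$1" of=/dev/null bs=1048576 count=1 2>/dev/null
    end=$(date +%s)
    echo "$1 $((end - start))"
}
```

For example: 'while :; do probe_read /dev/da0; probe_read /dev/ada1; sleep 1; done'. During a hang the da0 probe would be expected to block and report several seconds, while the ada1 probe should keep reporting 0.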

> Can you please try putting this in /boot/loader.conf + reboot and
> see if the behaviour for you changes?
>
> vfs.zfs.no_write_throttle=1

This produces quite interesting burst numbers, but does not affect the problem 
behaviour at all.

On 19.06.2013 at 17:10, Steven Hartland kill...@multiplay.co.uk wrote:
> You might want to try adding a separate disk (different type)
> to the controller which isn't used and perform the same test to
> try and eliminate disks as the source of the issue.

That's currently not an option, as the zpool already contains data; but I tried 
against a disk on another controller, see above.

> Also see what gstat -d shows during this? Do you see a big spike
> of activity either side?

The picture is pretty much the same as with zpool iostat: Healthy values, all 
disks from 70-100% busy; during a hang phase, every column just drops to zero 
-- except for L(q), which remains frozen at some low value for the duration of 
the hang (e.g. 4 or 10).
Sample outputs here: http://pub.neveragain.de/arcsas/gstat.txt

Thanks,
D.