Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Dennis Kögel
Hi,

we regularly see I/O hangs lasting about 10 seconds, roughly once per
minute.

Each time this happens, the I/O rate simply drops to zero, and all disk access 
hangs; this is also very noticeable on the shell, for NFS clients etc. 
Everything else (networking, kernel, …) seems to continue normally.

Environment: FreeBSD 9.1R GENERIC on amd64, using ZFS, on an Areca ARC1320 PCIe
controller with 24x Seagate ST33000650SS (3rd party arcsas.ko driver).

It's easy to observe these hangs under write load, e.g. with 'zpool iostat 1':

       capacity     operations    bandwidth
pool  alloc   free   read  write   read  write
void  22.4T  42.6T     34  2.73K  1.07M   293M
void  22.4T  42.6T     20  2.74K   623K   289M
void  22.4T  42.6T    144  2.62K  4.83M   279M
void  22.4T  42.6T     13  2.60K   437K   283M
void  22.4T  42.6T      0      0      0      0  -- hang starts
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0      0      0      0
void  22.4T  42.6T      0    296  4.00K  34.2M  -- hang ends
void  22.4T  42.6T      2  2.64K  73.8K   288M
void  22.4T  42.6T      8  3.12K   278K   329M
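
For longer observation runs, stalls like the one above could be flagged automatically instead of being spotted by eye. The following is only a rough sketch, assuming one sample per second from 'zpool iostat void 1' and seven whitespace-separated fields per data row. The function names and the three-sample threshold are made up for illustration.

```python
# Sketch: find runs of all-zero samples in `zpool iostat <pool> 1` output.
# Assumed row layout: pool alloc free read_ops write_ops read_bw write_bw

SUFFIXES = {"K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}  # rough scale only


def to_number(field):
    """Convert a field such as '2.73K' or '0' to a float."""
    if field[-1] in SUFFIXES:
        return float(field[:-1]) * SUFFIXES[field[-1]]
    return float(field)


def parse_row(line):
    """Return [read_ops, write_ops, read_bw, write_bw], or None for non-data rows."""
    fields = line.split()
    if len(fields) < 7:
        return None
    try:
        return [to_number(f) for f in fields[3:7]]
    except ValueError:
        return None  # header or separator row


def stall_windows(lines, min_len=3):
    """Return (first_sample, length) for every run of >= min_len idle samples."""
    rows = [r for r in map(parse_row, lines) if r is not None]
    windows, start = [], None
    for i, row in enumerate(rows + [[1.0]]):  # sentinel closes a trailing run
        if not any(row):
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                windows.append((start, i - start))
            start = None
    return windows


# Samples taken from the output above (annotations removed).
BUSY_BEFORE = [
    "void  22.4T  42.6T     34  2.73K  1.07M   293M",
    "void  22.4T  42.6T     20  2.74K   623K   289M",
    "void  22.4T  42.6T    144  2.62K  4.83M   279M",
    "void  22.4T  42.6T     13  2.60K   437K   283M",
]
IDLE = ["void  22.4T  42.6T      0      0      0      0"] * 8
BUSY_AFTER = [
    "void  22.4T  42.6T      0    296  4.00K  34.2M",
    "void  22.4T  42.6T      2  2.64K  73.8K   288M",
    "void  22.4T  42.6T      8  3.12K   278K   329M",
]
SAMPLE = BUSY_BEFORE + IDLE + BUSY_AFTER

print(stall_windows(SAMPLE))  # prints [(4, 8)]
```

With one sample per second, the result reads as an 8-second stall starting at the fifth sample.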

Each time this happens, there is a completely unexplained spike of interrupts 
on uhci0: 'systat -vm' then displays numbers around 270k.

# vmstat -i | grep -E '(arcsas|uhci0|Total)'
irq16: uhci0      1227020890      67708
irq24: arcsas0      12045211        664
Total             1266417827      69882
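
To put these counters in proportion, the share of all interrupts attributed to uhci0 can be computed from the two unambiguous rows. This is a small sketch under the assumption that each 'vmstat -i' row ends in a total and a rate column; parse_vmstat_i is a hypothetical helper name. Note that the rate column is an average since boot, so it smooths out the periodic spikes.

```python
# Sketch: compute the interrupt share of one source from `vmstat -i` rows.
# Assumption: each data row ends with two integer columns, total and rate.

def parse_vmstat_i(text):
    """Map interrupt source name -> (total, rate)."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[-2].isdigit() and fields[-1].isdigit():
            name = " ".join(fields[:-2]).rstrip(":")
            counters[name] = (int(fields[-2]), int(fields[-1]))
    return counters


# Rows copied from the output above.
OUTPUT = """\
irq16: uhci0  1227020890  67708
Total 1266417827  69882
"""

counters = parse_vmstat_i(OUTPUT)
uhci_total, _ = counters["irq16: uhci0"]
grand_total, grand_rate = counters["Total"]

share = uhci_total / grand_total
uptime_hours = grand_total / grand_rate / 3600  # total / average rate = uptime

print(f"uhci0 share of all interrupts: {share:.1%}")
print(f"approximate uptime: {uptime_hours:.1f} h")
```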

Things to note:

- Booting a USB-less kernel or disabling all USB in the BIOS doesn't change a
thing (no interrupt spikes to be seen, but the hangs remain)
- The hangs / interrupt spikes happen just as often when the system is idle
- Board is a Supermicro X8DTH
- There are two igb cards
- Root is on ZFS as well (separate pool though)
- BIOS, Areca FW and driver are already at the latest versions
- Moving the controller to a different slot doesn't change the behaviour
- We have two identical systems and both show the exact same symptoms, so flaky
hardware is probably not the issue

Any ideas would be appreciated.

Thanks,
D.
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Ronald Klop

On Wed, 19 Jun 2013 15:01:14 +0200, Dennis Kögel d...@neveragain.de wrote:
> [...]

First, please send more information about the system:
- The content of /var/run/dmesg.boot.
- Install /usr/ports/sysutils/zfs-stats and send the output of zfs-stats -a.
- Send the output of zpool status + zpool list.
- Did you configure compression or dedup on the pool?
- Do you keep a lot of snapshots?
- Do you run a cronjob every minute which does something with the pool,
  e.g. gathering statistics or something like that?


Ronald.

Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Dennis Kögel
Hi,

On 19.06.2013 at 15:28, Ronald Klop wrote:
> First send more information about the system:
> - The content of /var/run/dmesg.boot.
> - Install /usr/ports/sysutils/zfs-stats and send the output of zfs-stats -a.
> - Send the output of zpool status + zpool list.

I wasn't sure whether to put them all in this mail, so I've put them here:

http://pub.neveragain.de/arcsas/sysinfo.txt

> - Did you configure compression or dedup on the pool?
> - Do you keep a lot of snapshots?
> - Do you run a cronjob every minute which does something with the pool?
> Gathers statistics or something like that.

There's only a handful of datasets (three on one machine, six on the other), 
and currently no snapshots. No deduplication.
Some datasets on one machine have compression, the other machine doesn't have 
compression turned on for any dataset.

No minutely cronjobs, no automated logins, nothing of the kind.

Thanks!
D.


Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Steven Hartland

Any timeouts show in /var/log/messages or in the areca event log?

This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 


In the event of misdirection, illegible or incomplete transmission please 
telephone +44 845 868 1337
or return the E.mail to postmas...@multiplay.co.uk.



Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Dennis Kögel
On 19.06.2013 at 16:28, Steven Hartland wrote:
> Any timeouts show in /var/log/messages or in the areca event log?

System logs don't show anything suspicious.

The event info in the Areca CLI utility is empty as well.



Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Steven Hartland
I'm not familiar with that model of the Areca, but have you tried
with the standard OS driver, or does it not support that card?

Also, when you see hangs, can you still access the disk directly,
e.g. with dd if=/dev/da0 of=/dev/null bs=1m count=10?

   Regards
   Steve 






Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Dennis Kögel
On 19.06.2013 at 16:47, Steven Hartland wrote:
> I'm not familiar with that model of the areca but have you tried
> with the standard OS driver or does it not support that card?

The ARC1320 (non-RAID) unfortunately isn't supported by the in-tree driver.

> Also when you see hangs can you access the disk directly or not
> e.g. dd if=/dev/da0 of=/dev/null bs=1m count=10 ?

Interesting idea. The dd then hangs until everything else resumes as well.

^T during the hang says:
load: 12.39  cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0% 1632k
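
The same raw-device check could be scripted so that stalled reads are recorded rather than watched by hand. This is a minimal sketch, assuming read access to the device node; probe_reads and stalled are hypothetical helper names, and the one-second threshold is arbitrary.

```python
# Sketch: time sequential block reads from a device (or file) and report
# which reads took suspiciously long. Bypasses ZFS when run on /dev/daX.
import os
import time


def probe_reads(path, block_size=1 << 20, count=10, pause=0.0):
    """Read up to `count` blocks from `path`; return seconds taken per read."""
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for i in range(count):
            start = time.monotonic()
            data = os.pread(fd, block_size, i * block_size)
            elapsed = time.monotonic() - start
            if not data:
                break  # end of device or file
            latencies.append(elapsed)
            time.sleep(pause)
    finally:
        os.close(fd)
    return latencies


def stalled(latencies, threshold=1.0):
    """Return the indices of reads that took at least `threshold` seconds."""
    return [i for i, seconds in enumerate(latencies) if seconds >= threshold]
```

For example, stalled(probe_reads("/dev/da0", count=120, pause=1.0)) would cover roughly two minutes, which should span at least one hang if they recur about once per minute.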



Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Steven Hartland

So it sounds like you're seeing device-level hangs, which indicates
either a driver, HW, controller FW or disk-level issue.

You might want to try adding a separate disk (of a different type)
to the controller which isn't in use and perform the same test, to
try to eliminate the disks as the source of the issue.

Also, what does gstat -d show during this? Do you see a big spike
of activity either side?

   Regards
   Steve 






Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Jeremy Chadwick
On Wed, Jun 19, 2013 at 05:02:20PM +0200, Dennis Kögel wrote:
> On 19.06.2013 at 16:47, Steven Hartland wrote:
> > I'm not familiar with that model of the areca but have you tried
> > with the standard OS driver or does it not support that card?
>
> The ARC1320 (non-raid) unfortunately isn't supported by the in-tree driver.

Which model of the ARC1320 are you using (there are 2)?  I'm having
trouble understanding their chart too:

http://www.areca.us/products/sasnoneraid6g.htm

I ask because the controllers claim to support up to 128 disks via
break-out cables, but I'm not sure about that.

You aren't using any port multipliers, are you?

> > Also when you see hangs can you access the disk directly or not
> > e.g. dd if=/dev/da0 of=/dev/null bs=1m count=10 ?
>
> Interesting idea. The dd then hangs right until everything else resumes as
> well.
>
> ^T during hang says: load: 12.39  cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0%
> 1632k

Is this *while* you have immense amounts of ZFS write I/O going to
those drives (your zpool iostat was showing ~250-300MB/sec to the pool)?

It's very important to note that the stats you showed were during
writes.

What we're trying to figure out here is where the blocking (waiting) is
happening:

a) the ZFS layer
b) the storage driver layer ('arcsas', the 3rd-party unofficial driver)
c) the CAM layer
d) the GEOM layer
e) something with the disk(s)
f) something with memory I/O going on (say between the storage driver
   and ZFS, for lack of better way to phrase it)

I have a very big Email written for you, but I wanted to let certain
answers to Ronald's questions come out first.

-rw-------  1 jdc  users  5576 Jun 19 06:49 dennis_kgel_response.txt

I need to re-word this and take into consideration some of the new stuff
said up to now, but I don't know if I'll have the time for this (you
should see my desktop right now, I have literally 4 IM messages to
answer and my Email box is non-stop).

The one I want to get out of the way right now is this:

Can you please try putting this in /boot/loader.conf + reboot and
see if the behaviour for you changes?

vfs.zfs.no_write_throttle=1

Warning: this may actually make the problem worse, depending on
what the nature/root cause is.  Right now I'm of the opinion ZFS is
actually doing the Right Thing(tm) and that the issue may be in Areca's
driver, but that's speculation until I have proof.  But the write throttling
stuff added semi-recently (by the Illumos folks, this is not a FreeBSD
feature) has had some reports of problems where disabling it helped
immensely.

Important: 24 disks off a single controller is a lot of bandwidth.
That controller may be overwhelmed, in which case you would see
exactly this kind of behaviour as the controller is screaming GOD HELP
ME, I'M TRYING TO DO ALL THIS STUFF AND YOU KEEP THROWING I/O AT ME.
:-)  This is also why I ask about port multiplier usage.

-- 
| Jeremy Chadwick                                  j...@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |



Re: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)

2013-06-19 Thread Dennis Kögel
On 19.06.2013 at 17:16, Jeremy Chadwick j...@koitsu.org wrote:
> Which model of the ARC1320 are you using (there are 2).

It has four internal connectors, so it should be the ARC-1320ix-16.

No port multipliers.

> > > Also when you see hangs can you access the disk directly or not
> > > e.g. dd if=/dev/da0 of=/dev/null bs=1m count=10 ?
> >
> > Interesting idea. The dd then hangs right until everything else resumes as
> > well.
> >
> > ^T during hang says: load: 12.39  cmd: dd 7847 [physrd] 6.36r 0.00u 0.00s 0%
> > 1632k
>
> Is this *while* you have immense amounts of ZFS write I/O going to
> those drives (your zpool iostat was showing ~250-300MB/sec to the pool)?
> [...]

It's important to note that the interrupt spikes (and the I/O hangs) happen
just as frequently on an idle system.
Having a bunch of dd processes writing, plus iostat, just visualizes it better.

So, with or without actual write load: dd with if=/dev/daX (arcsas device)
hangs while the interrupt counters for uhci0 soar during these ~10-second
phases, as shown above.

Noteworthy: dd'ing from if=/dev/ada1 (onboard controller) during such a hang
phase returns immediately, i.e. works fine. (ada1 is part of ZFS -- the other
'zroot' pool -- but is not an arcsas device, so a driver issue sounds more
likely.)

> Can you please try putting this in /boot/loader.conf + reboot and
> see if the behaviour for you changes?
>
> vfs.zfs.no_write_throttle=1

This produces quite interesting burst numbers, but does not affect the problem 
behaviour at all.

On 19.06.2013 at 17:10, Steven Hartland kill...@multiplay.co.uk wrote:
> You might want to try adding a separate disk (different type)
> to the controller which isn't used and perform the same test to
> try and eliminate disks as the source of the issue.

That's currently not an option, as the zpool already contains data; but I tried 
against a disk on another controller, see above.

> Also see what gstat -d shows during this? Do you see a big spike
> of activity either side?

The picture is pretty much the same as with zpool iostat: Healthy values, all 
disks from 70-100% busy; during a hang phase, every column just drops to zero 
-- except for L(q), which remains frozen at some low value for the duration of 
the hang (e.g. 4 or 10).
Sample outputs here: http://pub.neveragain.de/arcsas/gstat.txt
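
One possible reading of this pattern is that requests have been issued but are not completing, as opposed to nothing being issued at all. A rough sketch of that distinction follows, using only the L(q) and ops/s columns. The sample values are hypothetical and merely mimic the description above; the real output is in the linked file.

```python
# Sketch: separate "nothing to do" from "requests queued but not completing".
# Each sample is (queue_length, ops_per_sec) for one device and one interval.

def classify(queue_len, ops_per_sec):
    """Label one gstat-style sample as 'active', 'stalled' or 'idle'."""
    if ops_per_sec > 0:
        return "active"
    return "stalled" if queue_len > 0 else "idle"


def longest_stall(samples):
    """Return the longest run of consecutive 'stalled' samples."""
    longest = current = 0
    for queue_len, ops in samples:
        current = current + 1 if classify(queue_len, ops) == "stalled" else 0
        longest = max(longest, current)
    return longest


# Hypothetical one-second samples: busy, then L(q) frozen at 4 with no ops.
SAMPLES = [(6, 180), (5, 175)] + [(4, 0)] * 9 + [(7, 190), (0, 0)]

print(longest_stall(SAMPLES))  # prints 9
```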

Thanks,
D.