So, I'm still having problems with intermittent hangs on write with my ZFS 
pool.  Details from my original post are below.  Since posting that, I've gone 
back and forth with a number of you, and gotten a lot of useful advice, but I'm 
still trying to get to the root of the problem so I can correct it.  Since the 
original post I have:

-Gathered a great deal of information in the form of kernel thread dumps, 
zio_state dumps, and live crash dumps while the problem is happening.
-Been advised that my ruling out of dedupe was probably premature, as I still 
likely have a good deal of deduplicated data on-disk.
-Checked just about every log and counter that might indicate a hardware error, 
without finding one.

I was wondering at this point if someone could give me some pointers on the 
following:
1. Given the dumps and diagnostic data I've gathered so far, is there a way I 
can determine for certain where in the ZFS driver I'm spending so much time 
hanging?  At the very least I'd like to try to determine whether it is, in fact, 
a deduplication issue.
2. If it is, in fact, a deduplication issue, would my only recourse be a new 
pool and a send/receive operation?  The data we're storing is VMFS volumes for 
ESX.  We're tossing around the idea of creating new volumes in the same pool 
(now that dedupe is off) and migrating VMs over in small batches.  The theory 
is that we would be writing non-deduped data this way, and when we were done we 
could remove the deduplicated volumes.  Is this sound?
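
In case it helps anyone correlate the hangs with other data, here is a minimal 
sketch of how the stalls could be flagged automatically from saved 
"zpool iostat" output like the sample quoted below.  It assumes the standard 
seven-column layout (pool, alloc, free, read ops, write ops, read bandwidth, 
write bandwidth).  The function names and the min_samples threshold are my own 
placeholders, not part of any ZFS tool.

```python
# Sketch: flag runs of consecutive zero-write samples in "zpool iostat" output.
# Assumes seven whitespace-separated columns per data line:
# pool, alloc, free, read ops, write ops, read bandwidth, write bandwidth.

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_value(text):
    """Convert a value such as '293K' or '0' to a float."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return float(text[:-1]) * UNITS[suffix]
    return float(text)


def find_write_stalls(lines, min_samples=3):
    """Return (first_sample_index, sample_count) for each zero-write run.

    min_samples is a hypothetical threshold: shorter runs are ignored.
    """
    stalls = []
    run_start = None
    index = 0  # counts data lines only; headers and separators are skipped
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            continue
        try:
            write_ops = parse_value(fields[4])
            write_bw = parse_value(fields[6])
        except ValueError:
            continue  # header or separator row
        stalled = write_ops == 0 and write_bw == 0
        if stalled and run_start is None:
            run_start = index
        elif not stalled and run_start is not None:
            if index - run_start >= min_samples:
                stalls.append((run_start, index - run_start))
            run_start = None
        index += 1
    # Close a run that is still open when the input ends.
    if run_start is not None and index - run_start >= min_samples:
        stalls.append((run_start, index - run_start))
    return stalls


SAMPLE = """\
pool0       2.47T  5.13T    120      0   293K      0
pool0       2.47T  5.13T    127      0   308K      0
pool0       2.47T  5.13T    131      0   322K      0
pool0       2.47T  5.13T    144      0   347K      0
pool0       2.47T  5.13T    135      0   331K      0
pool0       2.47T  5.13T    122      0   295K      0
pool0       2.47T  5.13T    135      0   330K      0
"""

print(find_write_stalls(SAMPLE.splitlines()))
```

Run against a longer capture taken at a fixed interval, the sample indexes 
could be multiplied by the interval to get rough start times and durations for 
each hang, to line up with the thread dumps.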

Thanks again for all the help!


> Howdy,
> We're having a ZFS performance issue over here that I
> was hoping you guys could help me troubleshoot.  We
> have a ZFS pool made up of 24 disks, arranged into 7
> raid-z devices of 4 disks each.  We're using it as an
> iSCSI back-end for VMWare and some Oracle RAC
> clusters.
> Under normal circumstances performance is very good
> both in benchmarks and under real-world use.  Every
> couple days, however, I/O seems to hang for anywhere
> between several seconds and several minutes.  The
> hang seems to be a complete stop of all write I/O.
> The following zpool iostat output illustrates:
> pool0       2.47T  5.13T    120      0   293K      0
> pool0       2.47T  5.13T    127      0   308K      0
> pool0       2.47T  5.13T    131      0   322K      0
> pool0       2.47T  5.13T    144      0   347K      0
> pool0       2.47T  5.13T    135      0   331K      0
> pool0       2.47T  5.13T    122      0   295K      0
> pool0       2.47T  5.13T    135      0   330K      0
> While this is going on our VMs all hang, as do any
> "zfs create" commands or attempts to touch/create
> files in the zfs pool from the local system.  After
> several minutes the system "un-hangs" and we see very
> high write rates before things return to normal
> across the board.
> Some more information about our configuration:  We're
> running OpenSolaris snv_134.  ZFS is at version 22.
> Our disks are 15K RPM 300GB Seagate Cheetahs, mounted
> in Promise J610S Dual enclosures, hanging off a Dell
> SAS 5/e controller.  We'd tried out most of this
> configuration previously on OpenSolaris 2009.06
> without running into this problem.  The only thing
> that's new, aside from the newer OpenSolaris/ZFS, is
> a set of four SSDs configured as log disks.
> At first we blamed de-dupe, but we've disabled that.
> Next we suspected the SSD log disks, but we've seen
> the problem with those removed, as well.
> Has anyone seen anything like this before?  Are there
> any tools we can use to gather information during the
> hang which might be useful in determining what's
> going wrong?
> Thanks for any insights you may have.
> -Charles
