So, I'm still having problems with intermittent hangs on write with my ZFS pool. Details from my original post are below. Since posting that, I've gone back and forth with a number of you, and gotten a lot of useful advice, but I'm still trying to get to the root of the problem so I can correct it. Since the original post I have:
- Gathered a great deal of information in the form of kernel thread dumps, zio_state dumps, and live crash dumps while the problem is happening.
- Been advised that my ruling out of dedupe was probably premature, as I likely still have a good deal of deduplicated data on disk.
- Checked just about every log and counter that might indicate a hardware error, without finding one.

I was wondering at this point if someone could give me some pointers on the following:

1. Given the dumps and diagnostic data I've gathered so far, is there a way I can determine for certain where in the ZFS driver I'm spending so much time hanging? At the very least I'd like to determine whether it is, in fact, a deduplication issue.

2. If it is, in fact, a deduplication issue, would my only recourse be a new pool and a send/receive operation? The data we're storing is VMFS volumes for ESX. We're tossing around the idea of creating new volumes in the same pool (now that dedupe is off) and migrating VMs over in small batches. The theory is that we would be writing non-deduped data this way, and when we were done we could remove the deduplicated volumes. Is this sound?

Thanks again for all the help!

-Charles

> Howdy,
>
> We're having a ZFS performance issue over here that I
> was hoping you guys could help me troubleshoot. We
> have a ZFS pool made up of 24 disks, arranged into 7
> raid-z devices of 4 disks each. We're using it as an
> iSCSI back-end for VMware and some Oracle RAC
> clusters.
>
> Under normal circumstances performance is very good,
> both in benchmarks and under real-world use. Every
> couple of days, however, I/O seems to hang for anywhere
> between several seconds and several minutes. The
> hang seems to be a complete stop of all write I/O.
> The following zpool iostat illustrates:
>
> pool0  2.47T  5.13T  120  0  293K  0
> pool0  2.47T  5.13T  127  0  308K  0
> pool0  2.47T  5.13T  131  0  322K  0
> pool0  2.47T  5.13T  144  0  347K  0
> pool0  2.47T  5.13T  135  0  331K  0
> pool0  2.47T  5.13T  122  0  295K  0
> pool0  2.47T  5.13T  135  0  330K  0
>
> While this is going on, our VMs all hang, as do any
> "zfs create" commands or attempts to touch/create
> files in the ZFS pool from the local system. After
> several minutes the system "un-hangs" and we see very
> high write rates before things return to normal
> across the board.
>
> Some more information about our configuration: We're
> running OpenSolaris snv_134. ZFS is at version 22.
> Our disks are 15k RPM 300GB Seagate Cheetahs, mounted
> in Promise J610S Dual enclosures, hanging off a Dell
> SAS 5/e controller. We'd tried out most of this
> configuration previously on OpenSolaris 2009.06
> without running into this problem. The only thing
> that's new, aside from the newer OpenSolaris/ZFS, is
> a set of four SSDs configured as log disks.
>
> At first we blamed dedupe, but we've disabled that.
> Next we suspected the SSD log disks, but we've seen
> the problem with those removed as well.
>
> Has anyone seen anything like this before? Are there
> any tools we can use to gather information during the
> hang which might be useful in determining what's
> going wrong?
>
> Thanks for any insights you may have.
>
> -Charles
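
In case it helps anyone line the dumps up against the hang windows, here is a minimal Python sketch that picks stretches of zero write activity out of captured zpool iostat output like the sample quoted above. It assumes the seven-column per-pool layout shown there (pool, two capacity columns, read/write operations, read/write bandwidth). The function names and the min_samples threshold are made up for illustration, and a run of zero-write samples can also just mean the pool is idle, so treat a hit as a hint rather than proof of a hang.

```python
# Sketch only: flags runs of samples with no write operations and no write
# bandwidth. Assumes seven whitespace-separated columns per sample line, as
# in the zpool iostat excerpt quoted above. Names here are hypothetical.

UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def parse_value(text):
    """Convert a value such as '293K', '1.2M' or '0' to a float."""
    if text and text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)  # raises ValueError on headers and separator rows


def find_write_stalls(lines, min_samples=3):
    """Return (first, last) sample indexes of runs with zero write activity.

    Only runs at least min_samples long are reported. Lines that do not
    look like a sample (headers, separators, blanks) are skipped and do
    not count toward the sample index.
    """
    stalls = []
    run_start = None
    index = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            continue
        try:
            write_ops = parse_value(fields[4])
            write_bw = parse_value(fields[6])
        except ValueError:
            continue
        idle = write_ops == 0 and write_bw == 0
        if idle and run_start is None:
            run_start = index
        elif not idle and run_start is not None:
            if index - run_start >= min_samples:
                stalls.append((run_start, index - 1))
            run_start = None
        index += 1
    # Close a run that is still open when the capture ends.
    if run_start is not None and index - run_start >= min_samples:
        stalls.append((run_start, index - 1))
    return stalls
```

Feeding it the lines captured from something like "zpool iostat pool0 1" returns index pairs that can be matched against the timestamps of the thread dumps. With a one-second interval, a higher min_samples avoids flagging the normal gaps between transaction group flushes.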