[zfs-discuss] New twist on the faulted zpools
I have run into a more serious and scary situation after our array outage yesterday. As I posted earlier today, I came in this morning and found 9 LUNs offline (out of over 120). Not a big deal, as the rest of the array was OK (and still is), and the other arrays are fine. Everything is mirrored across arrays. I started "zpool replace"ing bad LUNs with some excess capacity we have. The first two went fine, the third is still resilvering. The fourth, on the other hand, has been a nightmare. Here is the current state:

  pool: deadbeef
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: resilver in progress for 2h4m, 0.07% done, 3186h1m to go
config:

        NAME                                STATE     READ WRITE CKSUM
        deadbeef                            UNAVAIL      0     0     0  insufficient replicas
          mirror-0                          DEGRADED     0     0     0
            c5t600C0FF009278536638D9B07d0   ONLINE       0     0     0
            replacing-1                     DEGRADED     0     0     0
              c5t600C0FF00922614781B19005d0 UNAVAIL      0     0     0  corrupted data
              c5t600C0FF009277F7905F6DD05d0 ONLINE       0     0     0  38K resilvered
          mirror-1                          UNAVAIL      0     0     0  corrupted data
            c5t600C0FF00927852FB91AD301d0   ONLINE       0     0     0
            c5t600C0FF00922614781B19006d0   ONLINE       0     0     0  14K resilvered
          mirror-2                          ONLINE       0     0     0
            c5t600C0FF009277F6FA1A14C06d0   ONLINE       0     0     0  31K resilvered
            c5t600015D60200B361d0           ONLINE       0     0     0
          mirror-3                          DEGRADED     0     0     0
            replacing-0                     DEGRADED     0     0     0
              c5t600C0FF0092261491D9A9F09d0 UNAVAIL      0     0     0  cannot open
              c5t600015D60200B365d0         ONLINE       0     0     0  32.9M resilvered
            c5t600C0FF009277F7905F6DD02d0   ONLINE       0     0     0  2.50K resilvered

errors: 134 data errors, use '-v' for a list

Now, of all these UNAVAIL and FAULTed devices only one is actually bad: c5t600C0FF0092261491D9A9F09d0, from the raid set that is dead. When the array was cold booted yesterday there was also a temporary outage of the LUNs from the other two raidsets (c5t600C0FF00922614781B19005d0 and c5t600C0FF00922614781B19006d0). We have seen this before, and usually a 'zpool clear' of the device and a resilver gets us back where we need to be. This time has been different...

I did a 'zpool clear deadbeef c5t600C0FF00922614781B19005d0' and the zpool immediately went UNAVAIL, with c5t600C0FF009278536638D9B07d0 going UNAVAIL. I did a 'zpool clear deadbeef c5t600C0FF009278536638D9B07d0' and it came right back. At that point I confirmed that I could read from both c5t600C0FF009278536638D9B07d0 and c5t600C0FF00922614781B19005d0 using dd. I also let the resilver in progress complete, which it did in about an hour with no issues.

I then did the zpool replace on c5t600C0FF0092261491D9A9F09d0 in mirror-3 (the really dead device) and was rewarded with an UNAVAIL pool again. I cleared a number of known-good devices and then got the pool back. At this point I assumed the zfs label on c5t600C0FF00922614781B19005d0 had somehow gotten corrupted, so I tried a zpool replace of it with itself; even with -f it would not let me. So I tried replacing it with a different LUN, as you can see above. That was when it all went into the crapper and has stayed there. zpool clear does not even return (and can't be killed). mirror-1 reports UNAVAIL but both halves report ONLINE. I am afraid to EXPORT in case it won't IMPORT, but I have also started the process of restoring from the replicated copy of the data at a remote site. After lunch I will probably try an EXPORT / IMPORT and see if that gets me anywhere.
NOTE: there are 16 other pools on this server, one of which is resilvering, one of which still has bad LUNs I need to replace, and the rest are fine. The pool in trouble has a capacity of 1.5 TB and is about 1.37 TB used; the remaining pool to clean up is 8 TB used out of 9 TB, and we really can't afford to have these kinds of problems with that one.

--
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
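For reference, a sketch of the clear/replace workflow described in the message above, using the device names from this pool; the exact invocations are reconstructed from the narrative, not a transcript of what was actually run:

# zpool status -v deadbeef                                    # list affected devices and the 134 data errors
# zpool clear deadbeef c5t600C0FF00922614781B19005d0          # retry a LUN that faulted during the array outage
# zpool replace deadbeef c5t600C0FF0092261491D9A9F09d0 <spare LUN>   # swap out the genuinely dead LUN
# zpool export deadbeef; zpool import deadbeef                # last resort: force all labels to be re-read on import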
Re: [zfs-discuss] Same device node appearing twice in same mirror; one faulted, one not...
Hi Alex,

More scary than interesting to me. What kind of hardware and which Solaris release? Do you know what steps led up to this problem? Any recent hardware changes?

This output should tell you which disks were in this pool originally:

# zpool history tank

If the history identifies tank's actual disks, maybe you can determine which disk is masquerading as c5t1d0. If that doesn't work, accessing the individual disk entries in format should tell which one is the problem, if it's only one. I would like to see the output of this command:

# zdb -l /dev/dsk/c5t1d0s0

Make sure you have a good backup of your data. If you need to pull a disk to check cabling, or rule out controller issues, you should probably export this pool first. Have a good backup. Others have resolved minor device issues by exporting/importing the pool, but with format/zpool commands hanging on your system, I'm not confident that this operation will work for you.

Thanks,
Cindy

On 05/19/11 12:17, Alex wrote:

I thought this was interesting - it looks like we have a failing drive in our mirror, but the two device nodes in the mirror are the same:

  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 1h9m with 0 errors on Sat May 14 03:09:45 2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c5t1d0  ONLINE       0     0     0
            c5t1d0  FAULTED      0     0     0  corrupted data

c5t1d0 does indeed only appear once in the "format" list. I wonder how to go about correcting this if I can't uniquely identify the failing drive. "format" takes forever to spill its guts, and the zpool commands all hang... clearly there is a hardware error here, probably causing that, but I'm not sure how to identify which disk to pull.
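One way to take the label suggestion a step further - an assumption on my part, not something from the thread - is to compare the GUIDs in every candidate disk's label against the pool configuration; the impostor half of the mirror should show a stale, mismatched, or unreadable label:

# for d in /dev/dsk/c5t*d0s0; do echo "== $d"; zdb -l $d | grep -E 'guid|txg|state'; done

If zdb itself hangs on one of the disks, that by itself narrows down which spindle to pull.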
Re: [zfs-discuss] Is Dedup processing parallelized?
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
>
> 1) The process is rather slow (I think due to dedup involved -
> even though, by my calculations, the whole DDT can fit in
> my 8Gb RAM),

[In other words, the complete DDT fits in RAM.]

Please see: http://opensolaris.org/jive/thread.jspa?messageID=516567

In particular:

> New problem:
> I'm following all the advice I summarized into the OP of this thread, and
> testing on a test system. (A laptop). And it's just not working. I am
> jumping into the dedup performance abyss far, far earlier than predicted...

and: I have another post, which doesn't seem to have found its way to this list, so I just resent it. Here's a snippet:

> This is a workstation with a 6-core processor, 16G RAM, and a single 1TB
> hard disk. In the default configuration, arc_meta_limit is 3837MB. And as I
> increase the number of unique blocks in the data pool, it is perfectly clear
> that performance jumps off a cliff when arc_meta_used starts to reach that
> level, which is approx 880,000 to 1,030,000 unique blocks. FWIW, this means,
> without evil tuning, a 16G server is only sufficient to run dedup on approx
> 33GB to 125GB unique data without severe performance degradation.

> # zdb -D -e 1601233584937321596
> DDT-sha256-zap-ditto: 68 entries, size 1807 on disk, 240 in core
> DDT-sha256-zap-duplicate: 1970815 entries, size 1134 on disk, 183 in core
> DDT-sha256-zap-unique: 4376290 entries, size 1158 on disk, 187 in core
>
> dedup = 1.38, compress = 1.07, copies = 1.01, dedup * compress / copies = 1.46
>
> # zdb -D -e dcpool
> DDT-sha256-zap-ditto: 388 entries, size 380 on disk, 200 in core
> DDT-sha256-zap-duplicate: 5421787 entries, size 311 on disk, 176 in core
> DDT-sha256-zap-unique: 16841361 entries, size 284 on disk, 145 in core
>
> dedup = 1.34, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.34
>
> # echo ::sizeof ddt_entry_t | mdb -k
> sizeof (ddt_entry_t) = 0x178

As you can see in that other thread, I am exploring dedup performance too, and finding that this method of calculation is totally ineffective. Number of blocks times size of ddt_entry, as you have seen, produces a reasonable number, but the experimentally measured results are nowhere near it.
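To make the calculation under discussion concrete, here is the naive estimate computed from the entry counts in the quoted zdb output and the 376-byte (0x178) in-RAM ddt_entry_t size; as noted above, measured memory pressure can be far worse than this figure suggests:

# Naive DDT RAM estimate for the target pool: (duplicate + unique entries) * in-RAM entry size
echo $(( (1970815 + 4376290) * 376 / 1024 / 1024 ))   # => ~2276 MB, i.e. roughly the 2.4Gb quoted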
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>
> New problem:
>
> I'm following all the advice I summarized into the OP of this thread, and
> testing on a test system. (A laptop). And it's just not working. I am
> jumping into the dedup performance abyss far, far earlier than predicted...

(Resending this message, because it doesn't seem to have been delivered the first time. If this is a repeat, please ignore.)

Now I'm repeating all these tests on a system that more closely resembles a server. This is a workstation with a 6-core processor, 16G RAM, and a single 1TB hard disk.

In the default configuration, arc_meta_limit is 3837MB. And as I increase the number of unique blocks in the data pool, it is perfectly clear that performance jumps off a cliff when arc_meta_used starts to reach that level, which is approx 880,000 to 1,030,000 unique blocks. FWIW, this means that without evil tuning, a 16G server is only sufficient to run dedup on approx 33GB to 125GB of unique data without severe performance degradation. I'm calling "severe degradation" anything that's an order of magnitude or worse. (That's 40K average block size * 880,000 unique blocks, and 128K average block size * 1,030,000 unique blocks.) So clearly this needs to be addressed, if dedup is going to be super-awesome moving forward. But I didn't quit there.

So then I tweaked the arc_meta_limit, set it to 7680MB, and repeated the test. This time the edge of the cliff is not so clearly defined, something like 1,480,000 to 1,620,000 blocks. But the problem is: arc_meta_used never even comes close to 7680MB. At all times I still have at LEAST 2G of unused free mem. I have 16G physical mem, but at all times I always have at least 2G free. My arcstats:c_max is 15G, but my arc size never exceeds 8.7G. My arc_meta_limit is 7680 MB, but my arc_meta_used never exceeds 3647 MB. So what's the holdup?

All of the above is, of course, just a summary. If you want complete, overwhelming details, here they are:

http://dl.dropbox.com/u/543241/dedup%20tests/readme.txt
http://dl.dropbox.com/u/543241/dedup%20tests/datagenerate.c
http://dl.dropbox.com/u/543241/dedup%20tests/getmemstats.sh
http://dl.dropbox.com/u/543241/dedup%20tests/parse.py
http://dl.dropbox.com/u/543241/dedup%20tests/runtest.sh
http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-1st-pass.txt
http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-1st-pass-parsed.xlsx
http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-2nd-pass.txt
http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-2nd-pass-parsed.xlsx
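For anyone wanting to reproduce the arc_meta_limit tweak described above, the usual route on Solaris-derived systems is an /etc/system tunable; this is a sketch under that assumption (the 7680 MB value matches the test above, and the change only takes effect after a reboot):

* /etc/system: raise the ARC metadata cap to 7680 MB (0x1E0000000 bytes)
set zfs:zfs_arc_meta_limit = 0x1E0000000

* verify the running values afterwards with:
* # echo ::arc | mdb -k | egrep 'arc_meta_used|arc_meta_limit'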
Re: [zfs-discuss] Monitoring disk seeks
On 05/19/2011 07:47 PM, Richard Elling wrote:
> On May 19, 2011, at 5:35 AM, Sašo Kiselkov wrote:
>
>> Hi all,
>>
>> I'd like to ask whether there is a way to monitor disk seeks. I have an
>> application where many concurrent readers (>50) sequentially read a
>> large dataset (>10T) at a fairly low speed (8-10 Mbit/s). I can monitor
>> read/write ops using iostat, but that doesn't tell me how contiguous the
>> data is, i.e. when iostat reports "500" read ops, does that translate to
>> 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks!
>
> In general, this is hard to see from the OS. In Solaris, the default I/O
> flowing through sd gets sorted based on LBA before being sent to the
> disk. If the disk gets more than 1 concurrent I/O request (10 is the default
> for Solaris-based ZFS) then the disk can resort or otherwise try to optimize
> the media accesses.
>
> As others have mentioned, iopattern is useful for looking at sequential
> patterns. I've made some adjustments for the version at
> http://www.richardelling.com/Home/scripts-and-programs-1/iopattern
>
> You can see low-level SCSI activity using scsi.d, but I usually uplevel that
> to using "iosnoop -Dast" which shows each I/O and its response time.
> Note that the I/Os can complete out-of-order on many devices. The only
> device I know that is so fast and elegant that it always completes in-order
> is the DDRdrive.
>
> For detailed analysis of iosnoop data, you will appreciate a real statistics
> package. I use JMP, but others have good luck with R.
> -- richard

Thank you, the iopattern script seems to be quite close to what I wanted. The percentage split between random and sequential I/O is pretty much what I needed to know.

Regards,
--
Saso
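For readers who want to try the two tools mentioned above, a minimal invocation sketch (the interval argument to iopattern is assumed from the DTraceToolkit version of the script; -Dast is the flag set quoted by Richard):

# ./iopattern 10     # print the %random vs. %sequential split of disk events every 10 seconds
# ./iosnoop -Dast    # one line per I/O, with timing details suitable for later statistical analysis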
[zfs-discuss] Is Dedup processing parallelized?
Hi all,

On my oi_148a system I'm now in the process of "evacuating" data from my "dcpool" (an iSCSI device with a ZFS pool inside), which is hosted in my physical "pool" on harddisks (6-disk raidz2). The "dcpool" was configured to dedup all data inside it, and the volume "pool/dcpool" was compressed, so as to separate the two processes. I decided to scrap this experiment, and now I'm copying my data back by reading files from "dcpool" and writing them into compressed+deduped datasets in "pool".

I often see two interesting conditions in this setup:

1) The process is rather slow (I think due to the dedup involved - even though, by my calculations, the whole DDT can fit in my 8Gb RAM); however, the kernel processing time often peaks out at close to 50%, and there is often quite a bit of idle time. I have a dual-core box, so it makes sense to believe that some system cycle is not using more than one core. Does anyone know whether the DDT tree walk, the search for available block ranges in metaslabs, or whatever other lengthy cycles there may be, are done in a sequential (single-threaded) fashion?

Below is my current DDT sizing. I still do not know which value to trust as the DDT entry size in RAM - the one returned by MDB or by ZDB (otherwise - what are those in-core and on-disk values? I've asked before but got no replies...)

# zdb -D -e 1601233584937321596
DDT-sha256-zap-ditto: 68 entries, size 1807 on disk, 240 in core
DDT-sha256-zap-duplicate: 1970815 entries, size 1134 on disk, 183 in core
DDT-sha256-zap-unique: 4376290 entries, size 1158 on disk, 187 in core

dedup = 1.38, compress = 1.07, copies = 1.01, dedup * compress / copies = 1.46

# zdb -D -e dcpool
DDT-sha256-zap-ditto: 388 entries, size 380 on disk, 200 in core
DDT-sha256-zap-duplicate: 5421787 entries, size 311 on disk, 176 in core
DDT-sha256-zap-unique: 16841361 entries, size 284 on disk, 145 in core

dedup = 1.34, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.34

# echo ::sizeof ddt_entry_t | mdb -k
sizeof (ddt_entry_t) = 0x178

Since I'm writing to "pool" (queried by GUID number above), my box's performance primarily depends on its DDT - I guess. In the worst case that's 6.4mil entries times 376 bytes = 2.4Gb, which is well below my computer's 8Gb RAM (and fits the ARC metadata report below). However, the "dcpool"'s current DDT is clearly big, about 23mil entries * 376 bytes = 8.6Gb.

2) As seen below, the ARC including metadata currently takes up 3.7Gb. According to prstat, all of the global zone processes use 180Mb. ZFS is the only filesystem on this box. So the second question is: who uses the other 4Gb of system RAM? This picture occurs consistently on every system uptime, as long as I use the pool for reading and/or writing extensively, and it seems that this is some sort of kernel buffering or workspace memory or whatever (cached metaslab allocation tables, maybe?), and it is not part of ARC - but it is even bigger. What is it? Can it be controlled (so as to not decrease performance when ARC and/or DDT need more RAM) or at least queried?
# ./tuning/arc_summary.pl | egrep -v 'mdb|set zfs:' | head -18 | grep ": "; echo ::arc | mdb -k | grep meta_
Physical RAM:                          8183 MB
Free Memory:                            993 MB
LotsFree:                               127 MB
Current Size:                          3705 MB (arcsize)
Target Size (Adaptive):                3705 MB (c)
Min Size (Hard Limit):                 3072 MB (zfs_arc_min)
Max Size (Hard Limit):                 6656 MB (zfs_arc_max)
Most Recently Used Cache Size:    90%  3342 MB (p)
Most Frequently Used Cache Size:   9%   362 MB (c-p)
arc_meta_used  = 2617 MB
arc_meta_limit = 6144 MB
arc_meta_max   = 4787 MB

Thanks for any insights,
//Jim Klimov
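One way to chase question 2 above - offered as a suggestion, not something from the thread - is mdb's ::memstat dcmd, which breaks physical memory down by consumer and should show where the unaccounted ~4Gb is sitting:

# echo ::memstat | mdb -k     # per-consumer breakdown of physical pages
                              # (kernel, anon, page cache, free; recent builds
                              #  also report "ZFS File Data" as a separate line)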
Re: [zfs-discuss] Solaris vs FreeBSD question
-----Original Message-----
From: Frank Van Damme
Sent: Friday, May 20, 2011 6:25 AM

> On 20-05-11 01:17, Chris Forgeron wrote:
>> I ended up switching back to FreeBSD after using Solaris for some time
>> because I was getting tired of weird pool corruptions and the like.
>
> Did you ever manage to recover the data you blogged about on Sunday, February 6, 2011?

Oh yes, I didn't follow up on that. I'll have to do that now... here's the recap.

Yes, I did get most of it back, thanks to a lot of effort from George Wilson (great guy, and I'm very indebted to him). However, any data that was in play at the time of the fault was irreversibly damaged and couldn't be restored. Any data that wasn't active at the time of the crash was perfectly fine; it just needed to be copied out of the pool into a new pool.

George had to mount my pool for me, as it was beyond non-ZFS-programmer skills to mount. Unfortunately Solaris would dump after about 24 hours, requiring a second mounting by George. It was also slower than cold molasses to copy anything in its faulted state. If I was getting 1 Meg/Sec, I was lucky. You can imagine that creates an issue when you're trying to evacuate a few TB of data through a slow pipe like that. After it dumped again, I didn't bother George for a third remounting (or I tried very half-heartedly; the guy was already into this for a lot of time, and we all have our day jobs), and abandoned the data that was still stranded on the faulted pool. I copied my most wanted data first, so what I abandoned was a personal collection of movies that I could always re-rip.

I was still experimenting with ZFS at the time, so I wasn't using snapshots for backup, just conventional image backups of the VMs that were running. Snapshots would have had a good chance of protecting my data from the fault that I ran into.

I was originally blaming my Areca 1880 card, as I was working with Areca tech support on a more stable driver for Solaris, and was on the 3rd revision of a driver with them. However, in the end it wasn't the Areca, as I was very familiar with its tricks - the Areca would hang (about once every day or two), but it wouldn't take out the pool. After removing the Areca and going with just LSI 2008-based controllers, I had one final fault about 3 weeks later that corrupted another pool (luckily it was just a backup pool). At that point, the swearing in the server room reached a peak, I booted back into FreeBSD, and haven't looked back.

Originally, when I used the Areca controller with FreeBSD, I didn't have any problems for about 2 months. I've had only small FreeBSD issues since then, and nothing else has changed on my hardware. So the only claim I can make is that in my environment, on my hardware, I've had better stability with FreeBSD.

One of the speed slow-downs with FreeBSD from my comparison tests was the O_SYNC method that ESX uses to mount an NFS store. I edited the FreeBSD NFS source to always do an async write, regardless of the O_SYNC from the client, and that perked FreeBSD up a lot for speed, making it fairly close to what I was getting on Solaris. FreeBSD is now using a 4.1 NFS server by default as of the last month, and I'm just starting my stability tests with a new FreeBSD-9 build to see if I can run newer code. I'll do speed tests again, and will probably make the same hack to the 4.1 NFS code to force async writes. I'll post to my blog and the FreeBSD lists when that occurs, as it's out of scope for this list.
I do like Solaris - after some initial discomfort about the different way things were being done, I do see the overall design and idea, and I now have a wish list of features I'd like to see ported to FreeBSD. I think I'll have a Solaris-based box set up again for testing. We'll see what time allows.
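As an aside on the async-write hack described above: on pools new enough to have the sync dataset property, a similar effect can be obtained per dataset without patching the NFS server, at the same cost in crash consistency for the client's data. A sketch, with a made-up dataset name:

# zfs set sync=disabled tank/vmstore    # hypothetical dataset exported to ESX over NFS
# zfs get sync tank/vmstore             # confirm; 'standard' is the default, 'disabled' treats sync writes as async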
Re: [zfs-discuss] Faulted Pool Question
On Fri, May 20, 2011 at 12:53 AM, Richard Elling wrote:
> On May 19, 2011, at 2:09 PM, Paul Kraus wrote:
>> Is there a way (other than zpool online) to kick ZFS into
>> rescanning the LUNs?
>
> zpool clear poolname

I am unclear on when clear is the right command vs. online. I have not gotten consistent information from Oracle. Can Richard (or someone else) please summarize here? Thanks.

>> If I had realized the entire 3511 array had gone away and that we
>> would be restarting it, I would NOT have attempted to replace the
>> faulted LUN and we would probably be OK.
>
> yes

Yeah, hindsight and all that. But at the moment I hit return on the zpool replace we still only had one of three trays faulted on the 3511... sigh.

>> P.S. The other zpools on the box are still up and running. The ones
>> that had devices on the faulted 3511 are degraded but online, the ones
>> that did not have devices on the faulted 3511 are OK. Because of these
>> other zpools we can't really reboot the box or pull the FC
>> connections.
>
> Reboot isn't needed, this isn't a PeeCee :-)

Oracle support recommended a reboot (which did clear the ZFS issue). I was not at the office to try to get a better solution out of Oracle.

Now this morning, the original tray in the 3511 that failed is offline again, but this time it is not the bug we have run into before, but a genuine failure of more than one drive in a RAID set. So now I am zpool replacing the faulted LUNs (and have asked that no one reboot any 3511's until I am done :-)

--
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
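A short sketch of the distinction being asked about, based on the documented behavior of the two subcommands rather than anything settled in this thread (device names elided):

# zpool online tank c5t600C0FF0...d0   # reverse an administrative 'zpool offline' of a device
# zpool clear  tank c5t600C0FF0...d0   # reset the error/fault counters so a device marked FAULTED or
                                       # UNAVAIL after I/O failures is retried and resilvered
# zpool clear  tank                    # the same, for every device in the pool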
Re: [zfs-discuss] Solaris vs FreeBSD question
On 20-05-11 01:17, Chris Forgeron wrote:
> I ended up switching back to FreeBSD after using Solaris for some time
> because I was getting tired of weird pool corruptions and the like.

Did you ever manage to recover the data you blogged about on Sunday, February 6, 2011?

--
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.