Re: [zfs-discuss] Single disk parity
Richard Elling wrote: There are many error correcting codes available. RAID-2 used Hamming codes, but that's just one of many options out there. Par2 uses configurable-strength Reed-Solomon coding to get multi-bit error correction. The par2 source is available, although from a ZFS perspective it is hindered by the CDDL-GPL license incompatibility problem. It is possible to write a FUSE filesystem using Reed-Solomon (like par2) as the underlying protection. A quick search of the FUSE website turns up the Reed-Solomon FS (a FUSE-based filesystem), "Shielding your files with Reed-Solomon codes": http://ttsiodras.googlepages.com/rsbep.html While most FUSE work is on Linux, and there is a ZFS-FUSE project for it, there has also been FUSE work done for OpenSolaris: http://www.opensolaris.org/os/project/fuse/

BTW, if you do have the case where unprotected data is not readable, then I have a little DTrace script that I'd like you to run which would help determine the extent of the corruption. This is one of those studies which doesn't like induced errors ;-) http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon

Is this intended as a general monitoring script, or only for use after one has otherwise experienced corruption problems?

It is intended to try to answer the question of whether the errors we see in real life might be single-bit errors. I do not believe they will be single-bit errors, but we don't have the data.

To be pedantic, wouldn't protected data also be affected if all copies are damaged at the same time, especially if also damaged in the same place?

Yep. Which is why there is RFE CR 6674679, "complain if all data copies are identical and corrupt". -- richard

There is a related but unlikely scenario that is also probably not covered yet. I'm not sure what kind of common cause would lead to it; maybe a disk array turning into swiss cheese, with bad sectors suddenly showing up on multiple drives? Its probability increases with larger logical block sizes (e.g. 128k blocks are at higher risk than 4k blocks; a block being the smallest piece of storage real estate used by the filesystem). It is the edge case of multiple damaged copies where the damage consists of unreadable bad sectors at different corresponding sectors of a block. This could be recovered from by copying the readable sectors from each copy and filling in the holes using the appropriate sectors from the other copies. The final result, a rebuilt block, should pass the checksum test, assuming there were no other problems with the still-readable sectors.

A bad-sector-specific recovery technique is to instruct the disk to return raw read data rather than trying to correct it. The READ LONG command can do this (though the specs say it only works with 28-bit LBA). (READ LONG corresponds to writes done with WRITE LONG (28-bit) or WRITE UNCORRECTABLE EXT (48-bit). Linux hdparm uses these write commands when it is used to create bad sectors with the --make-bad-sector option. The resulting sectors are logically bad at a low level, where the sector's data and ECC do not match; they are not physically bad.) With multiple read attempts, a statistical distribution of the likely "true" contents of the sector can be found. SpinRite claims to do this. Linux "hdparm --read-sector" can sometimes return data from nominally bad sectors too, but it doesn't have a built-in statistical recovery method (a wrapper script could probably solve that). I don't know whether hdparm --read-sector uses READ LONG or not.
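(To make the two recovery ideas above concrete, here is a minimal sketch in Python - purely illustrative, since ZFS exposes no such interface, and the checksum, sector granularity and data layout are my assumptions. One function rebuilds a block from two copies whose bad sectors don't overlap, accepting the result only if it passes a checksum; the other does a per-byte majority vote over repeated raw reads of one sector, in the spirit of the READ LONG / SpinRite approach.)

import hashlib
from collections import Counter

def rebuild_block(copy_a, copy_b, expected_sha256):
    # copy_a / copy_b: per-sector reads of the same block from two copies;
    # each entry is bytes for a readable sector, or None where the drive
    # returned an unrecoverable read error.
    if len(copy_a) != len(copy_b):
        return None
    sectors = []
    for sect_a, sect_b in zip(copy_a, copy_b):
        good = sect_a if sect_a is not None else sect_b
        if good is None:
            # bad in both copies at the same offset - cannot rebuild
            return None
        sectors.append(good)
    block = b"".join(sectors)
    # Only a rebuilt block that passes the checksum test counts as recovered.
    return block if hashlib.sha256(block).hexdigest() == expected_sha256 else None

def vote_sector(raw_reads):
    # raw_reads: several raw (uncorrected) reads of the same sector, e.g.
    # obtained via READ LONG; returns the per-byte majority-vote result.
    return bytes(Counter(column).most_common(1)[0][0] for column in zip(*raw_reads))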
hdparm man page: http://linuxreviews.org/man/hdparm/

Good description of IDE commands, including READ LONG and WRITE LONG (the specs say they are 28-bit only): http://www.repairfaq.org/filipg/LINK/F_IDE-tech.html

SCSI versions of READ LONG and WRITE LONG:
http://en.wikipedia.org/wiki/SCSI_Read_Commands#Read_Long
http://en.wikipedia.org/wiki/SCSI_Write_Commands#Write_Long

Here is a report by forum member "qubit", who modified his Linux taskfile driver to use READ LONG for data recovery purposes, and his subsequent analysis:
http://forums.storagereview.net/index.php?showtopic=5910
http://www.tech-report.com/news_reply.x/3035
http://techreport.com/ja.zz?comments=3035&page=5

-- quote --
318. Posted at 07:00 am on Jun 6th 2002 by qubit

My DTLA-307075 (75GB 75GXP) went bad 6 months ago. But I didn't write off the data as being unrecoverable. I used WinHex to make a ghost image of the drive onto my new larger one, zeroing out the bad sectors in the target while logging each bad sector. (There were bad sectors in the FAT, so I combined the good parts from FATs 1 and 2.) At this point I had a working mirror of the drive that went bad, with zeroed-out 512-byte holes in files where the bad sectors were. Then I set the 75GXP aside, because I knew it was possible to recover some of the data *on* the bad sectors, but I didn't have the tools to do it. So I decided to wait until then to RMA it. I did
[zfs-discuss] Slow Resilvering Performance
I know this topic has been discussed many times... but what the hell makes zpool resilvering so slow? I'm running OpenSolaris 2009.06.

I have had a large number of problematic disks due to a bad production batch, leading me to resilver quite a few times, progressively replacing each disk as it dies (and now preemptively removing disks). My complaint is that resilvering ends up taking... days! The average write rate to the disk being resilvered is 1 to 3 MB/sec. You can see zpool status and iostat -v output here: http://pastebin.com/mcbb8dfd

When I read files off the zpool, I get quite a few MB/sec even in a degraded state, although the zpool is idle while resilvering in this case - no snapshots or anything else happening on it. The system has 3 GB of RAM and a 2.8 GHz dual-core CPU which is always >90% idle while resilvering. The number of I/O operations per second is nowhere near the disk's limits. Scrubbing takes 3-4 hours at the most, so it's clearly not a read bottleneck. Even if I have a configuration where only one disk is being replaced (and all others are OK), I never pass the 1-3 MB/sec limit.

What is going on? I have had to resilver 4 times so far, and I have to resilver at least once more. Each resilvering takes a day or two, and I can't see why... it's not CPU, it's not sustained read throughput, it's not IOPS, so what is it??

Galen
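(Not an answer, but it may help to quantify the slowness: a rough Python sketch that samples `zpool status` and reports how quickly the resilver's "% done" figure moves. The pool name is an example, and the regex assumes the "x.xx% done" wording this build prints.)

import re
import subprocess
import time

POOL = "tank"        # example pool name - substitute your own
INTERVAL = 60        # seconds between samples

def percent_done(pool):
    # Scrape the "x.xx% done" figure out of `zpool status <pool>`.
    out = subprocess.run(["zpool", "status", pool],
                         capture_output=True, text=True).stdout
    match = re.search(r"([\d.]+)% done", out)
    return float(match.group(1)) if match else None

prev = percent_done(POOL)
while prev is not None and prev < 100.0:
    time.sleep(INTERVAL)
    cur = percent_done(POOL)
    if cur is None:
        break
    print("%6.2f%% done  (+%.2f%% in %ds)" % (cur, cur - prev, INTERVAL))
    prev = cur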
Re: [zfs-discuss] zfs root, jumpstart and flash archives
Worked great during test jumpstart, thanks!
Re: [zfs-discuss] cannot receive new filesystem stream: invalid backup stream
Alexander Skwar wrote:

Hallo. I'm trying to do "zfs send -R" from a S10 U6 Sparc system to a Solaris 10 U7 Sparc system. The filesystem in question is running version 1. Here's what I did:

$ fs=data/oracle ; snap=transfer.hot-b ; sudo zfs send -R $...@$snap | sudo rsh winds07-bge0 "zfs create rpool/trans/winds00r/${fs%%/*} || : ; zfs recv -u -v -F -d rpool/trans/winds00r/${fs%%/*}"
receiving full stream of data/ora...@transfer.initial-hot into rpool/trans/winds00r/data/ora...@transfer.initial-hot
received 15.0KB stream in 6 seconds (2.50KB/sec)
receiving incremental stream of data/ora...@transfer.hot-b into rpool/trans/winds00r/data/ora...@transfer.hot-b
received 312B stream in 3 seconds (104B/sec)
receiving full stream of data/oracle/u...@transfer.initial-hot into rpool/trans/winds00r/data/oracle/u...@transfer.initial-hot
cannot receive new filesystem stream: invalid backup stream

Why did the "send" or "receive" fail with "invalid backup stream"? Is sending from U6 to U7 not supported?

It is supported, but I don't think -R is supported with a version 1 filesystem.

-- Ian.
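(If the filesystem version really is the culprit, a pre-flight check along these lines might save a failed transfer - a sketch only: `zfs get -H -o value version` and `zfs upgrade` are standard commands, but whether upgrading the source filesystem makes the -R stream acceptable to the U7 receiver is my assumption, not something confirmed in this thread.)

import subprocess
import sys

DATASET = "data/oracle"    # the dataset from the thread

def fs_version(dataset):
    # Ask zfs for the dataset's filesystem version as a bare number.
    out = subprocess.run(["zfs", "get", "-H", "-o", "value", "version", dataset],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

ver = fs_version(DATASET)
if ver == 1:
    print("%s is still at filesystem version 1; "
          "consider `zfs upgrade %s` before using send -R" % (DATASET, DATASET),
          file=sys.stderr)
    sys.exit(1)
print("%s is at filesystem version %d" % (DATASET, ver))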
Re: [zfs-discuss] Problem with mounting ZFS from USB drive
Thanks for the great tips. I did some more testing and indeed it was a version issue. The pool was created under:

# zpool upgrade
This system is currently running ZFS version 14.

whereas I tried it on systems with versions 10 and 12. It could be imported on a newer system using the -f option. I suppose it did not auto-mount the pool because it had the same name as existing pools.

Is this a known issue with ZFS? I assume that because of this portability issue ZFS is not really suitable for use on removable media such as USB drives that are intended to be mounted on different hosts, which may have different Solaris versions. It may be better to stick with UFS for this case.

/KarlD
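(For the removable-media case, one option - assuming the newer host's zpool supports setting the version property at creation time - is to create the pool pinned to the lowest pool version any of the hosts understands, so older releases can still import it. A sketch with example names:)

import subprocess

POOL = "usbpool"          # example pool name
DEVICE = "c5t0d0"         # example USB device
TARGET_VERSION = "10"     # lowest ZFS pool version among the hosts involved

# Create the pool at an older on-disk version so older hosts can import it.
subprocess.run(["zpool", "create", "-o", "version=%s" % TARGET_VERSION,
                POOL, DEVICE], check=True)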
Re: [zfs-discuss] Very slow ZFS write speed to raw zvol
Writes using the character interface (/dev/zvol/rdsk) are synchronous. If you want caching, you can go through the block interface (/dev/zvol/dsk) instead.

- Eric
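(A quick way to see the difference is to time the same write workload against both device nodes - a rough Python sketch with a hypothetical volume name; note that it overwrites the first 64 MB of the volume.)

import subprocess
import time

VOLUME = "tank/testvol"   # hypothetical zvol - this test OVERWRITES its start
COUNT = 8192              # 8192 x 8 KB = 64 MB

for node in ("/dev/zvol/rdsk/%s" % VOLUME, "/dev/zvol/dsk/%s" % VOLUME):
    start = time.time()
    # Character node: each write is synchronous; block node: writes are cached.
    subprocess.run(["dd", "if=/dev/zero", "of=%s" % node,
                    "bs=8k", "count=%d" % COUNT], check=True)
    elapsed = time.time() - start
    print("%s: %.1f MB/s" % (node, COUNT * 8 / 1024.0 / elapsed))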
[zfs-discuss] cannot receive new filesystem stream: invalid backup stream
Hallo. I'm trying to do "zfs send -R" from a S10 U6 Sparc system to a Solaris 10 U7 Sparc system. The filesystem in question is running version 1. Here's what I did:

$ fs=data/oracle ; snap=transfer.hot-b ; sudo zfs send -R $...@$snap | sudo rsh winds07-bge0 "zfs create rpool/trans/winds00r/${fs%%/*} || : ; zfs recv -u -v -F -d rpool/trans/winds00r/${fs%%/*}"
receiving full stream of data/ora...@transfer.initial-hot into rpool/trans/winds00r/data/ora...@transfer.initial-hot
received 15.0KB stream in 6 seconds (2.50KB/sec)
receiving incremental stream of data/ora...@transfer.hot-b into rpool/trans/winds00r/data/ora...@transfer.hot-b
received 312B stream in 3 seconds (104B/sec)
receiving full stream of data/oracle/u...@transfer.initial-hot into rpool/trans/winds00r/data/oracle/u...@transfer.initial-hot
cannot receive new filesystem stream: invalid backup stream

Why did the "send" or "receive" fail with "invalid backup stream"? Is sending from U6 to U7 not supported?

Alexander

PS: Yes, it's about Solaris 10. There are also a lot of other threads about S10, so it doesn't seem like S10 is off-topic here. Additionally, the ZFS "gurus" are to be found here, so this makes this list even more appropriate for such a question, I'd think. But if this question is wrong here, please be so kind and inform me about the right place for such a question. Thanks a lot!
Re: [zfs-discuss] SMART problems with AOC-SAT2-MV8 / marvell88sx driver
I don't know how relevant this is to you on Nexenta, but I can tell you that the driver support for that card improved tremendously with OpenSolaris 2008.11. All of our hot swap problems went away with that release but the change wasn't documented anywhere that I could see. It might be worth simply booting off a 2009.06 live CD and seeing if any of these commands work.
Re: [zfs-discuss] Status of zpool remove in raidz and non-redundant stripes
You're right - in my company (a very big one) we just stumbled across this as well and we're strongly considering not using ZFS because of it. It's easy to type zpool add when you meant zpool replace - and then you can go rebuild your box because it was the root pool. Nice. At the very least, "zpool add" should have more warnings.
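(One stop-gap until such warnings exist: never run a bare `zpool add`; wrap it so the dry-run configuration from `zpool add -n` is shown and confirmed first. A Python sketch:)

import subprocess
import sys

args = sys.argv[1:]    # e.g. rpool mirror c1t2d0s0 c1t3d0s0
if not args:
    sys.exit("usage: guarded-zpool-add <pool> <vdev> ...")

# Dry run: show the configuration that would result, without changing anything.
subprocess.run(["zpool", "add", "-n"] + args, check=True)

answer = input("Really ADD (not replace) these devices? [y/N] ")
if answer.strip().lower() == "y":
    subprocess.run(["zpool", "add"] + args, check=True)
else:
    print("aborted - did you mean `zpool replace`?")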
[zfs-discuss] SMART problems with AOC-SAT2-MV8 / marvell88sx driver
I've been trying to get either smartctl or sg3_utils to report properly. They both have the same low-level problems, which leads me to suspect either I'm doing something wrong OR there is a problem in the marvell88sx / sd / SATA etc. framework.

I can access the drive name/serial number of all drives on the SAT2 card, and the temperature is returned from the IE log page on my two drives that support temperature. (The SATA disks on the motherboard attached to the cmdk driver work, but SMART fails because they are treated as ATA drives, so ignore those.)

What appears broken is that I cannot access any of the logs, such as the typical counters that I *know* exist. If we look at the sg3_utils output (prettier than smartctl):

src/sg_logs /dev/rdsk/c1t0d0s0
    ATA       WDC WD15EADS-00R   0A01
Supported log pages:
    0x00  Supported log pages
    0x00  Supported log pages
    0x03  Error counters (read)
    0x04  Error counters (read reverse)
    0x00  Supported log pages
    0x10  Self-test results
    0x2f  Informational exceptions (SMART)
    0x30  Performance counters (Hitachi)

The first strange thing is that I would EXPECT to see a list in numerical order with 0x00 appearing only once! Well, no matter, I thought - but then a query on page 3 tells me I have an illegal field in the CDB:

sg_logs -p 3 /dev/rdsk/c1t0d0s0
    ATA       WDC WD15EADS-00R   0A01
log_sense: field in cdb illegal

(smartctl said the same thing). It turns out only the last 4 pages (0x00, 0x10, 0x2f, 0x30) actually work. The first 4 data values almost seem as if the returned data (12 bytes total: 4 header, 8 data) should actually be 8 header for some reason (although this is not correct for a LOG SENSE parameter block). The log sense command is run to find the expected data length, and it returns "12".

Delving further, I found (but I'm now out of my depth) that there was a total of 3072 bytes apparently coming back from the log sense command (according to both sg3 and smartctl)!!!??? sg3 shows it neatly:

sg_logs -v /dev/rdsk/c1t0d0s0
    inquiry cdb: 12 00 00 00 24 00
    ATA       WDC WD15EADS-00R   0A01
    log sense cdb: 4d 00 40 00 00 00 00 00 04 00
    log sense: requested 4 bytes but got -3068 bytes
    log sense cdb: 4d 00 40 00 00 00 00 00 0c 00
    log sense: requested 12 bytes but got -3060 bytes
Supported log pages:
    0x00  Supported log pages
    0x00  Supported log pages
    0x03  Error counters (read)
    0x04  Error counters (read reverse)
    0x00  Supported log pages
    0x10  Self-test results
    0x2f  Informational exceptions (SMART)
    0x30  Performance counters (Hitachi)

Not sure about the minus sign - maybe a wrap? Especially given 3068 + 4 == 3060 + 12 == 3072. I hacked the smartctl source to give me its count:

logSense: pagelen = 12 (# bytes the driver stated would be returned)
Resid = 3072 (# bytes actually returned)

Sooo I'm stuck, and I still can't access my drive stats! :-(

Any help would be very gratefully received.

cheers
Kim

== System Details

uname -a -> SunOS bigbertha 5.11 NexentaOS_20081207 i86pc i386 i86pc Solaris
motherboard: Intel Server SE7320SP2
SATA card: Supermicro AOC-SAT2-MV8 PCI-X

prtpicl extract ...
pci8086,25ae (pci, d80132)
  pci11ab,11ab (obp-device, d80155)
    disk (block, d8017f)
    disk (block, d801a2)
    disk (block, d801c5)
    disk (block, d801e8)
    disk (block, d8020b)
...
pci-ide (pci-ide, d80405)
  ide (obp-device, d8042c)
  ide (obp-device, d80435)
pci-ide (pci-ide, d8043e)
  ide (ide, d80466)
    cmdk (block, d80477)
  ide (ide, d8048e)
    sd (cdrom, d804a5)

/etc/path_to_inst
"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3" 0 "marvell88sx"
"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3/d...@0,0" 3 "sd"
"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3/d...@2,0" 4 "sd"
"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3/d...@3,0" 5 "sd"
"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3/d...@4,0" 6 "sd"
"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3/d...@1,0" 9 "sd"
"/p...@0,0/pci-...@1f,1" 0 "pci-ide"
"/p...@0,0/pci-...@1f,1/i...@0" 0 "ata"
"/p...@0,0/pci-...@1f,1/i...@0/s...@0,0" 1 "sd"
"/p...@0,0/pci-...@1f,1/i...@1" 1 "ata"
"/p...@0,0/pci-...@1f,2" 1 "pci-ide"
"/p...@0,0/pci-...@1f,2/i...@0" 2 "ata"
"/p...@0,0/pci-...@1f,2/i...@0/c...@0,0" 0 "cmdk"
"/p...@0,0/pci-...@1f,2/i...@0/s...@0,0" 7 "sd"
"/p...@0,0/pci-...@1f,2/i...@1" 3 "ata"
"/p...@0,0/pci-...@1f,2/i...@1/s...@0,0" 0 "sd"
"/p...@0,0/pci-...@1f,2/i...@1/s...@1,0" 8 "sd"
"/p...@0,0/pci-...@1f,2/i...@1/c...@0,0" 1 "cmdk"

and for what it's worth:

85 -rwxr-xr-x 1 root sys 85592 Dec  8  2008 /kernel/drv/amd64/marvell88sx
58 -rwxr-xr-x 1 root
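(For reference, the supported-log-pages reply being mis-sized above is just a 4-byte header plus one byte per page code. A small Python sketch of how it should decode per SPC - byte 0 is the page code, bytes 2-3 the big-endian parameter length - using a 12-byte buffer shaped like the reply in the trace; the byte values are illustrative, not captured from the drive.)

def parse_supported_log_pages(buf):
    # LOG SENSE page 0x00 response: 4-byte header (page code, subpage,
    # big-endian length), then one byte per supported page code.
    page_code = buf[0] & 0x3F
    length = int.from_bytes(buf[2:4], "big")
    if page_code != 0x00 or length > len(buf) - 4:
        raise ValueError("not a well-formed supported-log-pages response")
    return ["0x%02x" % code for code in buf[4:4 + length]]

# 12 bytes = 4-byte header + 8 page codes, matching the sizes in the trace.
example = bytes([0x00, 0x00, 0x00, 0x08,
                 0x00, 0x00, 0x03, 0x04, 0x00, 0x10, 0x2F, 0x30])
print(parse_supported_log_pages(example))
# ['0x00', '0x00', '0x03', '0x04', '0x00', '0x10', '0x2f', '0x30']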
Re: [zfs-discuss] Question about user/group quotas
Greg Mason wrote:

Thanks for the link, Richard. I guess the next question is, how safe would it be to run snv_114 in production? Running something that would be technically "unsupported" makes a few folks here understandably nervous...

You mentioned you run Linux clients. Are they all under a support contract? Do you actually have a support contract for OpenSolaris 2009.06? (If not, then personally I'd say there is zero difference - but I'm an OpenSolaris developer and I'm used to living on the latest builds.)

-- Darren J Moffat