Re: [zfs-discuss] Single disk parity

2009-07-10 Thread Haudy Kazemi

Richard Elling wrote:
There are many error correcting codes available.  RAID2 used Hamming 
codes, but that's just one of many options out there.  Par2 uses 
configurable strength Reed-Solomon to get multi bit error 
correction.  The par2 source is available, although from a ZFS 
perspective is hindered by the CDDL-GPL license incompatibility problem.
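
As a rough illustration of that configurable strength, the par2cmdline
tool lets you pick the redundancy percentage at creation time (the
percentage and filenames below are just examples):

   # create ~15% Reed-Solomon recovery data for a file
   par2 create -r15 important.par2 important.dat
   # later, after suspected damage
   par2 verify important.par2
   par2 repair important.par2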


It is possible to write a FUSE filesystem using Reed-Solomon (like 
par2) as the underlying protection.  A quick search of the FUSE 
website turns up the Reed-Solomon FS (a FUSE-based filesystem):
"Shielding your files with Reed-Solomon codes" 
http://ttsiodras.googlepages.com/rsbep.html


While most FUSE work is on Linux, and there is a ZFS-FUSE project for 
it, there has also been FUSE work done for OpenSolaris:

http://www.opensolaris.org/os/project/fuse/


BTW, if you do have the case where unprotected data is not
readable, then I have a little DTrace script that I'd like you to run
which would help determine the extent of the corruption.  This is
one of those studies which doesn't like induced errors ;-)
http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
Is this intended as a general monitoring script, or only for use after 
one has otherwise experienced corruption problems?



It is intended to try to answer the question of whether the errors we see
in real life might be single-bit errors.  I do not believe they will be
single-bit errors, but we don't have the data.

To be pedantic, wouldn't protected data also be affected if all 
copies are damaged at the same time, especially if also damaged in 
the same place?


Yep.  Which is why there is RFE CR 6674679, complain if all data
copies are identical and corrupt.
-- richard


There is a related but unlikely scenario that is also probably not 
covered yet.  I'm not sure what kind of common cause would lead to it.  
Maybe a disk array turning into swiss cheese, with bad sectors suddenly 
showing up on multiple drives?  Its probability increases with larger 
logical block sizes (e.g. 128k blocks are at higher risk than 4k blocks; 
a block being the smallest piece of storage real estate used by the 
filesystem).  It is the edge case where multiple copies of a block are 
damaged, but the damage consists of unreadable bad sectors at different 
offsets within each copy.  This could be recovered from by taking the 
readable sectors from one copy and filling in the holes with the 
corresponding sectors from the other copies.  The final result, a 
rebuilt block, should pass the checksum test, assuming there were no 
other problems with the still-readable sectors.
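
A minimal sketch of that merge, assuming the two copies of a 128k block
sit at known sector offsets on two raw devices (device names and offsets
are placeholders, and ZFS itself does not currently do this
reconstruction for you):

   BS=512; SECTORS=256                 # 256 x 512-byte sectors = one 128k block
   OFF_A=123456; OFF_B=654321          # placeholder sector offsets of each copy
   for ((i = 0; i < SECTORS; i++)); do
     # take each sector from whichever copy can still read it
     dd if=/dev/rdsk/c1t0d0s0 of=sec.tmp bs=$BS skip=$((OFF_A + i)) count=1 2>/dev/null ||
     dd if=/dev/rdsk/c1t1d0s0 of=sec.tmp bs=$BS skip=$((OFF_B + i)) count=1 2>/dev/null ||
     { echo "sector $i unreadable in both copies" >&2; exit 1; }
     dd if=sec.tmp of=rebuilt.blk bs=$BS seek=$i count=1 conv=notrunc 2>/dev/null
   done
   # rebuilt.blk should then pass the block's checksum if nothing else is wrong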


---

A bad-sector-specific recovery technique is to instruct the disk to 
return raw read data rather than trying to correct it.  The READ LONG 
command can do this (though the specs say it only works with 28-bit 
LBA).  READ LONG corresponds to writes done with WRITE LONG (28-bit) or 
WRITE UNCORRECTABLE EXT (48-bit); Linux hdparm uses these write commands 
when it is used to create bad sectors with the --make-bad-sector option.  
The resulting sectors are logically bad at a low level, in that the 
sector's data and ECC do not match; they are not physically bad.  With 
multiple read attempts, a statistical distribution of the likely 'true' 
contents of the sector can be found.  SpinRite claims to do this.  Linux 
'hdparm --read-sector' can sometimes return data from nominally bad 
sectors too, but it doesn't have a built-in statistical recovery method 
(a wrapper script could probably solve that; a first cut is sketched 
below).  I don't know whether hdparm --read-sector uses READ LONG or not.
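
Such a wrapper might look something like this (hypothetical; the parsing
of hdparm --read-sector output is approximate and the try count is
arbitrary):

   #!/bin/bash
   # repeatedly read one suspect sector and tally the distinct results
   DEV=${1:?usage: $0 device lba [tries]}; LBA=${2:?lba}; TRIES=${3:-50}
   for ((i = 0; i < TRIES; i++)); do
     # strip the banner lines, keep only the hex dump, one result per line
     hdparm --read-sector "$LBA" "$DEV" 2>/dev/null | tail -n +3 | tr -d ' \n'
     echo
   done | sort | uniq -c | sort -rn | head -5
   # the most frequent line is the statistically most likely 'true' content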

hdparm man page: http://linuxreviews.org/man/hdparm/

Good description of IDE commands, including READ LONG and WRITE LONG 
(the specs say they are 28-bit only):
http://www.repairfaq.org/filipg/LINK/F_IDE-tech.html

SCSI versions of READ LONG and WRITE LONG:
http://en.wikipedia.org/wiki/SCSI_Read_Commands#Read_Long
http://en.wikipedia.org/wiki/SCSI_Write_Commands#Write_Long

Here is a report by forum member "qubit" modifying his Linux taskfile 
driver to use READ LONG for data recovery purposes, and his subsequent 
analysis:


http://forums.storagereview.net/index.php?showtopic=5910
http://www.tech-report.com/news_reply.x/3035
http://techreport.com/ja.zz?comments=3035&page=5

-- quote --
318. Posted at 07:00 am on Jun 6th 2002 by qubit

My DTLA-307075 (75GB 75GXP) went bad 6 months ago. But I didn't write 
off the data as being unrecoverable. I used WinHex to make a ghost image 
of the drive onto my new larger one, zeroing out the bad sectors in the 
target while logging each bad sector. (There were bad sectors in the FAT 
so I combined the good parts from FATs 1 and 2.) At this point I had a 
working mirror of the drive that went bad, with zeroed-out 512 byte 
holes in files where the bad sectors were.


Then I set the 75GXP aside, because I knew it was possible to recover 
some of the data *on* the bad sectors, but I didn't have the tools to do 
it. So I decided to wait until then to RMA it.


I did 

[zfs-discuss] Slow Resilvering Performance

2009-07-10 Thread Galen
I know this topic has been discussed many times... but what the hell  
makes zpool resilvering so slow? I'm running OpenSolaris 2009.06.


I have had a large number of problematic disks due to a bad production  
batch, leading me to resilver quite a few times, progressively  
replacing each disk as it dies (and now preemptively removing disks.)  
My complaint is that resilvering ends up taking... days! The average  
write rate to the disk being resilvered is 1 to 3 MB/sec.


You can see zpool status and iostat -v output here:
http://pastebin.com/mcbb8dfd

When I read files off the zpool, I get quite a few MB/sec even in a  
degraded state, although the zpool is idle while resilvering in this  
case - no snapshots or anything happening on it. The system has 3 GB  
of RAM and a 2.8 GHz dual core CPU which is always >90% idle while  
resilvering. The number of I/O operations per second is nowhere near  
the disk's limits. Scrubbing takes 3-4 hours at the most, so it's  
clearly not a read bottleneck. Even if I have a configuration where  
only one disk is being replaced (and all others are OK), I never pass  
the 1-3 MB/sec limit.
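
For anyone wanting to watch where the time goes, two views worth
comparing side by side while the resilver runs (the pool name here is a
placeholder):

   zpool iostat -v tank 10     # per-vdev read/write throughput and ops
   iostat -xn 10               # per-disk service times, queue depth, %busy

If the disk being written shows a high asvc_t with few outstanding ops,
the limit is per-I/O latency rather than raw bandwidth.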


What is going on? I have had to resilver 4 times so far, and I have to  
resilver at least once more. Each resilvering takes a day or two, and  
I can't see why... it's not CPU, it's not sustained read throughput,  
it's not IOPS, so what is it??


Galen


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-10 Thread Mark Michael
Worked great during test jumpstart, thanks!


Re: [zfs-discuss] cannot receive new filesystem stream: invalid backup stream

2009-07-10 Thread Ian Collins

Alexander Skwar wrote:

Hallo.

I'm trying to do a "zfs send -R" from an S10 U6 SPARC system to a Solaris 10 U7 SPARC
system. The filesystem in question is running version 1.

Here's what I did:

$ fs=data/oracle ; snap=transfer.hot-b ; sudo zfs send -R $...@$snap |
sudo rsh winds07-bge0 "zfs create rpool/trans/winds00r/${fs%%/*} || :
; zfs recv -u -v -F -d rpool/trans/winds00r/${fs%%/*}"
receiving full stream of data/ora...@transfer.initial-hot into
rpool/trans/winds00r/data/ora...@transfer.initial-hot
received 15.0KB stream in 6 seconds (2.50KB/sec)
receiving incremental stream of data/ora...@transfer.hot-b into
rpool/trans/winds00r/data/ora...@transfer.hot-b
received 312B stream in 3 seconds (104B/sec)
receiving full stream of data/oracle/u...@transfer.initial-hot into
rpool/trans/winds00r/data/oracle/u...@transfer.initial-hot
cannot receive new filesystem stream: invalid backup stream

Why did the "send" or "receive" fail with "invalid backup stream"?
Is sending from U6 to U7 not supported?

  
It is supported, but I don't think -R is supported with a version 1 
filesystem.
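
If the version-1 filesystem really is the blocker, one possible
workaround (untested here; dataset names taken from the original post)
would be to bring it up to a newer on-disk version before retrying the
recursive send:

   zfs upgrade data/oracle                       # upgrade fs version on the sender
   zfs send -R data/oracle@transfer.hot-b | ...  # then retry the -R send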


--
Ian.



Re: [zfs-discuss] Problem with mounting ZFS from USB drive

2009-07-10 Thread Karl Dalen
Thanks for the great tips. I did some more testing and
indeed it was a version issue. The pool was created under:
# zpool upgrade
This system is currently running ZFS version 14.

whereas I tried it on systems with versions 10 and 12.
It could be imported on a newer system using the -f option.
I suppose it did not auto-mount the pool because it
had the same name as an existing pool. Is this a known
issue with ZFS?
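
For what it's worth, a name clash can be worked around by importing the
pool under a different name (pool names here are examples only):

   zpool import -f tank usbtank      # import the USB pool as "usbtank"
   zpool export usbtank              # export it again before unplugging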

I assume because of this portability issue ZFS is not really suitable
for use on removable media such as USB drives that are intended to
be mounted on different hosts which may have different Solaris versions.
It may be better to stick with UFS for this case.

/KarlD


Re: [zfs-discuss] Very slow ZFS write speed to raw zvol

2009-07-10 Thread Eric C. Taylor
Writes using the character interface (/dev/zvol/rdsk) are synchronous.
If you want caching, you can go through the block interface
(/dev/zvol/dsk) instead.
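
A quick way to see the difference for yourself (volume path and sizes
are only examples):

   # raw (character) device: each write is synchronous
   dd if=/dev/zero of=/dev/zvol/rdsk/tank/vol1 bs=128k count=1000
   # block device: writes go through the cache
   dd if=/dev/zero of=/dev/zvol/dsk/tank/vol1 bs=128k count=1000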

-  Eric


[zfs-discuss] cannot receive new filesystem stream: invalid backup stream

2009-07-10 Thread Alexander Skwar
Hallo.

I'm trying to do a "zfs send -R" from an S10 U6 SPARC system to a Solaris 10 U7 SPARC
system. The filesystem in question is running version 1.

Here's what I did:

$ fs=data/oracle ; snap=transfer.hot-b ; sudo zfs send -R $...@$snap |
sudo rsh winds07-bge0 "zfs create rpool/trans/winds00r/${fs%%/*} || :
; zfs recv -u -v -F -d rpool/trans/winds00r/${fs%%/*}"
receiving full stream of data/ora...@transfer.initial-hot into
rpool/trans/winds00r/data/ora...@transfer.initial-hot
received 15.0KB stream in 6 seconds (2.50KB/sec)
receiving incremental stream of data/ora...@transfer.hot-b into
rpool/trans/winds00r/data/ora...@transfer.hot-b
received 312B stream in 3 seconds (104B/sec)
receiving full stream of data/oracle/u...@transfer.initial-hot into
rpool/trans/winds00r/data/oracle/u...@transfer.initial-hot
cannot receive new filesystem stream: invalid backup stream

Why did the "send" or "receive" fail with "invalid backup stream"?
Is sending from U6 to U7 not supported?
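
For reference, a couple of commands that should show whether a
filesystem-version mismatch is actually in play (this is only a guess
at the cause):

   zfs get -r version data/oracle    # on the sending host
   zfs upgrade -v                    # on the receiving host: versions it supports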

Alexander

PS: Yes, it's about Solaris 10. There are also a lot of other threads
about S10, so it doesn't seem like S10 is off-topic here. Additionally,
the ZFS "gurus" are to be found here, which makes this list even more
appropriate for such a question, I'd think.
But if this question is wrong here, please be so kind as to point me to
the right place for it. Thanks a lot!


Re: [zfs-discuss] SMART problems with AOC-SAT2-MV8 / marvell88sx driver

2009-07-10 Thread Ross
I don't know how relevant this is to you on Nexenta, but I can tell you that 
the driver support for that card improved tremendously with OpenSolaris 
2008.11.  All of our hot swap problems went away with that release but the 
change wasn't documented anywhere that I could see.

It might be worth simply booting off a 2009.06 live CD and seeing if any of 
these commands work.


Re: [zfs-discuss] Status of zpool remove in raidz and non-redundant stripes

2009-07-10 Thread Wout Mertens
You're right - in my company (a very big one) we just stumbled across this as 
well and we're strongly considering not using ZFS because of it.

It's easy to type zpool add when you meant zpool replace - and then you can go 
rebuild your box because it was the root pool. Nice.

At the very least, "zpool add" should have more warnings.
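
Until then, the -n (dry-run) flag at least shows what the pool would
look like before anything is committed (pool and device names below are
examples):

   zpool add -n rpool c1t1d0          # preview only: would add a new top-level vdev
   zpool replace rpool c1t0d0 c1t1d0  # what was actually intended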


[zfs-discuss] SMART problems with AOC-SAT2-MV8 / marvell88sx driver

2009-07-10 Thread Kim Lester



I've been trying to get either smartctl or sg3_utils to report properly.
They both have the same low-level problems, which leads me to suspect  
that either I'm doing something wrong OR there is a problem in the  
marvell88sx / sd / SATA etc. framework.
I can access the drive name/serial number of all drives on the SAT2 card,  
and the temperature is returned from the IE log page on my two drives  
that support temperature.
(The SATA disks on the motherboard that are attached to the cmdk  
driver work, but SMART fails because they are treated as ATA drives, so  
ignore those.)


What appears broken is that I cannot access any of the logs, such as the  
typical counters that I *know* exist.


If we look at sg3_utils output (prettier than smartctl)

src/sg_logs  /dev/rdsk/c1t0d0s0
ATA   WDC WD15EADS-00R  0A01
Supported log pages:
  0x00        Supported log pages
  0x00        Supported log pages
  0x03        Error counters (read)
  0x04        Error counters (read reverse)
  0x00        Supported log pages
  0x10        Self-test results
  0x2f        Informational exceptions (SMART)
  0x30        Performance counters (Hitachi)

The first strange thing is that I would EXPECT to see a list in  
numerical order, with 0x00 appearing only once!
Well, no matter, I thought; but then a query on page 3 tells me I  
have an illegal field in the CDB.


sg_logs -p 3 /dev/rdsk/c1t0d0s0
ATA   WDC WD15EADS-00R  0A01
log_sense: field in cdb illegal

(smartctl said the same thing).

It turns out only the last 4 pages (00, 10, 2f, 30) actually work.  The  
first 4 data values almost look as if the returned data (12 bytes  
total: 4 header, 8 data) should actually be 8 bytes of header for some  
reason (although that would not be correct for a LOG SENSE parameter  
block).  The LOG SENSE command is first run to find the expected data  
length, and it returns "12".


Delving further, I found (though I'm now out of my depth) that there was  
a total of 3072 bytes apparently coming back from the log sense command  
(according to both sg3 and smartctl)!!!???

sg3 shows it neatly.

sg_logs -v /dev/rdsk/c1t0d0s0
inquiry cdb: 12 00 00 00 24 00
ATA   WDC WD15EADS-00R  0A01
log sense cdb: 4d 00 40 00 00 00 00 00 04 00
log sense: requested 4 bytes but got -3068 bytes
log sense cdb: 4d 00 40 00 00 00 00 00 0c 00
log sense: requested 12 bytes but got -3060 bytes
Supported log pages:
  0x00        Supported log pages
  0x00        Supported log pages
  0x03        Error counters (read)
  0x04        Error counters (read reverse)
  0x00        Supported log pages
  0x10        Self-test results
  0x2f        Informational exceptions (SMART)
  0x30        Performance counters (Hitachi)


Not sure about the minus sign; maybe a wrap?  Especially given  
3068 + 4 == 3060 + 12 == 3072.

I hacked smartctl src to give me its count:

logSense:
pagelen =12  (# bytes driver stated would be returned)
Resid  = 3072   (#bytes actually returned)
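
Two invocations that might help isolate whether the SAT layer is at
fault (option availability is assumed from the sg3_utils and
smartmontools documentation, not verified on this Nexenta build):

   sg_logs --maxlen=512 -p 3 /dev/rdsk/c1t0d0s0   # force a larger allocation length
   smartctl -d sat -l error /dev/rdsk/c1t0d0s0    # talk to the disk via the SAT layer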


Sooo I'm stuck and I still can't access my drive stats! :-(

Any help would be very gratefully received.

cheers

Kim

==
System Details

uname -a -> SunOS bigbertha 5.11 NexentaOS_20081207 i86pc i386 i86pc Solaris


motherboard: Intel Server  SE7320SP2
SATA card: Supermicro AOC-SAT2-MV8  PCI-X

prtpicl extract

...
 pci8086,25ae (pci, d80132)
 pci11ab,11ab (obp-device, d80155)
 disk (block, d8017f)
 disk (block, d801a2)
 disk (block, d801c5)
 disk (block, d801e8)
 disk (block, d8020b)
...
 pci-ide (pci-ide, d80405)
 ide (obp-device, d8042c)
 ide (obp-device, d80435)
 pci-ide (pci-ide, d8043e)
 ide (ide, d80466)
 cmdk (block, d80477)
 ide (ide, d8048e)
 sd (cdrom, d804a5)

/etc/path_to_inst

"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3" 0 "marvell88sx"
"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3/d...@0,0" 3 "sd"
"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3/d...@2,0" 4 "sd"
"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3/d...@3,0" 5 "sd"
"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3/d...@4,0" 6 "sd"
"/p...@0,0/pci8086,2...@1c/pci11ab,1...@3/d...@1,0" 9 "sd"
"/p...@0,0/pci-...@1f,1" 0 "pci-ide"
"/p...@0,0/pci-...@1f,1/i...@0" 0 "ata"
"/p...@0,0/pci-...@1f,1/i...@0/s...@0,0" 1 "sd"
"/p...@0,0/pci-...@1f,1/i...@1" 1 "ata"
"/p...@0,0/pci-...@1f,2" 1 "pci-ide"
"/p...@0,0/pci-...@1f,2/i...@0" 2 "ata"
"/p...@0,0/pci-...@1f,2/i...@0/c...@0,0" 0 "cmdk"
"/p...@0,0/pci-...@1f,2/i...@0/s...@0,0" 7 "sd"
"/p...@0,0/pci-...@1f,2/i...@1" 3 "ata"
"/p...@0,0/pci-...@1f,2/i...@1/s...@0,0" 0 "sd"
"/p...@0,0/pci-...@1f,2/i...@1/s...@1,0" 8 "sd"
"/p...@0,0/pci-...@1f,2/i...@1/c...@0,0" 1 "cmdk"

and for what it's worth
85 -rwxr-xr-x 1 root sys 85592 Dec  8  2008 /kernel/drv/amd64/marvell88sx

58 -rwxr-xr-x 1 root

Re: [zfs-discuss] Question about user/group quotas

2009-07-10 Thread Darren J Moffat

Greg Mason wrote:

Thanks for the link Richard,

I guess the next question is, how safe would it be to run snv_114 in
production? Running something that would be technically "unsupported"
makes a few folks here understandably nervous...


You mentioned you run Linux clients.  Are they all under a support 
contract?  Do you actually have a support contract for OpenSolaris 
2009.06?  (If not, then personally I'd say there is zero difference; but 
then I'm an OpenSolaris developer and I'm used to living on the latest 
builds.)


--
Darren J Moffat