Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread Robert Milkowski
Hello can,

Thursday, December 13, 2007, 12:02:56 AM, you wrote:

cyg On the other hand, there's always the possibility that someone
cyg else learned something useful out of this.  And my question about

To be honest - there's basically nothing useful in the thread,
except perhaps one thing: it doesn't make any sense to listen to you.

You're just unable to talk to people.





-- 
Best regards,
 Robert                          mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Auto backup and auto restore of ZFS via Firewire drive

2007-12-13 Thread Ross
Hey folks,

This may not be the best place to ask this, but I'm so new to Solaris I really 
don't know anywhere better.  If anybody can suggest a better forum I'm all ears 
:)

I've heard of Tim Foster's autobackup utility, which can automatically back up a 
ZFS filesystem to a USB drive as it's connected.  What I'd like to know is whether 
there's a way of doing something similar for a FireWire drive?

Also, is it possible to run several backups at once onto that drive?  Is there 
any way I can script a bunch of commands to run as the drive is connected?

thanks,

Ross
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS NAS Cluster

2007-12-13 Thread Vic Cornell
Dear All,

First of all, thanks for a fascinating list - it's my first read of the
morning.

Secondly I would like to ask a question. We currently have an EMC Celerra
NAS which we use for CIFS, NFS and iSCSI. It's not our favourite piece of
hardware and it is nearing the limits of its capacity (TB). We have two
options: 

1) Expand the solution. Spend £££s, double the number of heads, double
the capacity and carry on as before.

2) Look for something else.

I have been watching ZFS for some time and have implemented it in
several niche applications. I would like to be able to consider using ZFS as
the basis of a NAS solution based around SAN storage, T{2,5}000 servers and
Sun Cluster.

Here is my wish list:

Flexible provisioning (thin if possible)
Hardware resilience/Transparent Failover
Asynchronous Replication to remote site (1km) providing DR cover.
NFS/CIFS/iSCSI
Snaps/Cloning
No single point of failure
Integration with Active Directory/NFS
Ability to restripe data onto widened pools.
Ability to migrate data between storage pools.

As I understand it, the combination of ZFS and Sun Cluster will give me all of
the above. Has anybody done this? How mature/stable is it? I understand that
SunCluster/HA-ZFS is supported but there seems to be little that I can find
on the web about it. Any information would be gratefully received.

Best Regards,

Vic

-- 
Vic Cornell
UNIX Systems Administrator 
Landmark Information Group Limited

5-7 Abbey Court, Eagle Way, Sowton, 
Exeter, Devon, EX2 7HY

T:  01392 888690 
M: 07900 660266
F:  01392 441709

www.landmarkinfo.co.uk





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread MP
 this anti-raid-card movement is puzzling. 

I think you've misinterpreted my questions.
I queried the necessity of paying extra for a seemingly unnecessary RAID card 
for ZFS. I didn't doubt that it could perform better.
Wasn't one of the design briefs of ZFS that it would provide its feature set 
without expensive RAID hardware?
Of course, if you have the money then you can always go faster, but this is a 
zfs discussion thread (I know I've perpetuated the extravagant cross-posting of 
the OP).
Cheers.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror a slice

2007-12-13 Thread Shawn Joy
What are the commands? Everything I see is c1t0d0, c1t1d0 - no 
slice, just the whole disk.



Robert Milkowski wrote:
 Hello Shawn,
 
 Thursday, December 13, 2007, 3:46:09 PM, you wrote:
 
 SJ Is it possible to bring one slice of a disk under zfs control and 
 SJ leave the others as ufs?
 
 SJ A customer is trying to mirror one slice using zfs.
 
 
 Yes, it's possible - it just works.
 
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread Eric Haycraft
People.. for the n-teenth time, there are only two ways to kill a troll. One 
involves a woodchipper and the possibility of an unwelcome visit from the FBI, 
and the other involves ignoring them. 

Internet Trolls:
http://en.wikipedia.org/wiki/Internet_troll
http://www.linuxextremist.com/?p=34

Another perspective:
http://sc.tri-bit.com/images/7/7e/greaterinternetfu#kwadtheory.jpg

The irony of this whole thing is that by feeding Bill's trollish tendencies, he 
has effectively eliminated himself from any job or contract where someone 
googles his name, and thus will give him an enormous amount of time to troll 
forums. Who in their right mind would consciously hire someone who calls people 
idiots at random to avoid the topic at hand? Being unemployed will just piss him 
off more and his trolling will only get worse. Hence, you don't feed trolls!!
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] mirror a slice

2007-12-13 Thread Shawn Joy
Is it possible to bring one slice of a disk under zfs control and 
leave the others as ufs?

A customer is trying to mirror one slice using zfs.

Please respond to me directly and to the alias.

Thanks,
Shawn

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror a slice

2007-12-13 Thread Cindy . Swearingen
Shawn,

Using slices for ZFS pools is generally not recommended so I think
we minimized any command examples with slices:

# zpool create tank mirror c1t0d0s0 c1t1d0s0
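
(And if the pool already exists on a single slice, attaching a second slice
converts it to a two-way mirror while leaving the other slices free for UFS -
the pool and device names below are only examples:)

# zpool attach tank c1t0d0s0 c1t1d0s0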

Keep in mind that using the slices from the same disk for both UFS
and ZFS makes administration more complex. Please see the ZFS BP
section here:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pools

* The recovery process of replacing a failed disk is more complex when 
  disks contain both ZFS and UFS file systems on slices.
* ZFS pools (and underlying disks) that also contain UFS file systems on 
  slices cannot be easily migrated to other systems by using zpool import 
  and export features.
* In general, maintaining slices increases administration time and cost. 
  Lower your administration costs by simplifying your storage pool 
  configuration model.

Cindy

Shawn Joy wrote:
 What are the commands? Everything I see is c1t0d0, c1t1d0.   no 
 slice just the completed disk.
 
 
 
 Robert Milkowski wrote:
 
Hello Shawn,

Thursday, December 13, 2007, 3:46:09 PM, you wrote:

SJ Is it possible to bring one slice of a disk under zfs control and 
SJ leave the others as ufs?

SJ A customer is trying to mirror one slice using zfs.


Yes, it's possible - it just works.


 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread can you guess?
 Hello can,
 
 Thursday, December 13, 2007, 12:02:56 AM, you wrote:
 
 cyg On the other hand, there's always the possibility that someone
 cyg else learned something useful out of this.  And my question about
 
 To be honest - there's basically nothing useful in the thread,
 except perhaps one thing: it doesn't make any sense to listen to you.

I'm afraid you don't qualify to have an opinion on that, Robert - because you 
so obviously *haven't* really listened.  Until it became obvious that you never 
would, I was willing to continue to attempt to carry on a technical discussion 
with you, while ignoring the morons here who had nothing whatsoever in the way 
of technical comments to offer (but continued to babble on anyway).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding external USB disks

2007-12-13 Thread Eric Haycraft
You may want to peek here first. Tim has some scripts already, and if they're not 
exactly what you want, I am sure they could be reverse engineered. 

http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people


Eric
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror a slice

2007-12-13 Thread Toby Thain

On 13-Dec-07, at 1:56 PM, Shawn Joy wrote:

 What are the commands? Everything I see is c1t0d0, c1t1d0.   no
 slice just the completed disk.


I have used the following HOWTO. (Markup is TWiki, FWIW.)



Device names are for a 2-drive X2100. Other machines may differ, for  
example, X4100 drives may be =c3t2d0= and =c3t3d0=.

---++ Partitioning

This is done before installing Solaris 10, or after installing a new  
disk to replace a failed mirror disk.
* Run *format*, choose the correct disk device
* Enter *fdisk* from menu
* Delete any diagnostic partition, and existing Solaris partition
* Create one Solaris2 partition over 100% of the disk
* Exit *fdisk*; quit *format*

---++ Slice layout

|slice 0| root| 8192M| -- this is not really large enough :-)
|slice 1| swap| 2048M|
|slice 2| -||
|slice 3| SVM metadb| 16M|
|slice 4| zfs| 68200M|
|slice 5| SVM metadb| 16M|
|slice 6| -||
|slice 7| SVM metadb| 16M|

The final slice layout should be saved using =prtvtoc /dev/rdsk/c1d0s2 > vtoc=

The second (mirror) disk can be forced into the same layout using  
=fmthard -s vtoc /dev/rdsk/c2d0s2=
(Replacement drives must be partitioned in exactly the same way, so  
it is recommended that a copy of the vtoc be kept in a file.)

GRUB must also be installed on the second disk:
=/sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c2d0s0=

---++ Solaris Volume Manager setup

The root and swap slices will be mirrored using SVM. See:
* http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#UFS.2FSVM
* http://sunsolve.sun.com/search/document.do?assetkey=1-9-83605-1

(As of Sol10U2 (June 06), ZFS is not supported for root partition.)

At this point the system has been installed on, and booted from the  
first disk, c1d0s0 (as root) and with swap from the same disk. The  
following steps set up SVM but don't interfere with currently mounted  
partitions. The second disk has already been partitioned identically  
to the first, and the data will be copied to the mirror after  
=metattach= below. Changing =/etc/vfstab= sets the machine to boot  
from the SVM mirror device in future.

* Create SVM metadata (slice 3) with redundant copies on slices 5 and 7: %BR% =metadb -a -f c1d0s3 c2d0s3 c1d0s5 c2d0s5 c1d0s7 c2d0s7=
* Create submirrors on first disk (root and swap slices): %BR% =metainit -f d10 1 1 c1d0s0= %BR% =metainit -f d11 1 1 c1d0s1=
* Create submirrors on second disk: %BR% =metainit -f d20 1 1 c2d0s0= %BR% =metainit -f d21 1 1 c2d0s1=
* Create the mirrors: %BR% =metainit d0 -m d10= %BR% =metainit d1 -m d11=
* Take a backup copy of =/etc/vfstab=
* Define root slice: =metaroot d0= (this alters the mount device for / in =/etc/vfstab=; it should now be =/dev/md/dsk/d0=)
* Edit =/etc/vfstab= (changing the device for swap to =/dev/md/dsk/d1=)
* Reboot to test. If there is a problem, use single user mode and revert vfstab. Confirm that root and swap devices are now the mirrored devices with =df= and =swap -l=
* Attach second halves to mirror: %BR% =metattach d0 d20= %BR% =metattach d1 d21=

Mirror will now begin to sync; progress can be checked with =metastat -c=

---+++ Also see

* [[http://slacksite.com/solaris/disksuite/disksuite.html recipe]] at slacksite.com

---++ ZFS setup

Slice 4 is set aside for the ZFS pool - the system's active data.

* Create pool: =zpool create pool mirror c1d0s4 c2d0s4=
* Create filesystem for home directories: =zfs create pool/home= %BR% (To make this active, move any existing home directories from =/home= into =/pool/home=; then =zfs set mountpoint=/home pool/home=; log out; and log back in.)
* Set up a regular scrub - add to =crontab= a line such as: =0 4 1 * * zpool scrub pool=
<verbatim>
bash-3.00# zpool create pool mirror c1d0s4 c2d0s4
bash-3.00# zpool status
  pool: pool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s4  ONLINE       0     0     0
            c2d0s4  ONLINE       0     0     0

errors: No known data errors
bash-3.00# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
pool  75.5K  65.5G  24.5K  /pool
bash-3.00#
</verbatim>

---++ References
* [[http://docs.sun.com/app/docs/doc/819-5461 ZFS Admin Guide]]
* [[http://docs.sun.com/app/docs/doc/816-4520 SVM Admin Guide]]





 Robert Milkowski wrote:
 Hello Shawn,

 Thursday, December 13, 2007, 3:46:09 PM, you wrote:

 SJ Is it possible to bring one slice of a disk under zfs control and
 SJ leave the others as ufs?

 SJ A customer is trying to mirror one slice using zfs.


 Yes, it's possible - it just works.


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread Jim Mauro
Would you two please SHUT THE F$%K UP.

Dear God, my kids don't go on like this.

Please - let it die already.

Thanks very much.

/jim


can you guess? wrote:
 Hello can,

 Thursday, December 13, 2007, 12:02:56 AM, you wrote:

 cyg On the other hand, there's always the possibility that someone
 cyg else learned something useful out of this.  And my question about

 To be honest - there's basically nothing useful in the thread,
 except perhaps one thing: it doesn't make any sense to listen to you.
 

 I'm afraid you don't qualify to have an opinion on that, Robert - because you 
 so obviously *haven't* really listened.  Until it became obvious that you 
 never would, I was willing to continue to attempt to carry on a technical 
 discussion with you, while ignoring the morons here who had nothing 
 whatsoever in the way of technical comments to offer (but continued to babble 
 on anyway).

 - bill
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZIL and snapshots

2007-12-13 Thread Moore, Joe
I'm using an x4500 as a large data store for our VMware environment.  I
have mirrored the first 2 disks, and created a ZFS pool of the other 46:
22 pairs of mirrors, and 2 spares (optimizing for random I/O performance
rather than space).  Datasets are shared to the VMware ESX servers via
NFS.  We noticed that VMware mounts its NFS datastore with the SYNC
option, so every NFS write gets flagged with FILE_SYNC.  In testing,
synchronous writes are significantly slower than async, presumably
because of the strict ordering required for correctness (cache flushing
and ZIL).

Can anyone tell me if a ZFS snapshot taken when zil_disable=1 will be
crash-consistent with respect to the data written by VMware?  Are the
snapshot metadata updates serialized with pending non-metadata writes?
If an asynchronous write is issued before the snapshot is initiated, is
it guaranteed to be in the snapshot data, or can it be reordered to
after the snapshot?  Does a snapshot flush pending writes to disk?

To increase performance, the users are willing to lose an hour or two
of work (these are development/QA environments): In the event that the
x4500 crashes and loses the 16GB of cached (zil_disable=1) writes, we
roll back to the last hourly snapshot, and everyone's back to the way
they were.  However, I want to make sure that we will be able to boot a
crash-consistent VM from that rolled-back virtual disk.
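
(For reference, a minimal sketch of the setup described above - the dataset
name tank/vmstore and the snapshot labels are made up, and zil_disable is a
global, unsupported tunable that affects every pool on the host:)

# Disable the ZIL (takes effect at next boot) - add to /etc/system:
set zfs:zil_disable = 1

# Hourly snapshot of the VMware dataset, e.g. from root's crontab:
0 * * * * /usr/sbin/zfs snapshot tank/vmstore@hourly-`date +\%H`

# After a crash, discard the lost writes by rolling back to the last
# good hourly snapshot (-r destroys any more recent snapshots):
zfs rollback -r tank/vmstore@hourly-14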

Thanks for any knowledge you might have,
--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread Steve McKinty
I have a couple of questions and concerns about using ZFS in an environment 
where the underlying LUNs are replicated at a block level using products like 
HDS TrueCopy or EMC SRDF.  Apologies in advance for the length, but I wanted 
the explanation to be clear.

(I do realise that there are other possibilities such as zfs send/recv and 
there are technical and business pros and cons for the various options. I don't 
want to start a 'which is best' argument :) )

The CoW design of ZFS means that it goes to great lengths to always maintain 
on-disk self-consistency, and ZFS can make certain assumptions about state (e.g 
not needing fsck) based on that.  This is the basis of my questions. 

1) First issue relates to the überblock.  Updates to it are assumed to be 
atomic, but if the replication block size is smaller than the überblock then we 
can't guarantee that the whole überblock is replicated as an entity.  That 
could in theory result in a corrupt überblock at the
secondary. 

Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS 
just use an alternate überblock and rewrite the damaged one transparently?

2) Assuming that the replication maintains write-ordering, the secondary site 
will always have valid and self-consistent data, although it may be out-of-date 
compared to the primary if the replication is asynchronous, depending on link 
latency, buffering, etc. 

Normally most replication systems do maintain write ordering, [i]except[/i] for 
one specific scenario.  If the replication is interrupted, for example 
secondary site down or unreachable due to a comms problem, the primary site 
will keep a list of changed blocks.  When contact between the sites is 
re-established there will be a period of 'catch-up' resynchronization.  In 
most, if not all, cases this is done on a simple block-order basis.  
Write-ordering is lost until the two sites are once again in sync and routine 
replication restarts. 

I can see this as having a major ZFS impact.  It would be possible for 
intermediate blocks to be replicated before the data blocks they point to, and 
in the worst case an updated überblock could be replicated before the block 
chains that it references have been copied.  This breaks the assumption that 
the on-disk format is always self-consistent. 

If a disaster happened during the 'catch-up', and the partially-resynchronized 
LUNs were imported into a zpool at the secondary site, what would/could happen? 
Refusal to accept the whole zpool? Rejection just of the files affected? System 
panic? How could recovery from this situation be achieved?

Obviously all filesystems can suffer with this scenario, but ones that expect 
less from their underlying storage (like UFS) can be fscked, and although data 
that was being updated is potentially corrupt, existing data should still be OK 
and usable.  My concern is that ZFS will handle this scenario less well. 

There are ways to mitigate this, of course, the most obvious being to take a 
snapshot of the (valid) secondary before starting resync, as a fallback.  This 
isn't always easy to do, especially since the resync is usually automatic; 
there is no clear trigger to use for the snapshot. It may also be difficult to 
synchronize the snapshot of all LUNs in a pool. I'd like to better understand 
the risks/behaviour of ZFS before starting to work on mitigation strategies. 

Thanks

Steve
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Richard Elling
MP wrote:
 this anti-raid-card movement is puzzling. 
 

 I think you've misinterpreted my questions.
 I queried the necessity of paying extra for an seemingly unnecessary RAID 
 card for zfs. I didn't doubt that it could perform better.
 Wasn't one of the design briefs of zfs, that it would provide it's feature 
 set without expensive RAID hardware?
   

In general, feature set != performance.  For example, a VIA 
x86-compatible processor
is not capable of beating the performance of a high-end Xeon, though the 
feature sets
are largely the same.  Additional examples abound.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to properly tell zfs of new GUID controller numbers after a firmware upgrade changes the IDs

2007-12-13 Thread Shawn Ferry
Jill,

I was recently looking for a similar solution to try and reconnect a
renumbered device while the pool was live.

e.g. zpool online mypool <old target> <old target at new location>
As in zpool replace, but with the indication that this isn't a new
device.

What I have been doing to deal with the renumbering is exactly the
export, import and clear.  Although I have been dealing with  
significantly
smaller devices and can't speak to the delay issues.

Shawn



On Dec 13, 2007, at 12:16 PM, Jill Manfield wrote:


 My customer's zfs pools and their 6540 disk array had a firmware  
 upgrade that changed GUIDs so we need a procedure to let the zfs  
 know it changed. They are getting errors as if they replaced  
 drives.  But I need to make sure you know they have not replaced  
 any drives, and no drives have failed or are bad. As such, they  
 have no interest in wiping any disks clean as indicated in 88130  
 info doc.

 Some background from customer:

 We have a large 6540 disk array, on which we have configured a  
 series of
 large RAID luns.  A few days ago, Sun sent a technician to upgrade the
 firmware of this array, which worked fine but which had the  
 deleterious
 effect of changing the Volume IDs associated with each lun.  So, the
 resulting luns now appear to our solaris 10 host (under mpxio) as  
 disks in
 /dev/rdsk with different 'target' components than they had before.

 Before the firmware upgrade we took the precaution of creating  
 duplicate
 luns on a different 6540 disk array, and using these to mirror each  
 of our
 zfs pools (as protection in case the firmware upgrade corrupted our  
 luns).

 Now, we simply want to ask zfs to find the devices under their new
 targets, recognize that they are existing zpool components, and have  
 it
 correct the configuration of each pool.  This would be similar to  
 having
 Veritas vxvm re-scan all disks with vxconfigd in the event of a
 controller renumbering event.

 The proper zfs method for doing this, I believe, is to simply do:

 zpool export mypool
 zpool import mypool

 Indeed, this has worked fine for me a few times today, and several  
 of our
 pools are now back to their original mirrored configuration.

 Here is a specific example, for the pool ospf.

 The zpool status after the upgrade:

 diamond:root[1105]-zpool status ospf
  pool: ospf
 state: DEGRADED
 status: One or more devices could not be opened.  Sufficient replicas
 exist for
the pool to continue functioning in a degraded state.
 action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Tue Dec 11 18:26:53 2007
 config:

        NAME                                    STATE     READ WRITE CKSUM
        ospf                                    DEGRADED     0     0     0
          mirror                                DEGRADED     0     0     0
            c27t600A0B8000292B024BDC4731A7B8d0  UNAVAIL      0     0     0  cannot open
            c27t600A0B800032619A093747554A08d0  ONLINE       0     0     0

 errors: No known data errors

 This is due to the fact that the LUN which used to appear as
 c27t600A0B8000292B024BDC4731A7B8d0 is now actually
 c27t600A0B8000292B024D5B475E6E90d0.  It's the same LUN, but  
 since the
 firmware changed the Volume ID, the target portion is different.

 Rather than treating this as a replaced disk (which would incur an
 entire mirror resilvering, and would require the trick you sent of
 obliterating the disk label so the in use safeguard could be  
 avoided),
 we simply want to ask zfs to re-read its configuration to find this  
 disk.

 So we do this:

 diamond:root[1110]-zpool export -f ospf
 diamond:root[]-zpool import ospf

 and sure enough:

 diamond:root[1112]-zpool status ospf
  pool: ospf
 state: ONLINE
 status: One or more devices is currently being resilvered.  The pool  
 will
continue to function, possibly in a degraded state.
 action: Wait for the resilver to complete.
 scrub: resilver in progress, 0.16% done, 2h53m to go
 config:

        NAME                                    STATE     READ WRITE CKSUM
        ospf                                    ONLINE       0     0     0
          mirror                                ONLINE       0     0     0
            c27t600A0B8000292B024D5B475E6E90d0  ONLINE       0     0     0
            c27t600A0B800032619A093747554A08d0  ONLINE       0     0     0

 errors: No known data errors

 (Note that it has self-initiated a resilvering, since in this case the
 mirror has been changed by users since the firmware upgrade.)

 The problem that Robert had was that when he initiated an export of  
 a pool
 (called bgp) it froze for quite some time.  The corresponding  
 import
 of the same pool took 12 hours to complete.  I have not been able to
 replicate this myself, but that was the essence of 

Re: [zfs-discuss] Finding external USB disks

2007-12-13 Thread Tim Foster

On Wed, 2007-12-12 at 21:35 -0600, David Dyer-Bennet wrote:
 What are the approaches to finding what external USB disks are currently 
 connected?

Would rmformat -l or eject -l fit the bill ?

 The external USB backup disks in question have ZFS filesystems on them, 
 which may make a difference in finding them perhaps?

Nice.

I dug around a bit with this a while back, and I'm not sure hal &
friends are doing the right thing with zpools on removable devices just
yet.  I'd expect that we'd have a zpool import triggered on a device
being plugged in, analogous to the way we have pcfs disks automatically
mounted by the system. Indeed there's

/usr/lib/hal/hal-storage-zpool-export
/usr/lib/hal/hal-storage-zpool-import and
/etc/hal/fdi/policy/10osvendor/20-zfs-methods.fdi

but I haven't seen them actually doing anything useful when I insert a
disk with a pool on it. Does anyone know whether these should be working
now? I'm not a hal expert...

 I've glanced at Tim Foster's autobackup and related scripts, and they're 
 all about being triggered by the plug connection being made; which is 
 not what I need.

Yep, fair enough.

   I don't actually want to start the big backup when I 
 plug in (or power on) the drive in the evening, it's supposed to wait 
 until late (to avoid competition with users).  (His autosnapshot script 
 may be just what I need for that part, though.)

The zfs-auto-snapshot service can perform a backup using a command set
in the zfs/backup-save-cmd property.

Setting that to be a script that automagically selects a USB device (from a
known list, or one with free space?) and points the stream at a relevant
zfs recv command into the pool provided by your backup device might be
just what you're after.
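
(Something along these lines might be a starting point - entirely untested,
and it assumes the service pipes the zfs send stream to the command on stdin;
the pool name "backup" and the script path are made up:)

#!/bin/ksh
# /usr/local/bin/usb-backup.ksh (hypothetical): receive the incoming
# 'zfs send' stream from zfs-auto-snapshot into a pool on a USB disk.
BACKUP_POOL=backup

# Import the backup pool if it isn't already online.
zpool list $BACKUP_POOL > /dev/null 2>&1 || zpool import $BACKUP_POOL || exit 1

# Receive the stream from stdin, preserving the dataset layout.
exec zfs receive -d $BACKUP_POOL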

Perhaps this is a project for the Christmas holidays :-)

cheers,
tim


-- 
Tim Foster, Sun Microsystems Inc, Solaris Engineering Ops
http://blogs.sun.com/timf

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror a slice

2007-12-13 Thread Richard Elling
[EMAIL PROTECTED] wrote:
 Shawn,

 Using slices for ZFS pools is generally not recommended so I think
 we minimized any command examples with slices:

 # zpool create tank mirror c1t0d0s0 c1t1d0s0
   

Cindy,
I think the term "generally not recommended" requires more context.  In
the case of a small system, particularly one which you would find on a
laptop or desktop, it is often the case that disks share multiple purposes,
beyond ZFS.  I think the way we have written this in the best practices
wiki is fine, but perhaps we should ask the group at large.  Thoughts anyone?

I do like the minimization for the examples, though.  If one were to
actually read any of the manuals, we clearly talk about how whole disks
or slices are fine.  However, on occasion someone will propagate the news
that ZFS only works with whole disks and we have to correct the confusion
afterwards.
 -- richard
 Keep in mind that using the slices from the same disk for both UFS
 and ZFS makes administration more complex. Please see the ZFS BP
 section here:

 http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pools

 * The recovery process of replacing a failed disk is more complex when 
 disks contain both ZFS and UFS file systems on
 slices.
   * ZFS pools (and underlying disks) that also contain UFS file systems 
 on slices cannot be easily migrated to other
 systems by using zpool import and export features.
   * In general, maintaining slices increases administration time and 
 cost. Lower your administration costs by
 simplifying your storage pool configuration model.

 Cindy

 Shawn Joy wrote:
   
 What are the commands? Everything I see is c1t0d0, c1t1d0.   no 
 slice just the completed disk.



 Robert Milkowski wrote:

 
 Hello Shawn,

 Thursday, December 13, 2007, 3:46:09 PM, you wrote:

 SJ Is it possible to bring one slice of a disk under zfs control and 
 SJ leave the others as ufs?

 SJ A customer is trying to mirror one slice using zfs.


 Yes, it's possible - it just works.


   
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread MP
 Additional examples abound.

Doubtless :)

More usefully, can you confirm whether Solaris works on this chassis without 
the RAID controller?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror a slice

2007-12-13 Thread Toby Thain

On 13-Dec-07, at 3:54 PM, Richard Elling wrote:

 [EMAIL PROTECTED] wrote:
 Shawn,

 Using slices for ZFS pools is generally not recommended so I think
 we minimized any command examples with slices:

 # zpool create tank mirror c1t0d0s0 c1t1d0s0


 Cindy,
 I think the term generally not recommended requires more  
 context.  In
 the case
 of a small system, particularly one which you would find on a  
 laptop or
 desktop,
 it is often the case that disks share multiple purposes, beyond ZFS.


In particular in a 2-disk system that boots from UFS (that was my  
situation).

--Toby

 I
 think the
 way we have written this in the best practices wiki is fine, but  
 perhaps
 we should
 ask the group at large.  Thoughts anyone?

 I do like the minimization for the examples, though.  If one were to
 actually
 read any of the manuals, we clearly talk about how whole disks or  
 slices
 are fine.  However, on occasion someone will propagate the news  
 that ZFS
 only works with whole disks and we have to correct the confusion  
 afterwards.
  -- richard
 Keep in mind that using the slices from the same disk for both UFS
 and ZFS makes administration more complex. ...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What does dataset is busy actually mean?

2007-12-13 Thread Jim Klimov
I've hit the problem myself recently, and mounting the filesystem cleared 
something in the brains of ZFS and allowed me to snapshot.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg00812.html

PS: I'll use Google before asking some questions,  a'la (C) Bart Simpson
That's how I found your question ;)
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread can you guess?
 Would you two please SHUT THE F$%K UP.

Just for future reference, if you're attempting to squelch a public 
conversation it's often more effective to use private email to do it rather 
than contribute to the continuance of that public conversation yourself.

Have a nice day!

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
 Are there benchmarks somewhere showing a RAID10
 implemented on an LSI card with, say, 128MB of cache
 being beaten in terms of performance by a similar
 zraid configuration with no cache on the drive
 controller?
 
 Somehow I don't think they exist. I'm all for data
 scrubbing, but this anti-raid-card movement is
 puzzling.

Oh, for joy - a chance for me to say something *good* about ZFS, rather than 
just trying to balance out excessive enthusiasm.

Save for speeding up synchronous writes (if it has enough on-board NVRAM to 
hold them until it's convenient to destage them to disk), a RAID-10 card should 
not enjoy any noticeable performance advantage over ZFS mirroring.

By contrast, if extremely rare undetected and (other than via ZFS checksums) 
undetectable (or considerably more common undetected but detectable via disk 
ECC codes, *if* the data is accessed) corruption occurs, if the RAID card is 
used to mirror the data there's a good chance that even ZFS's validation scans 
won't see the problem (because the card happens to access the good copy for the 
scan rather than the bad one) - in which case you'll lose that data if the disk 
with the good data fails.  And in the case of (extremely rare) 
otherwise-undetectable corruption, if the card *does* return the bad copy then 
IIRC ZFS (not knowing that a good copy also exists) will just claim that the 
data is gone (though I don't know if it will then flag it such that you'll 
never have an opportunity to find the good copy).

If the RAID card scrubs its disks the difference (now limited to the extremely 
rare undetectable-via-disk-ECC corruption) becomes pretty negligible - but I'm 
not sure how many RAIDs below the near-enterprise category perform such scrubs.

In other words, if you *don't* otherwise scrub your disks then ZFS's 
checksums-plus-internal-scrubbing mechanisms assume greater importance:  it's 
only the contention that other solutions that *do* offer scrubbing can't 
compete with ZFS in effectively protecting your data that's somewhat over the 
top.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding external USB disks

2007-12-13 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 What are the approaches to finding what external USB disks are currently
 connected?   I'm starting on backup scripts, and I need to check which
 volumes are present before I figure out what to back up to them.  I  
 . . .

In addition to what others have suggested so far, cfgadm -l lists usb-
and firewire-connected drives (even those plugged-in but not mounted).
So scripts can check that way as well.
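
(A rough illustration of that kind of check - untested, and the grep
pattern is a guess that may need adjusting for the cfgadm output on
your particular box:)

#!/bin/ksh
# List attached USB/FireWire disks, then see whether any exportable
# pools live on them ('zpool import' with no arguments just scans
# attached devices for pools that are not currently imported).
cfgadm -l | grep -i usb
rmformat -l
zpool import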

Regards,

Marion



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread Richard Elling
Steve McKinty wrote:
 I have a couple of questions and concerns about using ZFS in an environment 
 where the underlying LUNs are replicated at a block level using products like 
 HDS TrueCopy or EMC SRDF.  Apologies in advance for the length, but I wanted 
 the explanation to be clear.

 (I do realise that there are other possibilities such as zfs send/recv and 
 there are technical and business pros and cons for the various options. I 
 don't want to start a 'which is best' argument :) )

 The CoW design of ZFS means that it goes to great lengths to always maintain 
 on-disk self-consistency, and ZFS can make certain assumptions about state 
 (e.g not needing fsck) based on that.  This is the basis of my questions. 

 1) First issue relates to the überblock.  Updates to it are assumed to be 
 atomic, but if the replication block size is smaller than the überblock then 
 we can't guarantee that the whole überblock is replicated as an entity.  That 
 could in theory result in a corrupt überblock at the
 secondary. 
   

The uberblock contains a circular queue of updates.  For all practical
purposes, this is COW.  The updates I measure are usually 1 block
(or, to put it another way, I don't recall seeing more than 1 block being
updated... I'd have to recheck my data)

 Will this be caught and handled by the normal ZFS checksumming? If so, does 
 ZFS just use an alternate überblock and rewrite the damaged one transparently?

   

The checksum should catch it.  To be safe, there are 4 copies of the 
uberblock.

 2) Assuming that the replication maintains write-ordering, the secondary site 
 will always have valid and self-consistent data, although it may be 
 out-of-date compared to the primary if the replication is asynchronous, 
 depending on link latency, buffering, etc. 

 Normally most replication systems do maintain write ordering, [i]except[/i] 
 for one specific scenario.  If the replication is interrupted, for example 
 secondary site down or unreachable due to a comms problem, the primary site 
 will keep a list of changed blocks.  When contact between the sites is 
 re-established there will be a period of 'catch-up' resynchronization.  In 
 most, if not all, cases this is done on a simple block-order basis.  
 Write-ordering is lost until the two sites are once again in sync and routine 
 replication restarts. 

 I can see this has having major ZFS impact.  It would be possible for 
 intermediate blocks to be replicated before the data blocks they point to, 
 and in the worst case an updated überblock could be replicated before the 
 block chains that it references have been copied.  This breaks the assumption 
 that the on-disk format is always self-consistent. 

 If a disaster happened during the 'catch-up', and the 
 partially-resynchronized LUNs were imported into a zpool at the secondary 
 site, what would/could happen? Refusal to accept the whole zpool? Rejection 
 just of the files affected? System panic? How could recovery from this 
 situation be achieved?
   

I think all of these reactions to the double-failure mode are possible.
The version of ZFS used will also have an impact as the later versions
are more resilient.  I think that in most cases, only the affected files
will be impacted.  zpool scrub will ensure that everything is consistent
and mark those files which fail to checksum properly.
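
(In practice that means something like the following - pool name made up:)

  zpool scrub tank        # re-read and verify every block in the pool
  zpool status -v tank    # afterwards, lists any files with unrecoverable errors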

 Obviously all filesystems can suffer with this scenario, but ones that expect 
 less from their underlying storage (like UFS) can be fscked, and although 
 data that was being updated is potentially corrupt, existing data should 
 still be OK and usable.  My concern is that ZFS will handle this scenario 
 less well. 
   

...databases too...
It might be easier to analyze this from the perspective of the transaction
group than an individual file.  Since ZFS is COW, you may have a
state where a transaction group is incomplete, but the previous data
state should be consistent.

 There are ways to mitigate this, of course, the most obvious being to take a 
 snapshot of the (valid) secondary before starting resync, as a fallback.  
 This isn't always easy to do, especially since the resync is usually 
 automatic; there is no clear trigger to use for the snapshot. It may also be 
 difficult to synchronize the snapshot of all LUNs in a pool. I'd like to 
 better understand the risks/behaviour of ZFS before starting to work on 
 mitigation strategies. 

   

I don't see how snapshots would help.  The inherent transaction group 
commits
should be sufficient.  Or, to look at this another way, a snapshot is 
really just
a metadata change.

I am more worried about how the storage admin sets up the LUN groups.
The human factor can really ruin my day...
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Frank Cusack
On December 13, 2007 9:47:00 AM -0800 MP [EMAIL PROTECTED] wrote:
 Additional examples abound.

 Doubtless :)

 More usefully, can you confirm whether Solaris works on this chassis
 without the RAID controller?

way back, i had Solaris working with a promise j200s (jbod sas) chassis,
to the extent that the sas driver at the time worked.  i can't IMAGINE
why this chassis would be any different from Solaris' perspective.

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Frank Cusack
On December 13, 2007 11:34:54 AM -0800 can you guess? 
[EMAIL PROTECTED] wrote:
 By contrast, if extremely rare undetected and (other than via ZFS
 checksums) undetectable (or considerably more common undetected but
 detectable via disk ECC codes, *if* the data is accessed) corruption
 occurs, if the RAID card is used to mirror the data there's a good chance
 that even ZFS's validation scans won't see the problem (because the card
 happens to access the good copy for the scan rather than the bad one) -
 in which case you'll lose that data if the disk with the good data fails.
 And in the case of (extremely rare) otherwise-undetectable corruption, if
 the card *does* return the bad copy then IIRC ZFS (not knowing that a
 good copy also exists) will just claim that the data is gone (though I
 don't know if it will then flag it such that you'll never have an
 opportunity to find the good copy).

i like this answer, except for what you are implying by extremely rare.

 If the RAID card scrubs its disks the difference (now limited to the
 extremely rare undetectable-via-disk-ECC corruption) becomes pretty
 negligible - but I'm not sure how many RAIDs below the near-enterprise
 category perform such scrubs.

 In other words, if you *don't* otherwise scrub your disks then ZFS's
 checksums-plus-internal-scrubbing mechanisms assume greater importance:
 it's only the contention that other solutions that *do* offer scrubbing
 can't compete with ZFS in effectively protecting your data that's
 somewhat over the top.

the problem with your discounting of zfs checksums is that you aren't
taking into account that extremely rare is relative to the number of
transactions, which are extremely high.  in such a case even extremely
rare errors do happen, and not just to extremely few folks, but i would
say to all enterprises.  hell it happens to home users.

when the difference between an unrecoverable single bit error is not just
1 bit but the entire file, or corruption of an entire database row (etc),
those small and infrequent errors are an extremely big deal.

considering all the pieces, i would much rather run zfs on a jbod than
on a raid, wherever i could.  it gives better data protection, and it
is ostensibly cheaper.

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool version 3 Uberblock version 9 , zpool upgrade only half succeeded?

2007-12-13 Thread kristof
We are currently experiencing a huge performance drop on our zfs storage 
server.

We have 2 pools: pool 1, stor, is a raidz out of 7 iscsi nodes; home is a local 
mirror pool. Recently we had some issues with one of the storage nodes, and because 
of that the pool was degraded. Since we did not succeed in bringing this 
storage node back online (at the zfs level) we upgraded our NAS head from opensolaris 
b57 to b77. After the upgrade we successfully resilvered the pool (the resilver took 1 
week! - 14 TB). Finally we upgraded the pool to version 9 (coming from 
version 3). Now the zpool is healthy again, but performance really s*cks. Accessing 
older data takes way too much time. Doing dtruss -a find . in a zfs filesystem 
on this b77 server is extremely slow, while it is fast in our backup location 
where we are still using opensolaris b57 and zpool version 3. 

Writing new data seems normal; we don't see huge issues here. The real problem 
is doing ls, rm or find in filesystems with lots of files (+5, not in 1 
directory but spread over multiple subfolders).

Today I found that not only does zpool upgrade exist, but also zfs upgrade; most 
filesystems are still at version 1 while some new ones are already at version 3. 
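
(For reference, something like the following lists and raises the filesystem
versions - note that, like zpool upgrade, this is a one-way operation, and
the filesystem name below is only an example:)

zfs upgrade              # show the on-disk version of each filesystem
zfs upgrade -a           # upgrade all filesystems to the latest supported version
zfs upgrade stor/data    # or upgrade a single filesystem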

Running zdb we also saw there is a mismatch in version information: our 
storage pool is listed as version 3 while the uberblock is at version 9. When we 
run zpool upgrade, it tells us all pools are upgraded to the latest version.

below the zdb output: 

zdb stor 
version=3 
name='stor' 
state=0 
txg=6559447 
pool_guid=14464037545511218493 
hostid=341941495 
hostname='fileserver011' 
vdev_tree 
type='root' 
id=0 
guid=14464037545511218493 
children[0] 
type='raidz' 
id=0 
guid=179558698360846845 
nparity=1 
metaslab_array=13 
metaslab_shift=37 
ashift=9 
asize=20914156863488 
is_log=0 
children[0] 
type='disk' 
id=0 
guid=640233961847538260 
path='/dev/dsk/c2t3d0s0' 
devid='id1,[EMAIL PROTECTED]/a' 
phys_path='/iscsi/[EMAIL PROTECTED],0:a' 
whole_disk=1 
DTL=36 
children[1] 
type='disk' 
id=1 
guid=7833573669820754721 
path='/dev/dsk/c2t4d0s0' 
devid='id1,[EMAIL PROTECTED]/a' 
phys_path='/iscsi/[EMAIL PROTECTED],0:a' 
whole_disk=1 
DTL=22 
children[2] 
type='disk' 
id=2 
guid=13685988517147825972 
path='/dev/dsk/c2t5d0s0' 
devid='id1,[EMAIL PROTECTED]/a' 
phys_path='/iscsi/[EMAIL PROTECTED],0:a' 
whole_disk=1 
DTL=17 
children[3] 
type='disk' 
id=3 
guid=13514021245008793227 
path='/dev/dsk/c2t6d0s0' 
devid='id1,[EMAIL PROTECTED]/a' 
phys_path='/iscsi/[EMAIL PROTECTED],0:a' 
whole_disk=1 
DTL=21 
children[4] 
type='disk' 
id=4 
guid=15871506866153751690 
path='/dev/dsk/c2t9d0s0' 
devid='id1,[EMAIL PROTECTED]/a' 
phys_path='/iscsi/[EMAIL PROTECTED],0:a' 
whole_disk=1 
DTL=20 
children[5] 
type='disk' 
id=5 
guid=11392907262189654902 
path='/dev/dsk/c2t7d0s0' 
devid='id1,[EMAIL PROTECTED]/a' 
phys_path='/iscsi/[EMAIL PROTECTED],0:a' 
whole_disk=1 
DTL=19 
children[6] 
type='disk' 
id=6 
guid=8472117762643335828 
path='/dev/dsk/c2t8d0s0' 
devid='id1,[EMAIL PROTECTED]/a' 
phys_path='/iscsi/[EMAIL PROTECTED],0:a' 
whole_disk=1 
DTL=18 
Uberblock 

magic = 00bab10c 
version = 9 
txg = 6692849 
guid_sum = 12266969233845513474 
timestamp = 1197546530 UTC = Thu Dec 13 12:48:50 2007 
fileserver

If we compare with zpool home (this pool was created after installing 

Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
...

 when the difference between an unrecoverable single
 bit error is not just
 1 bit but the entire file, or corruption of an entire
 database row (etc),
 those small and infrequent errors are an extremely
 big deal.

You are confusing unrecoverable disk errors (which are rare but orders of 
magnitude more common) with otherwise *undetectable* errors (the occurrence of 
which is at most once in petabytes by the studies I've seen, rather than once 
in terabytes), despite my attempt to delineate the difference clearly.  
Conventional approaches using scrubbing provide as complete protection against 
unrecoverable disk errors as ZFS does:  it's only the far rarer otherwise 
*undetectable* errors that ZFS catches and they don't.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL and snapshots

2007-12-13 Thread Ross
Heh, interesting to see somebody else using the sheer number of disks in the 
Thumper to their advantage :)

Have you thought of solid state cache for the ZIL?  There's a 16GB battery 
backed PCI card out there, I don't know how much it costs, but the blog where I 
saw it mentioned a 20x improvement in performance for small random writes.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
...

  If the RAID card scrubs its disks
 
 A scrub without checksum puts a huge burden on disk
 firmware and  
 error reporting paths :-)

Actually, a scrub without checksum places far less burden on the disks and 
their firmware than ZFS-style scrubbing does, because it merely has to scan the 
disk sectors sequentially rather than follow a tree path to each relatively 
small leaf block.  Thus it also compromises runtime operation a lot less as 
well (though in both cases doing it infrequently in the background should 
usually reduce any impact to acceptable levels).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL and snapshots

2007-12-13 Thread Moore, Joe
 Have you thought of solid state cache for the ZIL?  There's a 
 16GB battery backed PCI card out there, I don't know how much 
 it costs, but the blog where I saw it mentioned a 20x 
 improvement in performance for small random writes.

Thought about it, looked in the Sun Store, couldn't find one, and cut
the PO.

Haven't gone back to get a new approval.  I did put a couple of the
MTron 32GB SSD drives on the christmas wishlist (aka 2008 budget)
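
(For reference, once such a device does show up, a recent enough pool version
can take it as a dedicated intent-log device - pool and device names below are
made up:)

zpool add tank log c5t0d0        # or, mirrored: zpool add tank log mirror c5t0d0 c6t0d0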

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread can you guess?
Great questions.

 1) First issue relates to the überblock.  Updates to
 it are assumed to be atomic, but if the replication
 block size is smaller than the überblock then we
 can't guarantee that the whole überblock is
 replicated as an entity.  That could in theory result
 in a corrupt überblock at the
 secondary. 
 
 Will this be caught and handled by the normal ZFS
 checksumming? If so, does ZFS just use an alternate
 überblock and rewrite the damaged one transparently?

ZFS already has to deal with potential uberblock partial writes if it contains 
multiple disk sectors (and it might be prudent even if it doesn't, as Richard's 
response seems to suggest).  Common ways of dealing with this problem include 
dumping it into the log (in which case the log with its own internal recovery 
procedure becomes the real root of all evil) or cycling around at least two 
locations per mirror copy (Richard's response suggests that there are 
considerably more, and that perhaps each one is written in quadruplicate) such 
that the previous uberblock would still be available if the new write tanked.  
ZFS-style snapshots complicate both approaches unless special provisions are 
taken - e.g., copying the current uberblock on each snapshot and hanging a list 
of these snapshot uberblock addresses off the current uberblock, though even 
that might run into interesting complications under the scenario which you 
describe below.  Just using the 'queue' that Richard describes to accumulate 
snapshot uberblocks would limit the number of concurrent snapshots to less than 
the size of that queue.

In any event, as long as writes to the secondary copy don't continue after a 
write failure of the kind that you describe has occurred (save for the kind of 
catch-up procedure that you mention later), ZFS's internal facilities should 
not be confused by encountering a partial uberblock update at the secondary, 
any more than they'd be confused by encountering it on an unreplicated system 
after restart.

 
 2) Assuming that the replication maintains
 write-ordering, the secondary site will always have
 valid and self-consistent data, although it may be
 out-of-date compared to the primary if the
 replication is asynchronous, depending on link
 latency, buffering, etc. 
 
 Normally most replication systems do maintain write
 ordering, [i]except[/i] for one specific scenario.
 If the replication is interrupted, for example
 secondary site down or unreachable due to a comms
 problem, the primary site will keep a list of
 changed blocks.  When contact between the sites is
 re-established there will be a period of 'catch-up'
 resynchronization.  In most, if not all, cases this
 is done on a simple block-order basis.
 Write-ordering is lost until the two sites are once
  again in sync and routine replication restarts. 
 
 I can see this has having major ZFS impact.  It would
 be possible for intermediate blocks to be replicated
 before the data blocks they point to, and in the
 worst case an updated überblock could be replicated
 before the block chains that it references have been
 copied.  This breaks the assumption that the on-disk
 format is always self-consistent. 
 
 If a disaster happened during the 'catch-up', and the
 partially-resynchronized LUNs were imported into a
 zpool at the secondary site, what would/could happen?
 Refusal to accept the whole zpool? Rejection just of
 the files affected? System panic? How could recovery
 from this situation be achieved?

My inclination is to say "By repopulating your environment from backups":  it 
is not reasonable to expect *any* file system to operate correctly, or to 
attempt any kind of comprehensive recovery (other than via something like fsck, 
with no guarantee of how much you'll get back), when the underlying hardware 
transparently reorders updates which the file system has explicitly ordered 
when it presented them.

But you may well be correct in suspecting that there's more potential for 
data loss should this occur in a ZFS environment than in update-in-place 
environments where only portions of the tree structure that were explicitly 
changed during the connection hiatus would likely be affected by such a 
recovery interruption (though even there if a directory changed enough to 
change its block structure on disk you could be in more trouble).

 
 Obviously all filesystems can suffer with this
 scenario, but ones that expect less from their
 underlying storage (like UFS) can be fscked, and
 although data that was being updated is potentially
 corrupt, existing data should still be OK and usable.
 My concern is that ZFS will handle this scenario
  less well. 
 
 There are ways to mitigate this, of course, the most
 obvious being to take a snapshot of the (valid)
 secondary before starting resync, as a fallback.

You're talking about an HDS- or EMC-level snapshot, right?

 This isn't always easy to do, especially since the
 resync is usually automatic; there is no clear
 

Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.

2007-12-13 Thread Henk Langeveld
J.P. King wrote:
 Wow, that a neat idea, and crazy at the same time. But the mknod's minor
 value can be 0-262143 so it probably would be doable with some loss of
 memory and efficiency. But maybe not :) (I would need one lofi dev per
 filesystem right?)

 Definitely worth remembering if I need to do something small/quick.
 
 You're confusing lofi and lofs, I think.  Have a look at man lofs.
 
 Now all _I_ would like is translucent options to that and I'd solve one of 
 my major headaches.

Check ast-open [1] for the 3d command that implements nDFS, the 
multiple-dimension file system, which lets you overlay directories.
The 3d [2] utility allows you to run a command with all of its file 
system calls intercepted.

Any writes go into the top-level directory, while reads pass
through until a matching file is found.

System calls are intercepted by an LD_PRELOAD library, so each
process can have its own settings.
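
For illustration only -- the library path and directory below are made up, 
not the real ast-open install locations, so check the package docs for the 
actual names -- the general LD_PRELOAD pattern is simply:

  # hypothetical interposer library; calls it interposes on (open, stat, ...)
  # are redirected before they reach libc
  LD_PRELOAD=/usr/local/lib/lib3d.so ls /some/overlaid/dir

Since the interposition rides on the environment variable, only processes 
started with it set (and their children) are affected.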


[1] http://www.research.att.com/~gsf/download/gen/ast-open.html
[2] http://www.research.att.com/~gsf/man/man1/3d.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Frank Cusack
On December 13, 2007 12:51:55 PM -0800 can you guess? 
[EMAIL PROTECTED] wrote:
 ...

 when the damage from an unrecoverable single
 bit error is not just
 1 bit but the entire file, or corruption of an entire
 database row (etc),
 those small and infrequent errors are an extremely
 big deal.

 You are confusing unrecoverable disk errors (which are rare but orders of
 magnitude more common) with otherwise *undetectable* errors (the
 occurrence of which is at most once in petabytes by the studies I've
 seen, rather than once in terabytes), despite my attempt to delineate the
 difference clearly.

No I'm not.  I know exactly what you are talking about.

  Conventional approaches using scrubbing provide as
 complete protection against unrecoverable disk errors as ZFS does:  it's
 only the far rarer otherwise *undetectable* errors that ZFS catches and
 they don't.

yes.  far rarer and yet home users still see them.

that the home user ever sees these extremely rare (undetectable) errors
may have more to do with poor connection (cables, etc) to the disk, and
less to do with disk media errors.  enterprise users probably have
better connectivity and see errors due to high i/o.  just thinking
out loud.

regardless, zfs on non-raid (i.e. with zfs providing the redundancy
itself) gives better protection than zfs layered on hardware raid
(well, depending on the raid configuration), so just from the data
integrity POV non-raid would generally be preferred.  the fact that the
type of error being prevented is rare doesn't change that, and i was
further arguing that even though it's rare the impact can be high, so
you don't want to write it off.

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 You are confusing unrecoverable disk errors (which are rare but orders of
 magnitude more common) with otherwise *undetectable* errors (the occurrence
 of which is at most once in petabytes by the studies I've seen, rather than
 once in terabytes), despite my attempt to delineate the difference clearly.

I could use a little clarification on how these unrecoverable disk errors
behave -- or maybe a lot, depending on one's point of view.

So, when one of these "once in around ten (or 100) terabytes read" events
occurs, my understanding is that a read error is returned by the drive,
and the corresponding data is lost as far as the drive is concerned.
Maybe just a bit is gone, maybe a byte, maybe a disk sector; it probably
depends on the disk, OS, driver, and/or the rest of the I/O hardware
chain.  Am I doing OK so far?


 Conventional approaches using scrubbing provide as complete protection
 against unrecoverable disk errors as ZFS does:  it's only the far rarer
 otherwise *undetectable* errors that ZFS catches and they don't. 

I found it helpful to my own understanding to try restating the above
in my own words.  Maybe others will as well.

If my assumptions are correct about how these unrecoverable disk errors
are manifested, then a dumb scrubber will find such errors by simply
trying to read everything on disk -- no additional checksum is required.
Without some form of parity or replication, the data is lost, but at
least somebody will know about it.
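
As a crude sketch of such a dumb scrubber (device name below is just an
example), simply reading the whole raw device will surface any sectors the
drive can no longer return:

  # read every block and throw the data away; the only point is to make
  # the drive attempt -- and possibly fail -- each read
  dd if=/dev/rdsk/c1t0d0s2 of=/dev/null bs=1024k

Any I/O errors dd reports mark data the drive has lost; without parity or
a replica it stays lost, but at least it is now known.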

Now it seems to me that without parity/replication, there's not much
point in doing the scrubbing, because you could just wait for the error
to be detected when someone tries to read the data for real.  It's
only if you can repair such an error (before the data is needed) that
such scrubbing is useful.

For those well-versed in this stuff, apologies for stating the obvious.

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread Ricardo M. Correia




Steve McKinty wrote:

  1) First issue relates to the überblock.  Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity.  That could in theory result in a corrupt überblock at the
secondary. 

Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?
  


Yes, ZFS uberblocks are self-checksummed with SHA-256 and when opening
the pool it uses the latest valid uberblock that it can find. So that
is not a problem.


  2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc. 

Normally most replication systems do maintain write ordering, [i]except[/i] for one specific scenario.  If the replication is interrupted, for example secondary site down or unreachable due to a comms problem, the primary site will keep a list of changed blocks.  When contact between the sites is re-established there will be a period of 'catch-up' resynchronization.  In most, if not all, cases this is done on a simple block-order basis.  Write-ordering is lost until the two sites are once again in sync and routine replication restarts. 

I can see this as having major ZFS impact.  It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied.  This breaks the assumption that the on-disk format is always self-consistent.

If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?
  


I believe your understanding is correct. If you expect such a
double-failure, you cannot rely on being able to recover your pool at
the secondary site.

The newest uberblocks would be among the first blocks to be replicated
(2 of the uberblock arrays are situated at the start of the vdev) and
your whole block tree might be inaccessible if the latest Meta Object
Set blocks were not also replicated. You might be lucky and be able to
mount your filesystems because ZFS keeps 3 separate copies of the most
important metadata and tries to keep the copies about 1/8th of the disk
apart, but even then I wouldn't count on it.

If ZFS can't open the pool due to this kind of corruption, you would
get the following message:

status: The pool metadata is corrupted and the pool cannot be
opened.
action: Destroy and re-create the pool from a backup source.

At this point, you could try zeroing out the first 2 uberblock
arrays so that ZFS tries using an older uberblock from the last 2
arrays, but this might not work. As the message says, the only reliable
way to recover from this is restoring your pool from backups.
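
(If you do end up in that situation, one way to see what actually made it
across -- the device path below is just an example -- is to dump the four
vdev labels with zdb:

  # prints LABEL 0 through LABEL 3 for the device; labels that cannot be
  # read or unpacked show which copies the partial resync left behind
  zdb -l /dev/dsk/c2t0d0s0

That won't recover anything by itself, but it does tell you whether the
front and back label copies still agree.)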


  There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting resync, as a fallback.  This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshot of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies. 
  


If the replication process was interrupted for a sufficiently long time
and disaster strikes at the primary site *during resync*, I don't think
snapshots would save you even if you had taken them at the right time.
Snapshots might increase your chances of recovery (by making ZFS not
free and reuse blocks), but AFAIK there wouldn't be any guarantee that
you'd be able to recover anything whatsoever since the most important
pool metadata is not part of the snapshots.

Regards,
Ricardo

-- 
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.

2007-12-13 Thread Jorgen Lundman

NOC staff couldn't reboot it after the quotacheck crash, and I only just 
got around to going to the Datacenter.  This time I disabled NFS, and 
the rsync that was running, and ran just quotacheck and it completed 
successfully. The reason it didn't boot was that damned boot-archive 
again. Seriously!

Anyway, I did get a vmcore from the crash, but maybe it isn't so 
interesting. I will continue with the stress testing of UFS on zpool as 
it is the only solution that would be acceptable. Not given up yet, I 
have a few more weeks to keep trying. :)



-rw-r--r--   1 root root 2345863 Dec 14 09:57 unix.0
-rw-r--r--   1 root root 4741623808 Dec 14 10:05 vmcore.0

bash-3.00# adb -k unix.0 vmcore.0
physmem 3f9789
$c
top_end_sync+0xcb(ff0a5923d000, ff001f175524, b, 0)
ufs_fsync+0x1cb(ff62e757ad80, 1, fffedd6d2020)
fop_fsync+0x51(ff62e757ad80, 1, fffedd6d2020)
rfs3_setattr+0x3a3(ff001f1757c8, ff001f1758b8, ff1a0d942080,
ff001f175b20, fffedd6d2020)
common_dispatch+0x444(ff001f175b20, ff0a5a4baa80, 2, 4, 
f7c7ea78
, c06003d0)
rfs_dispatch+0x2d(ff001f175b20, ff0a5a4baa80)
svc_getreq+0x1c6(ff0a5a4baa80, fffec7eda6c0)
svc_run+0x171(ff62becb72a0)
svc_do_run+0x85(1)
nfssys+0x748(e, fecf0fc8)
sys_syscall32+0x101()


BAD TRAP: type=e (#pf Page fault) rp=ff001f175320 addr=0 occurred in 
module
unknown due to a NULL pointer dereference





-- 
Jorgen Lundman   | [EMAIL PROTECTED]
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.

2007-12-13 Thread Shawn Ferry
Jorgen,

You may want to try running 'bootadm update-archive'

Assuming that your boot-archive problem is an 'out of date boot-archive'
message at boot; and/or try doing a clean reboot to let the system
write an up-to-date boot-archive.

I would also encourage you to connect the LOM to the network in case you
have such issues again; that way you should be able to recover remotely.

Shawn

On Dec 13, 2007, at 10:33 PM, Jorgen Lundman wrote:


 NOC staff couldn't reboot it after the quotacheck crash, and I only  
 just
 got around to going to the Datacenter.  This time I disabled NFS, and
 the rsync that was running, and ran just quotacheck and it completed
 successfully. The reason it didn't boot was that damned boot-archive
 again. Seriously!

 Anyway, I did get a vmcore from the crash, but maybe it isn't so
 interesting. I will continue with the stress testing of UFS on zpool  
 as
 it is the only solution that would be acceptable. Not given up yet, I
 have a few more weeks to keep trying. :)



 -rw-r--r--   1 root root 2345863 Dec 14 09:57 unix.0
 -rw-r--r--   1 root root 4741623808 Dec 14 10:05 vmcore.0

 bash-3.00# adb -k unix.0 vmcore.0
 physmem 3f9789
 $c
 top_end_sync+0xcb(ff0a5923d000, ff001f175524, b, 0)
 ufs_fsync+0x1cb(ff62e757ad80, 1, fffedd6d2020)
 fop_fsync+0x51(ff62e757ad80, 1, fffedd6d2020)
 rfs3_setattr+0x3a3(ff001f1757c8, ff001f1758b8,  
 ff1a0d942080,
 ff001f175b20, fffedd6d2020)
 common_dispatch+0x444(ff001f175b20, ff0a5a4baa80, 2, 4,
 f7c7ea78
 , c06003d0)
 rfs_dispatch+0x2d(ff001f175b20, ff0a5a4baa80)
 svc_getreq+0x1c6(ff0a5a4baa80, fffec7eda6c0)
 svc_run+0x171(ff62becb72a0)
 svc_do_run+0x85(1)
 nfssys+0x748(e, fecf0fc8)
 sys_syscall32+0x101()


 BAD TRAP: type=e (#pf Page fault) rp=ff001f175320 addr=0  
 occurred in
 module
 unknown due to a NULL pointer dereference





 -- 
 Jorgen Lundman   | [EMAIL PROTECTED]
 Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
 Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
 Japan| +81 (0)3 -3375-1767  (home)
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Shawn Ferry  shawn.ferry at sun.com
Senior Primary Systems Engineer
Sun Managed Operations
571.291.4898





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Anton B. Rang
 I could use a little clarification on how these unrecoverable disk errors
 behave -- or maybe a lot, depending on one's point of view.
 
 So, when one of these once in around ten (or 100) terabytes read events
 occurs, my understanding is that a read error is returned by the drive,
 and the corresponding data is lost as far as the drive is concerned.

Yes -- the data being one or more disk blocks.  (You can't lose a smaller
amount of data, from the drive's point of view, since the error correction
code covers the whole block.)

 If my assumptions are correct about how these unrecoverable disk errors
 are manifested, then a dumb scrubber will find such errors by simply
 trying to read everything on disk -- no additional checksum is required.
 Without some form of parity or replication, the data is lost, but at
 least somebody will know about it.

Right.  Generally if you have replication and scrubbing, then you'll also
re-write any data which was found to be unreadable, thus fixing the
problem (and protecting yourself against future loss of the second copy).
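
In ZFS terms (the pool name below is just an example) that detect-and-repair
combination is simply:

  zpool scrub tank
  zpool status -v tank    # scrub result, plus any files with permanent errors

The scrub reads every copy, and anything unreadable or failing its checksum
gets rewritten from the surviving replica or parity.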

 Now it seems to me that without parity/replication, there's not much
 point in doing the scrubbing, because you could just wait for the error
 to be detected when someone tries to read the data for real.  It's
 only if you can repair such an error (before the data is needed) that
 such scrubbing is useful.

Pretty much, though if you're keeping backups, you could recover the
data from backup at this point. Of course, backups could be considered
a form of replication, but most of us in file systems don't think of them
that way.

Anton
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.

2007-12-13 Thread Jorgen Lundman


Shawn Ferry wrote:
 Jorgen,
 
 You may want to try running 'bootadm update-archive'
 
 Assuming that your boot-archive problem is an 'out of date boot-archive'
 message at boot; and/or try doing a clean reboot to let the system
 write an up-to-date boot-archive.

Yeah, it is remembering to do so after something has changed that's 
hard. In this case, I had to break the mirror to install OpenSolaris. 
(shame that the CD/DVD, and miniroot, don't have the md driver).

It would be tempting to add bootadm update-archive to the boot 
process, as I would rather have it come up half-assed than not come up 
at all.
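
Short of hacking the boot scripts, a habit that at least narrows the window 
(nothing here beyond the standard bootadm subcommands) is to refresh and 
sanity-check the archive by hand after any change and before a planned reboot:

  # rebuild the boot archive from the current root, then list what it
  # contains as a quick check that the rebuild actually took
  bootadm update-archive
  bootadm list-archive | head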

And yes, other servers are on remote access, but since this was a temporary 
trial, we only ran 1 network cable and 2x 200V cables. Should have done
a proper job at the start, I guess.

This time I made sure it was reboot-safe :)

Lund


 
 I would also encourage you to connect the LOM to the network in case you
 have such issues again; that way you should be able to recover remotely.
 
 Shawn
 
 On Dec 13, 2007, at 10:33 PM, Jorgen Lundman wrote:
 
 NOC staff couldn't reboot it after the quotacheck crash, and I only  
 just
 got around to going to the Datacenter.  This time I disabled NFS, and
 the rsync that was running, and ran just quotacheck and it completed
 successfully. The reason it didn't boot was that damned boot-archive
 again. Seriously!

 Anyway, I did get a vmcore from the crash, but maybe it isn't so
 interesting. I will continue with the stress testing of UFS on zpool  
 as
 it is the only solution that would be acceptable. Not given up yet, I
 have a few more weeks to keep trying. :)



 -rw-r--r--   1 root root 2345863 Dec 14 09:57 unix.0
 -rw-r--r--   1 root root 4741623808 Dec 14 10:05 vmcore.0

 bash-3.00# adb -k unix.0 vmcore.0
 physmem 3f9789
 $c
 top_end_sync+0xcb(ff0a5923d000, ff001f175524, b, 0)
 ufs_fsync+0x1cb(ff62e757ad80, 1, fffedd6d2020)
 fop_fsync+0x51(ff62e757ad80, 1, fffedd6d2020)
 rfs3_setattr+0x3a3(ff001f1757c8, ff001f1758b8,  
 ff1a0d942080,
 ff001f175b20, fffedd6d2020)
 common_dispatch+0x444(ff001f175b20, ff0a5a4baa80, 2, 4,
 f7c7ea78
 , c06003d0)
 rfs_dispatch+0x2d(ff001f175b20, ff0a5a4baa80)
 svc_getreq+0x1c6(ff0a5a4baa80, fffec7eda6c0)
 svc_run+0x171(ff62becb72a0)
 svc_do_run+0x85(1)
 nfssys+0x748(e, fecf0fc8)
 sys_syscall32+0x101()


 BAD TRAP: type=e (#pf Page fault) rp=ff001f175320 addr=0  
 occurred in
 module
 unknown due to a NULL pointer dereference





 -- 
 Jorgen Lundman   | [EMAIL PROTECTED]
 Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
 Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
 Japan| +81 (0)3 -3375-1767  (home)
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 --
 Shawn Ferry  shawn.ferry at sun.com
 Senior Primary Systems Engineer
 Sun Managed Operations
 571.291.4898
 
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 

-- 
Jorgen Lundman   | [EMAIL PROTECTED]
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
...

  Now it seems to me that without parity/replication,
 there's not much
  point in doing the scrubbing, because you could
 just wait for the error
  to be detected when someone tries to read the data
 for real.  It's
  only if you can repair such an error (before the
 data is needed) that
  such scrubbing is useful.
 
 Pretty much

I think I've read (possibly in the 'MAID' descriptions) the contention that at 
least some unreadable sectors get there in stages, such that if you catch them 
early they will be only difficult to read rather than completely unreadable.  
In such a case, scrubbing is worthwhile even without replication, because it 
finds the problem early enough that the disk itself (or higher-level mechanisms 
if the disk gives up but the higher level is more persistent) will revector the 
sector when it finds it difficult (but not impossible) to read.
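
One low-effort way to watch for that early stage on Solaris is the per-device 
error counters:

  # Soft Errors / Hard Errors / Transport Errors, per device
  iostat -En

A climbing Soft Errors count on a drive is often exactly that 
difficult-but-not-yet-impossible-to-read phase, i.e. the cue to act before 
the sector becomes completely unreadable.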

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Will Murnane
On Dec 14, 2007 1:12 AM, can you guess? [EMAIL PROTECTED] wrote:
  yes.  far rarer and yet home users still see them.

 I'd need to see evidence of that for current hardware.
What would constitute evidence?  Do anecdotal tales from home users
qualify?  I have two disks (and one controller!) that generate several
checksum errors per day each.  I've also seen intermittent checksum
fails that go away once all the cables are wiggled.

 Unlikely, since transfers over those connections have been protected by 
 32-bit CRCs since ATA busses went to 33 or 66 MB/sec. (SATA has even stronger 
 protection)
The ATA/7 spec specifies a 32-bit CRC (older ones used a 16-bit CRC)
[1].  The Serial ATA protocol also specifies 32-bit CRCs beneath 8b/10b
coding (1.0a p. 159)[2].  That's not much stronger at all.

Will

[1] http://www.t10.org/t13/project/d1532v3r4a-ATA-ATAPI-7.pdf
[2] http://www.ece.umd.edu/courses/enee759h.S2003/references/serialata10a.pdf
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss