Re: [zfs-discuss] Yager on ZFS
Hello can, Thursday, December 13, 2007, 12:02:56 AM, you wrote:

cyg> On the other hand, there's always the possibility that someone else learned something useful out of this. And my question about

To be honest - there's basically nothing useful in the thread, perhaps except one thing - it doesn't make any sense to listen to you. You're just unable to talk to people.

-- Best regards, Robert
http://milek.blogspot.com
[zfs-discuss] Auto backup and auto restore of ZFS via Firewire drive
Hey folks, This may not be the best place to ask this, but I'm so new to Solaris I really don't know anywhere better. If anybody can suggest a better forum I'm all ears :)

I've heard of Tim Foster's autobackup utility, which can automatically back up a ZFS filesystem to a USB drive as it's connected. What I'd like to know is whether there's a way of doing something similar for a Firewire drive. Also, is it possible to run several backups at once onto that drive? Is there any way I can script a bunch of commands to run as the drive is connected?

thanks, Ross
[zfs-discuss] ZFS NAS Cluster
Dear All, First of all, thanks for a fascinating list - it's my first read of the morning. Secondly, I would like to ask a question.

We currently have an EMC Celerra NAS which we use for CIFS, NFS and iSCSI. It's not our favourite piece of hardware and it is nearing the limits of its capacity (TB). We have two options: 1) Expand the solution. Spend £££s, double the number of heads, double the capacity and carry on as before. 2) Look for something else.

I have been watching ZFS for some time and have implemented it in several niche applications. I would like to be able to consider using ZFS as the basis of a NAS solution based around SAN storage, T{2,5}000 servers and Sun Cluster. Here is my wish list:

* Flexible provisioning (thin if possible)
* Hardware resilience/transparent failover
* Asynchronous replication to a remote site (1km) providing DR cover
* NFS/CIFS/iSCSI
* Snaps/cloning
* No single point of failure
* Integration with Active Directory/NFS
* Ability to restripe data onto widened pools
* Ability to migrate data between storage pools

As I understand it, the combination of ZFS and Sun Cluster will give me all of the above. Has anybody done this? How mature/stable is it? I understand that Sun Cluster/HA-ZFS is supported, but there seems to be little that I can find on the web about it. Any information would be gratefully received.

Best Regards, Vic

-- Vic Cornell, UNIX Systems Administrator, Landmark Information Group Limited
www.landmarkinfo.co.uk
Re: [zfs-discuss] Nice chassis for ZFS server
> this anti-raid-card movement is puzzling.

I think you've misinterpreted my questions. I queried the necessity of paying extra for a seemingly unnecessary RAID card for ZFS. I didn't doubt that it could perform better. Wasn't one of the design briefs of ZFS that it would provide its feature set without expensive RAID hardware? Of course, if you have the money then you can always go faster, but this is a ZFS discussion thread (I know I've perpetuated the extravagant cross-posting of the OP). Cheers.
Re: [zfs-discuss] mirror a slice
What are the commands? Everything I see is c1t0d0, c1t1d0 - no slices, just the whole disk.

Robert Milkowski wrote:
Hello Shawn, Thursday, December 13, 2007, 3:46:09 PM, you wrote:
SJ> Is it possible to bring one slice of a disk under ZFS control and leave the others as UFS?
SJ> A customer is trying to mirror one slice using ZFS.
Yes, it is - it just works.
Re: [zfs-discuss] Yager on ZFS
People.. for the n-teenth time, there are only two ways to kill a troll. One involves a woodchipper and the possibility of an unwelcome visit from the FBI, and the other involves ignoring them.

Internet Trolls:
http://en.wikipedia.org/wiki/Internet_troll
http://www.linuxextremist.com/?p=34

Another perspective: http://sc.tri-bit.com/images/7/7e/greaterinternetfu#kwadtheory.jpg

The irony of this whole thing is that by feeding Bill's trollish tendencies, he has effectively eliminated himself from any job or contract where someone googles his name, which will give him an enormous amount of time to troll forums. Who in their right mind would consciously hire someone who calls people idiots at random to avoid the topic at hand? Being unemployed will just piss him off more and his trolling will only get worse. Hence, you don't feed trolls!!
[zfs-discuss] mirror a slice
Is it possible to bring one slice of a disk under ZFS control and leave the others as UFS? A customer is trying to mirror one slice using ZFS. Please respond to me directly and to the alias. Thanks, Shawn
Re: [zfs-discuss] mirror a slice
Shawn,

Using slices for ZFS pools is generally not recommended, so I think we minimized any command examples with slices:

# zpool create tank mirror c1t0d0s0 c1t1d0s0

Keep in mind that using slices from the same disk for both UFS and ZFS makes administration more complex. Please see the ZFS BP section here:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pools

* The recovery process of replacing a failed disk is more complex when disks contain both ZFS and UFS file systems on slices.
* ZFS pools (and underlying disks) that also contain UFS file systems on slices cannot be easily migrated to other systems by using zpool import and export features.
* In general, maintaining slices increases administration time and cost. Lower your administration costs by simplifying your storage pool configuration model.

Cindy

Shawn Joy wrote:
What are the commands? Everything I see is c1t0d0, c1t1d0 - no slices, just the whole disk.

Robert Milkowski wrote:
Hello Shawn, Thursday, December 13, 2007, 3:46:09 PM, you wrote:
SJ> Is it possible to bring one slice of a disk under ZFS control and leave the others as UFS?
SJ> A customer is trying to mirror one slice using ZFS.
Yes, it is - it just works.
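For the customer scenario above - existing data on one slice that now needs a mirror - the attach form is also useful, since a mirror can be added to an existing single-device pool rather than created up front. A minimal sketch, assuming a hypothetical pool named tank that currently sits on c1t0d0s0 and a same-sized free slice c1t1d0s0 on the second disk:

# zpool attach tank c1t0d0s0 c1t1d0s0
# zpool status tank        (wait for the resilver to complete)

zpool attach adds c1t1d0s0 as a mirror of c1t0d0s0 and resilvers the existing data onto it; the pool stays online throughout.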
Re: [zfs-discuss] Yager on ZFS
Hello can, Thursday, December 13, 2007, 12:02:56 AM, you wrote:
cyg> On the other hand, there's always the possibility that someone else learned something useful out of this. And my question about

To be honest - there's basically nothing useful in the thread, perhaps except one thing - it doesn't make any sense to listen to you.

I'm afraid you don't qualify to have an opinion on that, Robert - because you so obviously *haven't* really listened. Until it became obvious that you never would, I was willing to continue to attempt to carry on a technical discussion with you, while ignoring the morons here who had nothing whatsoever in the way of technical comments to offer (but continued to babble on anyway).

- bill
Re: [zfs-discuss] Finding external USB disks
You may want to peek here first. Tim has some scripts already, and if they're not exactly what you want, I'm sure they could be reverse-engineered.

http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people

Eric
Re: [zfs-discuss] mirror a slice
On 13-Dec-07, at 1:56 PM, Shawn Joy wrote:
What are the commands? Everything I see is c1t0d0, c1t1d0 - no slices, just the whole disk.

I have used the following HOWTO. (Markup is TWiki, FWIW.) Device names are for a 2-drive X2100. Other machines may differ; for example, X4100 drives may be =c3t2d0= and =c3t3d0=.

---++ Partitioning

This is done before installing Solaris 10, or after installing a new disk to replace a failed mirror disk.

   * Run *format*, choose the correct disk device
   * Enter *fdisk* from the menu
   * Delete any diagnostic partition, and any existing Solaris partition
   * Create one Solaris2 partition over 100% of the disk
   * Exit *fdisk*; quit *format*

---++ Slice layout

|slice 0| root| 8192M| -- this is not really large enough :-)
|slice 1| swap| 2048M|
|slice 2| -| |
|slice 3| SVM metadb| 16M|
|slice 4| zfs| 68200M|
|slice 5| SVM metadb| 16M|
|slice 6| -| |
|slice 7| SVM metadb| 16M|

The final slice layout should be saved using =prtvtoc /dev/rdsk/c1d0s2 > vtoc=

The second (mirror) disk can be forced into the same layout using =fmthard -s vtoc /dev/rdsk/c2d0s2=

(Replacement drives must be partitioned in exactly the same way, so it is recommended that a copy of the vtoc be kept in a file.)

GRUB must also be installed on the second disk:
=/sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c2d0s0=

---++ Solaris Volume Manager setup

The root and swap slices will be mirrored using SVM. See:
   * http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#UFS.2FSVM
   * http://sunsolve.sun.com/search/document.do?assetkey=1-9-83605-1

(As of Sol10U2 (June 06), ZFS is not supported for the root partition.)

At this point the system has been installed on, and booted from, the first disk, c1d0s0 (as root) and with swap from the same disk. The following steps set up SVM but don't interfere with currently mounted partitions. The second disk has already been partitioned identically to the first, and the data will be copied to the mirror after =metattach= below. Changing =/etc/vfstab= sets the machine to boot from the SVM mirror device in future.

   * Create SVM metadata (slice 3) with redundant copies on slices 5 and 7: %BR% =metadb -a -f c1d0s3 c2d0s3 c1d0s5 c2d0s5 c1d0s7 c2d0s7=
   * Create submirrors on first disk (root and swap slices): %BR% =metainit -f d10 1 1 c1d0s0= %BR% =metainit -f d11 1 1 c1d0s1=
   * Create submirrors on second disk: %BR% =metainit -f d20 1 1 c2d0s0= %BR% =metainit -f d21 1 1 c2d0s1=
   * Create the mirrors: %BR% =metainit d0 -m d10= %BR% =metainit d1 -m d11=
   * Take a backup copy of =/etc/vfstab=
   * Define root slice: =metaroot d0= (this alters the mount device for / in =/etc/vfstab=; it should now be =/dev/md/dsk/d0=)
   * Edit =/etc/vfstab= (changing the device for swap to =/dev/md/dsk/d1=)
   * Reboot to test. If there is a problem, use single user mode and revert vfstab. Confirm that root and swap devices are now the mirrored devices with =df= and =swap -l=
   * Attach second halves to mirror: %BR% =metattach d0 d20= %BR% =metattach d1 d21=

Mirror will now begin to sync; progress can be checked with =metastat -c=

---+++ Also see

   * [[http://slacksite.com/solaris/disksuite/disksuite.html recipe]] at slacksite.com

---++ ZFS setup

Slice 4 is set aside for the ZFS pool - the system's active data.

   * Create pool: =zpool create pool mirror c1d0s4 c2d0s4=
   * Create filesystem for home directories: =zfs create pool/home= %BR% (To make this active, move any existing home directories from =/home= into =/pool/home=; then =zfs set mountpoint=/home pool/home=; log out; and log back in.)
   * Set up regular scrub - add to =crontab= a line such as: =0 4 1 * * zpool scrub pool=

<verbatim>
bash-3.00# zpool create pool mirror c1d0s4 c2d0s4
bash-3.00# zpool status
  pool: pool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s4  ONLINE       0     0     0
            c2d0s4  ONLINE       0     0     0

errors: No known data errors
bash-3.00# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
pool  75.5K  65.5G  24.5K  /pool
bash-3.00#
</verbatim>

---++ References

   * [[http://docs.sun.com/app/docs/doc/819-5461 ZFS Admin Guide]]
   * [[http://docs.sun.com/app/docs/doc/816-4520 SVM Admin Guide]]

Robert Milkowski wrote:
Hello Shawn, Thursday, December 13, 2007, 3:46:09 PM, you wrote:
SJ> Is it possible to bring one slice of a disk under ZFS control and leave the others as UFS?
SJ> A customer is trying to mirror one slice using ZFS.
Yes, it is - it just works.
Re: [zfs-discuss] Yager on ZFS
Would you two please SHUT THE F$%K UP. Dear God, my kids don't go on like this. Please - let it die already. Thanks very much. /jim

can you guess? wrote:
Hello can, Thursday, December 13, 2007, 12:02:56 AM, you wrote:
cyg> On the other hand, there's always the possibility that someone else learned something useful out of this. And my question about
To be honest - there's basically nothing useful in the thread, perhaps except one thing - it doesn't make any sense to listen to you.
I'm afraid you don't qualify to have an opinion on that, Robert - because you so obviously *haven't* really listened. Until it became obvious that you never would, I was willing to continue to attempt to carry on a technical discussion with you, while ignoring the morons here who had nothing whatsoever in the way of technical comments to offer (but continued to babble on anyway).
- bill
[zfs-discuss] ZIL and snapshots
I'm using an x4500 as a large data store for our VMware environment. I have mirrored the first 2 disks, and created a ZFS pool of the other 46: 22 pairs of mirrors, and 2 spares (optimizing for random I/O performance rather than space). Datasets are shared to the VMware ESX servers via NFS.

We noticed that VMware mounts its NFS datastore with the SYNC option, so every NFS write gets flagged with FILE_SYNC. In testing, synchronous writes are significantly slower than async, presumably because of the strict ordering required for correctness (cache flushing and ZIL).

Can anyone tell me if a ZFS snapshot taken when zil_disable=1 will be crash-consistent with respect to the data written by VMware? Are the snapshot metadata updates serialized with pending non-metadata writes? If an asynchronous write is issued before the snapshot is initiated, is it guaranteed to be in the snapshot data, or can it be reordered to after the snapshot? Does a snapshot flush pending writes to disk?

To increase performance, the users are willing to lose an hour or two of work (these are development/QA environments): in the event that the x4500 crashes and loses the 16GB of cached (zil_disable=1) writes, we roll back to the last hourly snapshot, and everyone's back to the way they were. However, I want to make sure that we will be able to boot a crash-consistent VM from that rolled-back virtual disk.

Thanks for any knowledge you might have, --Joe
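For reference, a sketch of how zil_disable was typically toggled on builds of that era - it is an unsupported, system-wide tunable, and the exact way the poster set it is an assumption rather than something stated in the message:

# echo "zil_disable/W0t1" | mdb -kw        (set on a live system; only affects filesystems mounted after the change)
set zfs:zil_disable = 1                    (the equivalent /etc/system line, applied at the next boot)

Either way, every filesystem on the host stops honouring synchronous write semantics, which is exactly why the hourly-snapshot-plus-rollback plan described above is needed as a safety net.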
[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
I have a couple of questions and concerns about using ZFS in an environment where the underlying LUNs are replicated at a block level using products like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted the explanation to be clear. (I do realise that there are other possibilities such as zfs send/recv and there are technical and business pros and cons for the various options. I don't want to start a 'which is best' argument :) )

The CoW design of ZFS means that it goes to great lengths to always maintain on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. not needing fsck) based on that. This is the basis of my questions.

1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary. Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc. Normally most replication systems do maintain write ordering, [i]except[/i] for one specific scenario. If the replication is interrupted, for example secondary site down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.

I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent. If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

Obviously all filesystems can suffer with this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.

There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshot of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

Thanks
Steve
Re: [zfs-discuss] Nice chassis for ZFS server
MP wrote:
> this anti-raid-card movement is puzzling.
> I think you've misinterpreted my questions. I queried the necessity of paying extra for a seemingly unnecessary RAID card for ZFS. I didn't doubt that it could perform better. Wasn't one of the design briefs of ZFS that it would provide its feature set without expensive RAID hardware?

In general, feature set != performance. For example, a VIA x86-compatible processor is not capable of beating the performance of a high-end Xeon, though the feature sets are largely the same. Additional examples abound.
-- richard
Re: [zfs-discuss] How to properly tell zfs of new GUID controller numbers after a firmware upgrade changes the IDs
Jill,

I was recently looking for a similar solution to try and reconnect a renumbered device while the pool was live, e.g.

zpool online mypool <old target> <old target at new location>

As in zpool replace, but with the indication that this isn't a new device. What I have been doing to deal with the renumbering is exactly the export, import and clear, although I have been dealing with significantly smaller devices and can't speak to the delay issues.

Shawn

On Dec 13, 2007, at 12:16 PM, Jill Manfield wrote:

My customer's zfs pools and their 6540 disk array had a firmware upgrade that changed GUIDs, so we need a procedure to let zfs know it changed. They are getting errors as if they replaced drives. But I need to make sure you know they have not replaced any drives, and no drives have failed or are bad. As such, they have no interest in wiping any disks clean as indicated in the 88130 info doc.

Some background from the customer:

We have a large 6540 disk array, on which we have configured a series of large RAID luns. A few days ago, Sun sent a technician to upgrade the firmware of this array, which worked fine but which had the deleterious effect of changing the Volume IDs associated with each lun. So, the resulting luns now appear to our solaris 10 host (under mpxio) as disks in /dev/rdsk with different 'target' components than they had before.

Before the firmware upgrade we took the precaution of creating duplicate luns on a different 6540 disk array, and using these to mirror each of our zfs pools (as protection in case the firmware upgrade corrupted our luns). Now, we simply want to ask zfs to find the devices under their new targets, recognize that they are existing zpool components, and have it correct the configuration of each pool. This would be similar to having Veritas vxvm re-scan all disks with vxconfigd in the event of a controller renumbering event.

The proper zfs method for doing this, I believe, is to simply do:

zpool export mypool
zpool import mypool

Indeed, this has worked fine for me a few times today, and several of our pools are now back to their original mirrored configuration. Here is a specific example, for the pool ospf. The zpool status after the upgrade:

diamond:root[1105]-zpool status ospf
  pool: ospf
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist
        for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Tue Dec 11 18:26:53 2007
config:

        NAME                                    STATE     READ WRITE CKSUM
        ospf                                    DEGRADED     0     0     0
          mirror                                DEGRADED     0     0     0
            c27t600A0B8000292B024BDC4731A7B8d0  UNAVAIL      0     0     0  cannot open
            c27t600A0B800032619A093747554A08d0  ONLINE       0     0     0

errors: No known data errors

This is due to the fact that the LUN which used to appear as c27t600A0B8000292B024BDC4731A7B8d0 is now actually c27t600A0B8000292B024D5B475E6E90d0. It's the same LUN, but since the firmware changed the Volume ID, the target portion is different. Rather than treating this as a replaced disk (which would incur an entire mirror resilvering, and would require the trick you sent of obliterating the disk label so the "in use" safeguard could be avoided), we simply want to ask zfs to re-read its configuration to find this disk. So we do this:

diamond:root[1110]-zpool export -f ospf
diamond:root[]-zpool import ospf

and sure enough:

diamond:root[1112]-zpool status ospf
  pool: ospf
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 0.16% done, 2h53m to go
config:

        NAME                                    STATE     READ WRITE CKSUM
        ospf                                    ONLINE       0     0     0
          mirror                                ONLINE       0     0     0
            c27t600A0B8000292B024D5B475E6E90d0  ONLINE       0     0     0
            c27t600A0B800032619A093747554A08d0  ONLINE       0     0     0

errors: No known data errors

(Note that it has self-initiated a resilvering, since in this case the mirror has been changed by users since the firmware upgrade.)

The problem that Robert had was that when he initiated an export of a pool (called bgp) it froze for quite some time. The corresponding import of the same pool took 12 hours to complete. I have not been able to replicate this myself, but that was the essence of
Re: [zfs-discuss] Finding external USB disks
On Wed, 2007-12-12 at 21:35 -0600, David Dyer-Bennet wrote:
> What are the approaches to finding what external USB disks are currently connected?

Would rmformat -l or eject -l fit the bill?

> The external USB backup disks in question have ZFS filesystems on them, which may make a difference in finding them perhaps?

Nice. I dug around a bit with this a while back, and I'm not sure hal and friends are doing the right thing with zpools on removable devices just yet. I'd expect that we'd have a zpool import triggered on a device being plugged, analogous to the way we have pcfs disks automatically mounted by the system. Indeed there's

/usr/lib/hal/hal-storage-zpool-export
/usr/lib/hal/hal-storage-zpool-import

and /etc/hal/fdi/policy/10osvendor/20-zfs-methods.fdi, but I haven't seen them actually doing anything useful when I insert a disk with a pool on it. Does anyone know whether these should be working now? I'm not a hal expert...

> I've glanced at Tim Foster's autobackup and related scripts, and they're all about being triggered by the plug connection being made; which is not what I need.

Yep, fair enough.

> I don't actually want to start the big backup when I plug in (or power on) the drive in the evening, it's supposed to wait until late (to avoid competition with users). (His autosnapshot script may be just what I need for that part, though.)

The zfs-auto-snapshot service can perform a backup using a command set in the zfs/backup-save-cmd property. Setting that to be a script that automagically selects a USB device (from a known list, or one with free space?) and points the stream at a relevant zfs recv command to the pool provided by your backup device might be just what you're after. Perhaps this is a project for the Christmas holidays :-)

cheers, tim
--
Tim Foster, Sun Microsystems Inc, Solaris Engineering Ops
http://blogs.sun.com/timf
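A minimal sketch of what such a backup-save-cmd target script might look like - the pool name and layout are made-up placeholders, not something from the zfs-auto-snapshot documentation; the only assumption is that the service pipes a zfs send stream to the command's stdin:

#!/bin/ksh
# Hypothetical zfs/backup-save-cmd target: receive the incoming snapshot
# stream into a pool that lives on an external USB/Firewire disk.
BACKUPPOOL=usbbackup

# Import the backup pool if it is not already available; fail loudly if
# the disk is absent so the service can report the error.
zpool list $BACKUPPOOL >/dev/null 2>&1 || zpool import $BACKUPPOOL || exit 1

# The auto-snapshot service writes the send stream to our stdin;
# -d derives the destination dataset name from the stream itself.
zfs receive -d $BACKUPPOOL

Exiting non-zero when the disk is not connected avoids silently dropping the stream, which matters if the backup runs unattended late at night as David describes.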
Re: [zfs-discuss] mirror a slice
[EMAIL PROTECTED] wrote:
> Shawn,
> Using slices for ZFS pools is generally not recommended, so I think we minimized any command examples with slices:
> # zpool create tank mirror c1t0d0s0 c1t1d0s0

Cindy,

I think the term "generally not recommended" requires more context. In the case of a small system, particularly one which you would find on a laptop or desktop, it is often the case that disks share multiple purposes, beyond ZFS. I think the way we have written this in the best practices wiki is fine, but perhaps we should ask the group at large. Thoughts anyone?

I do like the minimization for the examples, though. If one were to actually read any of the manuals, we clearly talk about how whole disks or slices are fine. However, on occasion someone will propagate the news that ZFS only works with whole disks and we have to correct the confusion afterwards.
-- richard

> Keep in mind that using slices from the same disk for both UFS and ZFS makes administration more complex. Please see the ZFS BP section here:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pools
> * The recovery process of replacing a failed disk is more complex when disks contain both ZFS and UFS file systems on slices.
> * ZFS pools (and underlying disks) that also contain UFS file systems on slices cannot be easily migrated to other systems by using zpool import and export features.
> * In general, maintaining slices increases administration time and cost. Lower your administration costs by simplifying your storage pool configuration model.
> Cindy
Re: [zfs-discuss] Nice chassis for ZFS server
> Additional examples abound.

Doubtless :) More usefully, can you confirm whether Solaris works on this chassis without the RAID controller?
Re: [zfs-discuss] mirror a slice
On 13-Dec-07, at 3:54 PM, Richard Elling wrote:
> [EMAIL PROTECTED] wrote:
>> Shawn,
>> Using slices for ZFS pools is generally not recommended, so I think we minimized any command examples with slices:
>> # zpool create tank mirror c1t0d0s0 c1t1d0s0
> Cindy,
> I think the term "generally not recommended" requires more context. In the case of a small system, particularly one which you would find on a laptop or desktop, it is often the case that disks share multiple purposes, beyond ZFS.

In particular in a 2-disk system that boots from UFS (that was my situation). --Toby

> I think the way we have written this in the best practices wiki is fine, but perhaps we should ask the group at large. Thoughts anyone?
> I do like the minimization for the examples, though. If one were to actually read any of the manuals, we clearly talk about how whole disks or slices are fine. However, on occasion someone will propagate the news that ZFS only works with whole disks and we have to correct the confusion afterwards.
> -- richard
> Keep in mind that using slices from the same disk for both UFS and ZFS makes administration more complex. ...
Re: [zfs-discuss] What does dataset is busy actually mean?
I've hit this problem myself recently, and mounting the filesystem cleared something in the brains of ZFS and allowed me to snapshot.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg00812.html

PS: "I'll use Google before asking some questions", a'la (C) Bart Simpson. That's how I found your question ;)
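In other words, the workaround amounts to two commands - the dataset name below is a made-up placeholder, and mounting only helps if the filesystem was unmounted (or its mountpoint busy) when the "dataset is busy" error appeared:

# zfs mount tank/export/home
# zfs snapshot tank/export/home@nightly

If the mount itself fails, zfs mount -a or checking zfs get mounted,mountpoint tank/export/home may show why the dataset is stuck.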
Re: [zfs-discuss] Yager on ZFS
> Would you two please SHUT THE F$%K UP.

Just for future reference, if you're attempting to squelch a public conversation it's often more effective to use private email to do it rather than contribute to the continuance of that public conversation yourself. Have a nice day!

- bill
Re: [zfs-discuss] Nice chassis for ZFS server
> Are there benchmarks somewhere showing a RAID10 implemented on an LSI card with, say, 128MB of cache being beaten in terms of performance by a similar raidz configuration with no cache on the drive controller? Somehow I don't think they exist. I'm all for data scrubbing, but this anti-raid-card movement is puzzling.

Oh, for joy - a chance for me to say something *good* about ZFS rather than just try to balance out excessive enthusiasm.

Save for speeding up synchronous writes (if it has enough on-board NVRAM to hold them until it's convenient to destage them to disk), a RAID-10 card should not enjoy any noticeable performance advantage over ZFS mirroring.

By contrast, if extremely rare undetected and (other than via ZFS checksums) undetectable (or considerably more common undetected but detectable via disk ECC codes, *if* the data is accessed) corruption occurs, and the RAID card is used to mirror the data, there's a good chance that even ZFS's validation scans won't see the problem (because the card happens to access the good copy for the scan rather than the bad one) - in which case you'll lose that data if the disk with the good data fails. And in the case of (extremely rare) otherwise-undetectable corruption, if the card *does* return the bad copy then IIRC ZFS (not knowing that a good copy also exists) will just claim that the data is gone (though I don't know if it will then flag it such that you'll never have an opportunity to find the good copy).

If the RAID card scrubs its disks the difference (now limited to the extremely rare undetectable-via-disk-ECC corruption) becomes pretty negligible - but I'm not sure how many RAIDs below the near-enterprise category perform such scrubs.

In other words, if you *don't* otherwise scrub your disks then ZFS's checksums-plus-internal-scrubbing mechanisms assume greater importance: it's only the contention that other solutions that *do* offer scrubbing can't compete with ZFS in effectively protecting your data that's somewhat over the top.

- bill
Re: [zfs-discuss] Finding external USB disks
[EMAIL PROTECTED] said:
> What are the approaches to finding what external USB disks are currently connected? I'm starting on backup scripts, and I need to check which volumes are present before I figure out what to back up to them. I . . .

In addition to what others have suggested so far, cfgadm -l lists usb- and firewire-connected drives (even those plugged in but not mounted), so scripts can check that way as well.

Regards, Marion
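As a rough illustration of that idea - the attachment-point names and the grep pattern below are assumptions that vary by machine, so check the actual cfgadm -l output on the host before relying on them:

#!/bin/sh
# Proceed with the backup only if a USB storage attachment point is present.
if cfgadm -l | grep -i usb >/dev/null 2>&1; then
    echo "USB storage attachment point found - continuing backup"
else
    echo "backup disk not connected - skipping" >&2
    exit 1
fi

The same test works for a firewire drive by matching on its attachment-point prefix instead.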
Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Steve McKinty wrote:
> I have a couple of questions and concerns about using ZFS in an environment where the underlying LUNs are replicated at a block level using products like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted the explanation to be clear. (I do realise that there are other possibilities such as zfs send/recv and there are technical and business pros and cons for the various options. I don't want to start a 'which is best' argument :) )
> The CoW design of ZFS means that it goes to great lengths to always maintain on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. not needing fsck) based on that. This is the basis of my questions.
> 1) First issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.

The uberblock contains a circular queue of updates. For all practical purposes, this is COW. The updates I measure are usually 1 block (or, to put it another way, I don't recall seeing more than 1 block being updated... I'd have to recheck my data).

> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

The checksum should catch it. To be safe, there are 4 copies of the uberblock.

> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc. Normally most replication systems do maintain write ordering, [i]except[/i] for one specific scenario. If the replication is interrupted, for example secondary site down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
> I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent. If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

I think all of these reactions to the double-failure mode are possible. The version of ZFS used will also have an impact, as the later versions are more resilient. I think that in most cases, only the affected files will be impacted. zpool scrub will ensure that everything is consistent and mark those files which fail to checksum properly.

> Obviously all filesystems can suffer with this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.

...databases too... It might be easier to analyze this from the perspective of the transaction group than an individual file. Since ZFS is COW, you may have a state where a transaction group is incomplete, but the previous data state should be consistent.

> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshot of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

I don't see how snapshots would help. The inherent transaction group commits should be sufficient. Or, to look at this another way, a snapshot is really just a metadata change. I am more worried about how the storage admin sets up the LUN groups. The human factor can really ruin my day...
-- richard
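A concrete sketch of the verification step Richard refers to - the pool name is a placeholder:

# zpool scrub tank
# zpool status -v tank

zpool status reports scrub progress and, once the scrub completes, lists any files whose blocks failed their checksums and could not be repaired from a good replica.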
Re: [zfs-discuss] Nice chassis for ZFS server
On December 13, 2007 9:47:00 AM -0800 MP [EMAIL PROTECTED] wrote:
>> Additional examples abound.
> Doubtless :) More usefully, can you confirm whether Solaris works on this chassis without the RAID controller?

way back, i had Solaris working with a promise j200s (jbod sas) chassis, to the extent that the sas driver at the time worked. i can't IMAGINE why this chassis would be any different from Solaris' perspective.
-frank
Re: [zfs-discuss] Nice chassis for ZFS server
On December 13, 2007 11:34:54 AM -0800 can you guess? [EMAIL PROTECTED] wrote:
> By contrast, if extremely rare undetected and (other than via ZFS checksums) undetectable (or considerably more common undetected but detectable via disk ECC codes, *if* the data is accessed) corruption occurs, and the RAID card is used to mirror the data, there's a good chance that even ZFS's validation scans won't see the problem (because the card happens to access the good copy for the scan rather than the bad one) - in which case you'll lose that data if the disk with the good data fails. And in the case of (extremely rare) otherwise-undetectable corruption, if the card *does* return the bad copy then IIRC ZFS (not knowing that a good copy also exists) will just claim that the data is gone (though I don't know if it will then flag it such that you'll never have an opportunity to find the good copy).

i like this answer, except for what you are implying by extremely rare.

> If the RAID card scrubs its disks the difference (now limited to the extremely rare undetectable-via-disk-ECC corruption) becomes pretty negligible - but I'm not sure how many RAIDs below the near-enterprise category perform such scrubs. In other words, if you *don't* otherwise scrub your disks then ZFS's checksums-plus-internal-scrubbing mechanisms assume greater importance: it's only the contention that other solutions that *do* offer scrubbing can't compete with ZFS in effectively protecting your data that's somewhat over the top.

the problem with your discounting of zfs checksums is that you aren't taking into account that extremely rare is relative to the number of transactions, which are extremely high. in such a case even extremely rare errors do happen, and not just to extremely few folks, but i would say to all enterprises. hell it happens to home users. when the difference between an unrecoverable single bit error is not just 1 bit but the entire file, or corruption of an entire database row (etc), those small and infrequent errors are an extremely big deal.

considering all the pieces, i would much rather run zfs on a jbod than on a raid, wherever i could. it gives better data protection, and it is ostensibly cheaper.
-frank
[zfs-discuss] zpool version 3, Uberblock version 9 - zpool upgrade only half succeeded?
We are currently experiencing a very large performance drop on our ZFS storage server. We have 2 pools: stor is a raidz out of 7 iSCSI nodes, and home is a local mirror pool.

Recently we had some issues with one of the storage nodes, and because of that the pool was degraded. Since we did not succeed in bringing this storage node back online (at the zfs level) we upgraded our NAS head from opensolaris b57 to b77. After the upgrade we successfully resilvered the pool (the resilver took 1 week! - 14 TB). Finally we upgraded the pool to version 9 (coming from version 3).

Now the zpool is healthy again, but performance really s*cks. Accessing older data takes way too much time. Doing dtruss -a find . in a zfs filesystem on this b77 server is extremely slow, while it is fast in our backup location where we are still using opensolaris b57 and zpool version 3. Writing new data seems normal, we don't see huge issues here. The real problem is doing ls, rm or find in filesystems with lots of files (+5, not in 1 directory, spread in multiple subfolders).

Today I found that not only zpool upgrade exists, but also zfs upgrade; most filesystems are still version 1 while some new ones are already version 3. Running zdb we also saw there is a mismatch in version information: our storage pool is listed as version 3 while the uberblock is at version 9, and when we run zpool upgrade, it tells us all pools are upgraded to the latest version. Below is the zdb output:

zdb stor
    version=3
    name='stor'
    state=0
    txg=6559447
    pool_guid=14464037545511218493
    hostid=341941495
    hostname='fileserver011'
    vdev_tree
        type='root'
        id=0
        guid=14464037545511218493
        children[0]
            type='raidz'
            id=0
            guid=179558698360846845
            nparity=1
            metaslab_array=13
            metaslab_shift=37
            ashift=9
            asize=20914156863488
            is_log=0
            children[0]
                type='disk'
                id=0
                guid=640233961847538260
                path='/dev/dsk/c2t3d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=36
            children[1]
                type='disk'
                id=1
                guid=7833573669820754721
                path='/dev/dsk/c2t4d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=22
            children[2]
                type='disk'
                id=2
                guid=13685988517147825972
                path='/dev/dsk/c2t5d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=17
            children[3]
                type='disk'
                id=3
                guid=13514021245008793227
                path='/dev/dsk/c2t6d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=21
            children[4]
                type='disk'
                id=4
                guid=15871506866153751690
                path='/dev/dsk/c2t9d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=20
            children[5]
                type='disk'
                id=5
                guid=11392907262189654902
                path='/dev/dsk/c2t7d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=19
            children[6]
                type='disk'
                id=6
                guid=8472117762643335828
                path='/dev/dsk/c2t8d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=18

Uberblock
    magic = 00bab10c
    version = 9
    txg = 6692849
    guid_sum = 12266969233845513474
    timestamp = 1197546530 UTC = Thu Dec 13 12:48:50 2007
fileserver

If we compare with zpool home (this pool was created after installing
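Since the poster has just discovered zfs upgrade, a brief sketch of how the per-filesystem version check and upgrade usually look on builds that have the subcommand - the pool/dataset names below are taken from the post, everything else is illustrative:

# zfs upgrade            (lists filesystems that are not yet at the current on-disk version)
# zfs upgrade -a         (upgrades every filesystem in every imported pool)
# zfs upgrade -r stor    (or limit the upgrade to one pool/dataset tree)

The pool version reported by zpool upgrade and the per-filesystem version reported by zfs upgrade are separate numbers; upgrading the pool does not upgrade the filesystems, and the filesystem upgrade is a one-way operation.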
Re: [zfs-discuss] Nice chassis for ZFS server
> ... when the difference between an unrecoverable single bit error is not just 1 bit but the entire file, or corruption of an entire database row (etc), those small and infrequent errors are an extremely big deal.

You are confusing unrecoverable disk errors (which are rare but orders of magnitude more common) with otherwise *undetectable* errors (the occurrence of which is at most once in petabytes by the studies I've seen, rather than once in terabytes), despite my attempt to delineate the difference clearly.

Conventional approaches using scrubbing provide as complete protection against unrecoverable disk errors as ZFS does: it's only the far rarer otherwise *undetectable* errors that ZFS catches and they don't.

- bill
Re: [zfs-discuss] ZIL and snapshots
Heh, interesting to see somebody else using the sheer number of disks in the Thumper to their advantage :)

Have you thought of solid state cache for the ZIL? There's a 16GB battery backed PCI card out there, I don't know how much it costs, but the blog where I saw it mentioned a 20x improvement in performance for small random writes.
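For anyone wanting to try this, recent builds can attach a dedicated intent-log device to an existing pool - the device name below is a made-up placeholder for an NVRAM card or SSD, and the pool must be at a version that supports separate log devices:

# zpool add tank log c5t0d0
# zpool status tank        (the device appears under a separate "logs" section)

Synchronous writes, such as the FILE_SYNC NFS traffic from VMware discussed earlier in this thread, then land on the fast log device instead of the main disks, without giving up ZIL correctness the way zil_disable does.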
Re: [zfs-discuss] Nice chassis for ZFS server
>> ... If the RAID card scrubs its disks
> A scrub without checksum puts a huge burden on disk firmware and error reporting paths :-)

Actually, a scrub without checksum places far less burden on the disks and their firmware than ZFS-style scrubbing does, because it merely has to scan the disk sectors sequentially rather than follow a tree path to each relatively small leaf block. Thus it also compromises runtime operation a lot less as well (though in both cases doing it infrequently in the background should usually reduce any impact to acceptable levels).

- bill
Re: [zfs-discuss] ZIL and snapshots
> Have you thought of solid state cache for the ZIL? There's a 16GB battery backed PCI card out there, I don't know how much it costs, but the blog where I saw it mentioned a 20x improvement in performance for small random writes.

Thought about it, looked in the Sun Store, couldn't find one, and cut the PO. Haven't gone back to get a new approval. I did put a couple of the MTron 32GB SSD drives on the Christmas wish list (aka 2008 budget).

--Joe
Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Great questions.

> 1) First issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary. Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

ZFS already has to deal with potential uberblock partial writes if it contains multiple disk sectors (and it might be prudent even if it doesn't, as Richard's response seems to suggest). Common ways of dealing with this problem include dumping it into the log (in which case the log, with its own internal recovery procedure, becomes the real root of all evil) or cycling around at least two locations per mirror copy (Richard's response suggests that there are considerably more, and that perhaps each one is written in quadruplicate) such that the previous uberblock would still be available if the new write tanked.

ZFS-style snapshots complicate both approaches unless special provisions are taken - e.g., copying the current uberblock on each snapshot and hanging a list of these snapshot uberblock addresses off the current uberblock, though even that might run into interesting complications under the scenario which you describe below. Just using the 'queue' that Richard describes to accumulate snapshot uberblocks would limit the number of concurrent snapshots to less than the size of that queue.

In any event, as long as writes to the secondary copy don't continue after a write failure of the kind that you describe has occurred (save for the kind of catch-up procedure that you mention later), ZFS's internal facilities should not be confused by encountering a partial uberblock update at the secondary, any more than they'd be confused by encountering it on an unreplicated system after restart.

> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc. Normally most replication systems do maintain write ordering, [i]except[/i] for one specific scenario. If the replication is interrupted, for example secondary site down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
> I can see this as having a major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent. If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

My inclination is to say "By repopulating your environment from backups": it is not reasonable to expect *any* file system to operate correctly, or to attempt any kind of comprehensive recovery (other than via something like fsck, with no guarantee of how much you'll get back), when the underlying hardware transparently reorders updates which the file system has explicitly ordered when it presented them.

But you may well be correct in suspecting that there's more potential for data loss should this occur in a ZFS environment than in update-in-place environments, where only portions of the tree structure that were explicitly changed during the connection hiatus would likely be affected by such a recovery interruption (though even there, if a directory changed enough to change its block structure on disk, you could be in more trouble).

> Obviously all filesystems can suffer with this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.

> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting resync, as a fallback.

You're talking about an HDS- or EMC-level snapshot, right?

> This isn't always easy to do, especially since the resync is usually automatic; there is no clear
Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.
J.P. King wrote:
>> Wow, that's a neat idea, and crazy at the same time. But the mknod's minor value can be 0-262143 so it probably would be doable with some loss of memory and efficiency. But maybe not :) (I would need one lofi dev per filesystem right?) Definitely worth remembering if I need to do something small/quick.
> You're confusing lofi and lofs, I think. Have a look at man lofs. Now all _I_ would like is translucent options to that and I'd solve one of my major headaches.

Check ast-open[1] for the 3d command that implements nDFS, the multiple dimension file system, allowing you to overlay directories. The 3d [2] utility allows you to run a command with all file system calls intercepted. Any writes will go into the top level directory, while reads pass through until a matching file is found. System calls are intercepted by an LD_PRELOAD library, so each process can have its own settings.

[1] http://www.research.att.com/~gsf/download/gen/ast-open.html
[2] http://www.research.att.com/~gsf/man/man1/3d.html
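For the lofs side of that suggestion, a small illustration - the directory names are hypothetical. lofs gives a loopback view of an existing directory at a second mountpoint, though unlike 3d/nDFS it is not translucent, so writes go straight through to the underlying filesystem:

# mount -F lofs /export/software /opt/software
# mount -F lofs -o ro /export/software /var/tmp/software-ro    (read-only view of the same tree)

This is the same mechanism zones use for their loopback filesystems, and it is what man lofs describes.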
Re: [zfs-discuss] Nice chassis for ZFS server
On December 13, 2007 12:51:55 PM -0800 can you guess? [EMAIL PROTECTED] wrote:
>> ... when the difference between an unrecoverable single bit error is not just 1 bit but the entire file, or corruption of an entire database row (etc), those small and infrequent errors are an extremely big deal.
> You are confusing unrecoverable disk errors (which are rare but orders of magnitude more common) with otherwise *undetectable* errors (the occurrence of which is at most once in petabytes by the studies I've seen, rather than once in terabytes), despite my attempt to delineate the difference clearly.

No I'm not. I know exactly what you are talking about.

> Conventional approaches using scrubbing provide as complete protection against unrecoverable disk errors as ZFS does: it's only the far rarer otherwise *undetectable* errors that ZFS catches and they don't.

yes. far rarer and yet home users still see them. that the home user ever sees these extremely rare (undetectable) errors may have more to do with poor connection (cables, etc) to the disk, and less to do with disk media errors. enterprise users probably have better connectivity and see errors due to high i/o. just thinking out loud.

regardless, zfs on non-raid provides better protection than zfs on raid (well, depending on raid configuration) so just from the data integrity POV non-raid would generally be preferred. the fact that the type of error being prevented is rare doesn't change that, and i was further arguing that even though it's rare the impact can be high so you don't want to write it off.

-frank
Re: [zfs-discuss] Nice chassis for ZFS server
[EMAIL PROTECTED] said: You are confusing unrecoverable disk errors (which are rare but orders of magnitude more common) with otherwise *undetectable* errors (the occurrence of which is at most once in petabytes by the studies I've seen, rather than once in terabytes), despite my attempt to delineate the difference clearly. I could use a little clarification on how these unrecoverable disk errors behave -- or maybe a lot, depending on one's point of view. So, when one of these once in around ten (or 100) terabytes read events occurs, my understanding is that a read error is returned by the drive, and the corresponding data is lost as far as the drive is concerned. Maybe just a bit is gone, maybe a byte, maybe a disk sector, it probably depends on the disk, OS, driver, and/or the rest of the I/O hardware chain. Am I doing OK so far? Conventional approaches using scrubbing provide as complete protection against unrecoverable disk errors as ZFS does: it's only the far rarer otherwise *undetectable* errors that ZFS catches and they don't. I found it helpful to my own understanding to try restating the above in my own words. Maybe others will as well. If my assumptions are correct about how these unrecoverable disk errors are manifested, then a dumb scrubber will find such errors by simply trying to read everything on disk -- no additional checksum is required. Without some form of parity or replication, the data is lost, but at least somebody will know about it. Now it seems to me that without parity/replication, there's not much point in doing the scrubbing, because you could just wait for the error to be detected when someone tries to read the data for real. It's only if you can repair such an error (before the data is needed) that such scrubbing is useful. For those well-versed in this stuff, apologies for stating the obvious. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
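To make the "dumb scrubber" point concrete: simply reading every sector is enough to surface a latent unrecoverable read error, with no extra checksum needed. A minimal sketch, assuming a Solaris box where c0t1d0 is the disk to sweep (device name is illustrative; on a live ZFS pool you would normally just run 'zpool scrub' instead):

  # read the whole disk via the raw device; an unreadable sector shows up as an I/O error
  dd if=/dev/rdsk/c0t1d0s2 of=/dev/null bs=1024k
  # see whether the driver logged retries or hard errors
  dmesg | tail -20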
Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Steve McKinty wrote: 1) First issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary. Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently? Yes, ZFS uberblocks are self-checksummed with SHA-256 and when opening the pool it uses the latest valid uberblock that it can find. So that is not a problem. 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc. Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example secondary site down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts. I can see this as having major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent. If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved? I believe your understanding is correct. If you expect such a double failure, you cannot rely on being able to recover your pool at the secondary site. The newest uberblocks would be among the first blocks to be replicated (2 of the uberblock arrays are situated at the start of the vdev) and your whole block tree might be inaccessible if the latest Meta Object Set blocks were not also replicated. You might be lucky and be able to mount your filesystems because ZFS keeps 3 separate copies of the most important metadata and it tries to keep each copy about 1/8th of the disk apart, but even then I wouldn't count on it. If ZFS can't open the pool due to this kind of corruption, you would get the following message: status: The pool metadata is corrupted and the pool cannot be opened. action: Destroy and re-create the pool from a backup source. At this point, you could try zeroing out the first 2 uberblock arrays so that ZFS tries using an older uberblock from the last 2 arrays, but this might not work. As the message says, the only reliable way to recover from this is restoring your pool from backups. There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshot of all LUNs in a pool. 
I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies. If the replication process is interrupted for a sufficiently long time and disaster strikes at the primary site *during resync*, I don't think snapshots would save you even if you had taken them at the right time. Snapshots might increase your chances of recovery (by preventing ZFS from freeing and reusing blocks), but AFAIK there wouldn't be any guarantee that you'd be able to recover anything whatsoever, since the most important pool metadata is not part of the snapshots. Regards, Ricardo -- Ricardo Manuel Correia Lustre Engineering Sun Microsystems, Inc. Portugal ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
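If you do end up poking at a pool whose resync was cut short, zdb can at least show which uberblocks and labels the devices currently carry before you decide whether a recovery attempt is worthwhile. This is a minimal sketch only, not a recovery procedure, assuming an importable pool named tank, an illustrative device path, and a zdb build that accepts these options:

  # show the uberblock(s) the pool would select; extra u's add detail
  zdb -uuu tank
  # dump all four label copies on one device to compare their contents
  zdb -l /dev/rdsk/c1t0d0s0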
Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.
NOC staff couldn't reboot it after the quotacheck crash, and I only just got around to going to the Datacenter. This time I disabled NFS, and the rsync that was running, and ran just quotacheck and it completed successfully. The reason it didn't boot was that damned boot-archive again. Seriously! Anyway, I did get a vmcore from the crash, but maybe it isn't so interesting. I will continue with the stress testing of UFS on zpool as it is the only solution that would be acceptable. Not given up yet, I have a few more weeks to keep trying. :) -rw-r--r-- 1 root root 2345863 Dec 14 09:57 unix.0 -rw-r--r-- 1 root root 4741623808 Dec 14 10:05 vmcore.0 bash-3.00# adb -k unix.0 vmcore.0 physmem 3f9789 $c top_end_sync+0xcb(ff0a5923d000, ff001f175524, b, 0) ufs_fsync+0x1cb(ff62e757ad80, 1, fffedd6d2020) fop_fsync+0x51(ff62e757ad80, 1, fffedd6d2020) rfs3_setattr+0x3a3(ff001f1757c8, ff001f1758b8, ff1a0d942080, ff001f175b20, fffedd6d2020) common_dispatch+0x444(ff001f175b20, ff0a5a4baa80, 2, 4, f7c7ea78 , c06003d0) rfs_dispatch+0x2d(ff001f175b20, ff0a5a4baa80) svc_getreq+0x1c6(ff0a5a4baa80, fffec7eda6c0) svc_run+0x171(ff62becb72a0) svc_do_run+0x85(1) nfssys+0x748(e, fecf0fc8) sys_syscall32+0x101() BAD TRAP: type=e (#pf Page fault) rp=ff001f175320 addr=0 occurred in module unknown due to a NULL pointer dereference -- Jorgen Lundman | [EMAIL PROTECTED] Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
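As an aside, the same dump can also be inspected with mdb, which is a little friendlier than adb for this sort of thing. A minimal sketch, assuming the savecore files unix.0/vmcore.0 shown above; ::status prints the panic summary, ::stack the panicking thread's stack, ::msgbuf the last console messages, and $q quits:

  mdb -k unix.0 vmcore.0
  ::status
  ::stack
  ::msgbuf
  $q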
Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.
Jorgen, You may want to try running 'bootadm update-archive' (assuming that your boot-archive problem is an out-of-date boot-archive message at boot) and/or doing a clean reboot to let the system try to write an up-to-date boot archive. I would also encourage you to connect the LOM to the network in case you have such issues again; you should then be able to recover remotely. Shawn On Dec 13, 2007, at 10:33 PM, Jorgen Lundman wrote: NOC staff couldn't reboot it after the quotacheck crash, and I only just got around to going to the Datacenter. This time I disabled NFS, and the rsync that was running, and ran just quotacheck and it completed successfully. The reason it didn't boot was that damned boot-archive again. Seriously! Anyway, I did get a vmcore from the crash, but maybe it isn't so interesting. I will continue with the stress testing of UFS on zpool as it is the only solution that would be acceptable. Not given up yet, I have a few more weeks to keep trying. :) -rw-r--r-- 1 root root 2345863 Dec 14 09:57 unix.0 -rw-r--r-- 1 root root 4741623808 Dec 14 10:05 vmcore.0 bash-3.00# adb -k unix.0 vmcore.0 physmem 3f9789 $c top_end_sync+0xcb(ff0a5923d000, ff001f175524, b, 0) ufs_fsync+0x1cb(ff62e757ad80, 1, fffedd6d2020) fop_fsync+0x51(ff62e757ad80, 1, fffedd6d2020) rfs3_setattr+0x3a3(ff001f1757c8, ff001f1758b8, ff1a0d942080, ff001f175b20, fffedd6d2020) common_dispatch+0x444(ff001f175b20, ff0a5a4baa80, 2, 4, f7c7ea78 , c06003d0) rfs_dispatch+0x2d(ff001f175b20, ff0a5a4baa80) svc_getreq+0x1c6(ff0a5a4baa80, fffec7eda6c0) svc_run+0x171(ff62becb72a0) svc_do_run+0x85(1) nfssys+0x748(e, fecf0fc8) sys_syscall32+0x101() BAD TRAP: type=e (#pf Page fault) rp=ff001f175320 addr=0 occurred in module unknown due to a NULL pointer dereference -- Jorgen Lundman | [EMAIL PROTECTED] Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Shawn Ferry shawn.ferry at sun.com Senior Primary Systems Engineer Sun Managed Operations 571.291.4898 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Nice chassis for ZFS server
I could use a little clarification on how these unrecoverable disk errors behave -- or maybe a lot, depending on one's point of view. So, when one of these once in around ten (or 100) terabytes read events occurs, my understanding is that a read error is returned by the drive, and the corresponding data is lost as far as the drive is concerned. Yes -- the data being one or more disk blocks. (You can't lose a smaller amount of data, from the drive's point of view, since the error correction code covers the whole block.) If my assumptions are correct about how these unrecoverable disk errors are manifested, then a dumb scrubber will find such errors by simply trying to read everything on disk -- no additional checksum is required. Without some form of parity or replication, the data is lost, but at least somebody will know about it. Right. Generally if you have replication and scrubbing, then you'll also re-write any data which was found to be unreadable, thus fixing the problem (and protecting yourself against future loss of the second copy). Now it seems to me that without parity/replication, there's not much point in doing the scrubbing, because you could just wait for the error to be detected when someone tries to read the data for real. It's only if you can repair such an error (before the data is needed) that such scrubbing is useful. Pretty much, though if you're keeping backups, you could recover the data from backup at this point. Of course, backups could be considered a form of replication, but most of us in file systems don't think of them that way. Anton This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
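For completeness, the scrub-plus-repair behaviour described above maps onto ZFS's own commands. A minimal sketch, assuming a redundant pool named tank:

  # walk every allocated block, verify checksums, and rewrite any bad copy
  # from the surviving side of the mirror/raidz
  zpool scrub tank
  # progress plus per-device read/write/checksum error counters
  zpool status -v tank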
Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.
Shawn Ferry wrote: Jorgen, You may want to try running 'bootadm update-archive' (assuming that your boot-archive problem is an out-of-date boot-archive message at boot) and/or doing a clean reboot to let the system try to write an up-to-date boot archive. Yeah, it's remembering to do so after something has changed that's hard. In this case, I had to break the mirror to install OpenSolaris (shame that the CD/DVD and miniroot don't have the md driver). It would be tempting to add bootadm update-archive to the boot process, as I would rather have it come up half-assed than not come up at all. And yes, other servers are on remote access, but since this was a temporary trial we only ran 1 network cable, and 2x 200V cables. Should have done a proper job at the start, I guess. This time I made sure it was reboot-safe :) Lund I would also encourage you to connect the LOM to the network in case you have such issues again; you should then be able to recover remotely. Shawn On Dec 13, 2007, at 10:33 PM, Jorgen Lundman wrote: NOC staff couldn't reboot it after the quotacheck crash, and I only just got around to going to the Datacenter. This time I disabled NFS, and the rsync that was running, and ran just quotacheck and it completed successfully. The reason it didn't boot was that damned boot-archive again. Seriously! Anyway, I did get a vmcore from the crash, but maybe it isn't so interesting. I will continue with the stress testing of UFS on zpool as it is the only solution that would be acceptable. Not given up yet, I have a few more weeks to keep trying. :) -rw-r--r-- 1 root root 2345863 Dec 14 09:57 unix.0 -rw-r--r-- 1 root root 4741623808 Dec 14 10:05 vmcore.0 bash-3.00# adb -k unix.0 vmcore.0 physmem 3f9789 $c top_end_sync+0xcb(ff0a5923d000, ff001f175524, b, 0) ufs_fsync+0x1cb(ff62e757ad80, 1, fffedd6d2020) fop_fsync+0x51(ff62e757ad80, 1, fffedd6d2020) rfs3_setattr+0x3a3(ff001f1757c8, ff001f1758b8, ff1a0d942080, ff001f175b20, fffedd6d2020) common_dispatch+0x444(ff001f175b20, ff0a5a4baa80, 2, 4, f7c7ea78 , c06003d0) rfs_dispatch+0x2d(ff001f175b20, ff0a5a4baa80) svc_getreq+0x1c6(ff0a5a4baa80, fffec7eda6c0) svc_run+0x171(ff62becb72a0) svc_do_run+0x85(1) nfssys+0x748(e, fecf0fc8) sys_syscall32+0x101() BAD TRAP: type=e (#pf Page fault) rp=ff001f175320 addr=0 occurred in module unknown due to a NULL pointer dereference -- Jorgen Lundman | [EMAIL PROTECTED] Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Shawn Ferry shawn.ferry at sun.com Senior Primary Systems Engineer Sun Managed Operations 571.291.4898 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Jorgen Lundman | [EMAIL PROTECTED] Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
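On the "tempting to add it to the boot process" idea, a lower-risk variant is to make sure the archive is rebuilt before any deliberate reboot rather than patching it up at boot time. This is only a minimal sketch with an illustrative script name and path, assuming reboots are done by hand:

  #!/sbin/sh
  # /usr/local/sbin/safe-reboot: refuse to reboot unless the boot archive rebuilds cleanly
  bootadm update-archive || exit 1
  sync
  init 6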
Re: [zfs-discuss] Nice chassis for ZFS server
... Now it seems to me that without parity/replication, there's not much point in doing the scrubbing, because you could just wait for the error to be detected when someone tries to read the data for real. It's only if you can repair such an error (before the data is needed) that such scrubbing is useful. Pretty much. I think I've read (possibly in the 'MAID' descriptions) the contention that at least some unreadable sectors get there in stages, such that if you catch them early they will be only difficult to read rather than completely unreadable. In such a case, scrubbing is worthwhile even without replication, because it finds the problem early enough that the disk itself (or higher-level mechanisms, if the disk gives up but the higher level is more persistent) will revector the sector when it finds it difficult (but not impossible) to read. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
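If you want to watch for sectors going marginal in the staged way described above, the drive's own SMART counters are one place to look. A minimal sketch, assuming smartmontools is installed and noting that the device path and attribute names (which vary by vendor and by how the disk is attached) are illustrative only:

  # rising Reallocated_Sector_Ct or Current_Pending_Sector values suggest
  # sectors that are becoming hard to read
  smartctl -A /dev/rdsk/c0t1d0s0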
Re: [zfs-discuss] Nice chassis for ZFS server
On Dec 14, 2007 1:12 AM, can you guess? [EMAIL PROTECTED] wrote: yes. far rarer and yet home users still see them. I'd need to see evidence of that for current hardware. What would constitute evidence? Do anecdotal tales from home users qualify? I have two disks (and one controller!) that generate several checksum errors per day each. I've also seen intermittent checksum failures that go away once all the cables are wiggled. Unlikely, since transfers over those connections have been protected by 32-bit CRCs since ATA buses went to 33 or 66 MB/sec. (SATA has even stronger protection) The ATA/7 spec specifies a 32-bit CRC (older ones used a 16-bit CRC) [1]. The Serial ATA protocol also specifies 32-bit CRCs beneath 8b/10b coding (1.0a, p. 159) [2]. That's not much stronger at all. Will [1] http://www.t10.org/t13/project/d1532v3r4a-ATA-ATAPI-7.pdf [2] http://www.ece.umd.edu/courses/enee759h.S2003/references/serialata10a.pdf ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss