Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
Just out of curiosity - what Supermicro chassis did you get? I've got the following items shipping to me right now, with SSD drives and 2TB main drives coming as soon as the system boots and performs normally (using 8 extra 500GB Barracuda ES.2 drives as test drives).

http://www.acmemicro.com/estore/merchant.ihtml?pid=5440&lastcatid=53&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16820139043
http://www.acmemicro.com/estore/merchant.ihtml?pid=4518&step=4
http://www.acmemicro.com/estore/merchant.ihtml?pid=6708&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16819117187
http://www.newegg.com/Product/Product.aspx?Item=N82E16835203002
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
No SSD Log device yet. I also tried disabling the ZIL, with no effect on performance.

Also - what's the best way to test local performance? I'm _somewhat_ dumb as far as opensolaris goes, so if you could provide me with an exact command line for testing my current setup (exactly as it appears above) I'd love to report the local I/O readings.
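For instance, would something like this be a sane local baseline (just a rough sketch - it assumes the pool is named "data" and mounted at /data, and the file name is arbitrary)?

  # write ~2 GB locally and time it, taking iSCSI and the network out of the picture
  /usr/bin/time dd if=/dev/zero of=/data/ddtest bs=1024k count=2000
  # in another terminal, watch the disks while it runs
  zpool iostat -v data 10
  # clean up
  rm /data/ddtest

If that number is well above what I see over iSCSI, I'll assume the bottleneck is the target or the network rather than the pool.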
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
On Wed, Feb 17, 2010 at 10:42 PM, Matt wrote:
>
> I've got a very similar rig to the OP showing up next week (plus an
> infiniband card). I'd love to get this performing up to GB Ethernet speeds,
> otherwise I may have to abandon the iSCSI project if I can't get it to
> perform.

Do you have an SSD log device? If not, try disabling the ZIL temporarily to see if that helps. Your workload will likely benefit from a log device.

-- 
Brent Jones
br...@servuhome.net
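PS: roughly, the two options look like this (a sketch for builds of that vintage - the SSD device name below is just a placeholder):

  # test only: disable the ZIL globally via the old /etc/system tunable
  # (needs a reboot, affects every pool on the box, not safe for data you care about)
  echo "set zfs:zil_disable = 1" >> /etc/system
  # the real fix: dedicate an SSD to the pool as a separate log device
  zpool add data log c2t0d0

If the copy speeds jump with the ZIL off, a slog will get you most of that back without giving up synchronous-write safety.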
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
Just wanted to add that I'm in the exact same boat - I'm connecting from a Windows system and getting just horrid iSCSI transfer speeds. I've tried updating to COMSTAR (although I'm not certain that I'm actually using it) to no avail, and I tried updating to the latest DEV version of OpenSolaris. All that resulted from updating to the latest DEV version was a completely broken system that I couldn't access the command line on. Fortunately I was able to roll back to the previous version and keep tinkering.

Anyone have any ideas as to what could really be causing this slowdown? I've got 5 x 500GB Seagate Barracuda ES.2 drives that I'm using for my zpools, and I've done the following:

1 - zpool create data mirror c0t0d0 c0t1d0
2 - zfs create -s -V 600g data/iscsitarget
3 - sbdadm create-lu /dev/zvol/rdsk/data/iscsitarget
4 - stmfadm add-view xx

So I've got a 500GB RAID1 zpool, and I've created a 600GB sparse volume on top of it, shared it via iSCSI, and connected to it. Everything works great up until I copy files to it - then it's just sluggish. I start to copy a file from my Windows 7 system to the iSCSI target, then pull up iostat using this command:

zpool iostat -v data 10

It shows me this:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         895M   463G      0    666      0  7.93M
  mirror     895M   463G      0    666      0  7.93M
    c0t0d0      -      -      0    269      0  7.91M
    c0t1d0      -      -      0    272      0  7.93M
----------  -----  -----  -----  -----  -----  -----

So I figure, since ZFS is pretty sweet, how about I add some additional drives. That should bump up my performance. I execute this:

zpool add data mirror c1t0d0 c1t1d0

It adds it to my zpool, and I run iostat again, while the copy is still running:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        1.17G   927G      0    738  1.58K  8.87M
  mirror    1.17G   463G      0    390  1.58K  4.61M
    c0t0d0      -      -      0    172  1.58K  4.61M
    c0t1d0      -      -      0    175      0  4.61M
  mirror    42.5K   464G      0    348      0  4.27M
    c1t0d0      -      -      0    156      0  4.27M
    c1t1d0      -      -      0    159      0  4.27M
----------  -----  -----  -----  -----  -----  -----

I get a whopping extra 1MB/sec by adding two drives. It fluctuates a lot, sometimes dropping down to 4MB/sec, sometimes rocketing all the way up to 20MB/sec, but nothing consistent. Basically, my transfer rates are the same no matter how many drives I add to the zpool. Is there anything I am missing here?

BTW - "test" server specs:
AMD dual core 6000+
2GB RAM
Onboard SATA controller
Onboard Ethernet (gigabit)

I've got a very similar rig to the OP showing up next week (plus an InfiniBand card). I'd love to get this performing up to GB Ethernet speeds, otherwise I may have to abandon the iSCSI project if I can't get it to perform.
Re: [zfs-discuss] Proposed idea for enhancement - damage control
Dan,

Exactly what I meant. An allocation policy that will help in distributing the data in such a way that when one disk (an entire mirror) is lost, some data remains fully accessible, as opposed to not being able to access pieces all over the storage pool.
Re: [zfs-discuss] Help with corrupted pool
Create a new empty pool on the Solaris system and let it format the disks, i.e. use the whole-disk names cXtXd0. This should put the EFI label on the disks and then set up the partitions for you. Just in case, here is an example below.

Then go back to the Linux box and see if you can use its tools to see the same partition layout; if you can, then dd the data to the correct spot, which in Solaris is c5t2d0s0. (zfs send | zfs recv would be easier.)

-bash-4.0$ pfexec fdisk -R -W - /dev/rdsk/c5t2d0p0

* /dev/rdsk/c5t2d0p0 default fdisk table
* Dimensions:
*    512 bytes/sector
*    126 sectors/track
*    255 tracks/cylinder
*  60800 cylinders
*
* systid:
*    1: DOSOS12
*  238: EFI_PMBR
*  239: EFI_FS
*
* Id    Act  Bhead  Bsect  Bcyl   Ehead  Esect  Ecyl   Rsect       Numsect
  238   0    255    63     1023   255    63     1023   1           1953525167
  0     0    0      0      0      0      0      0      0           0
  0     0    0      0      0      0      0      0      0           0
  0     0    0      0      0      0      0      0      0           0

-bash-4.0$ pfexec prtvtoc /dev/rdsk/c5t2d0

* /dev/rdsk/c5t2d0 partition map
*
* Dimensions:
*     512 bytes/sector
* 1953525168 sectors
* 1953525101 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*           34       222       255
*
*                          First      Sector        Last
* Partition  Tag  Flags    Sector      Count         Sector    Mount Directory
        0     4    00           256  1953508495  1953508750
        8    11    00    1953508751       16384  1953525134
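In other words, something like this (a sketch - "scratch" is a throwaway pool name and the disk is whichever one you are relabeling):

  # let ZFS write the EFI label and slice 0, then discard the pool;
  # the label and slices stay behind on the disk
  pfexec zpool create -f scratch c5t2d0
  pfexec zpool destroy scratch
  pfexec prtvtoc /dev/rdsk/c5t2d0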
Re: [zfs-discuss] zpool status output confusing
The links look fine, and I am pretty sure (though not 100%) that this is related to the vdev id assignment. What I am not sure about is whether this is still an Areca firmware issue or an opensolaris issue.

ls -l /dev/dsk/c7t1d?p0
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d0p0 -> ../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,0:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d1p0 -> ../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,1:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d2p0 -> ../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,2:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d3p0 -> ../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,3:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d4p0 -> ../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,4:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d5p0 -> ../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,5:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d6p0 -> ../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,6:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d7p0 -> ../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,7:q
Re: [zfs-discuss] Help with corrupted pool
On Wed, Feb 17, 2010 at 23:21, Bob Friesenhahn wrote:
> On Wed, 17 Feb 2010, Ethan wrote:
>>
>> I should have a partition table, for one thing, I suppose. The partition
>> table is EFI GUID Partition Table, looking at the relevant documentation.
>> So, I'll need to somehow shift my zfs data down by 17408 bytes (34 512-byte
>> LBA's, the size of the GPT's stuff at the beginning of the disk) - perhaps
>> just by
>>
>> Does that sound correct / sensible? Am I missing or mistaking anything?
>
> It seems to me that you could also use the approach of 'zpool replace' for
> each device in turn until all of the devices are re-written to normal
> Solaris/zfs defaults. This would also allow you to expand the partition
> size a bit for a larger pool.
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

That is true. It seems like it would then have to rebuild from parity for every drive, though, which I think would take rather a long while, wouldn't it?

I could put in a new drive to overwrite. Then the replace command would just copy from the old drive rather than rebuilding from parity (I think? that seems like the sensible thing for it to do, anyway). But I don't have a spare drive for this - I have the original drives that still contain the truecrypt volumes, but I am disinclined to start overwriting those, this episode having given me a healthy paranoia about having good backups.

I guess this question just comes down to weighing whether rebuilding each drive from parity or re-copying from the truecrypt volumes to a different offset is more of a hassle.

-Ethan
[zfs-discuss] Killing an EFI label
Since this seems to be a ubiquitous problem for people running ZFS, even though it's really a general Solaris admin issue, I'm guessing the expertise is actually here, so I'm asking here.

I found lots of online pages telling how to do it. None of them were correct or complete. I think. I seem to have accomplished it in a somewhat hackish fashion, possibly not cleanly, and I'm now trying to really understand this (I've always found SunOS' idea of overlapping partitions so insanely stupid that it turns my brain off, and combining that with x86-style real disk partitions and calling them both the same thing except when we don't has probably induced permanent brain damage by this point).

First step: invoke format -e and use "label" to write an SMI label to the disk. This part is sort-of documented, and if you let format present you the list of disks and choose one, everything works out. What about the other syntax, where you specify the "disk" to format on the command line? What are valid device files? Is it any file that ends up pointing to some portion of the correct physical disk?

But after this, it appears to be necessary to manually set up partitions (slices). At least, without doing that, I couldn't attach the disk to my zpool, which was my goal. Am I missing something?

And, when manually setting up partitions, I have no idea if what I did is right. Well, a bit of an idea; I know that installgrub did NOT overwrite anything that a scrub detected, so that means I left enough blank space somewhere. Not sure it's the right place, though. Did I have to do this? Every way I tried to avoid this resulted in failure to attach, but none of the instructions listed this step.

This is how format prints the partitions I created:

partition> p
Current partition table (original):
Total disk cylinders available: 19454 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       1 - 19453      149.02GB    (19453/0/0) 312512445
  1 unassigned    wm       0                0         (0/0/0)             0
  2     backup    wu       0 - 19453      149.03GB    (19454/0/0) 312528510
  3 unassigned    wm       0                0         (0/0/0)             0
  4 unassigned    wm       0                0         (0/0/0)             0
  5 unassigned    wm       0                0         (0/0/0)             0
  6 unassigned    wm       0                0         (0/0/0)             0
  7 unassigned    wm       0                0         (0/0/0)             0
  8       boot    wu       0 - 0            7.84MB    (1/0/0)         16065
  9 unassigned    wm       0                0         (0/0/0)             0

And here's the one on the other disk -- yikes, it looks like it ended up with a completely different geometry! (These are two identical drives.)

Current partition table (original):
Total disk cylinders available: 152615 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       1 - 152614      149.04GB    (152614/0/0) 312553472
  1       swap    wu       0                 0         (0/0/0)              0
  2     backup    wu       0 - 152616      149.04GB    (152617/0/0) 312559616
  3 unassigned    wm       0                 0         (0/0/0)              0
  4 unassigned    wm       0                 0         (0/0/0)              0
  5 unassigned    wm       0                 0         (0/0/0)              0
  6        usr    wm       1 - 152614      149.04GB    (152614/0/0) 312553472
  7 unassigned    wm       0                 0         (0/0/0)              0
  8       boot    wu       0 - 0             1.00MB    (1/0/0)           2048
  9 alternates    wm       0                 0         (0/0/0)              0

How do I fix that? For scsi disks (these are SATA disks on an SAS controller, does that count?) it's supposed to figure that out itself, I thought? I certainly never entered disk geometry figures.

The pool is using s0:

  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool
        will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c4t1d0s0  ONLINE       0     0     0
            c9t0d0s0  ONLINE       0     0     0
            c9t2d0s0  ONLINE       0     0     0

errors: No known data errors

Once I decide that these

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
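PS: would copying the slice table from the disk whose layout I like onto the other one be the sane fix? Something like this, I mean (a sketch - the two device names are stand-ins for whichever of my disks is which, and it assumes the target reports at least as many sectors as the source):

  # run as root (or prefix with pfexec); replicate the source disk's slice table
  prtvtoc /dev/rdsk/c4t0d0s2 | fmthard -s - /dev/rdsk/c4t1d0s2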
Re: [zfs-discuss] Help with corrupted pool
On Wed, 17 Feb 2010, Ethan wrote:
> I should have a partition table, for one thing, I suppose. The partition
> table is EFI GUID Partition Table, looking at the relevant documentation.
> So, I'll need to somehow shift my zfs data down by 17408 bytes (34 512-byte
> LBA's, the size of the GPT's stuff at the beginning of the disk) - perhaps
> just by
>
> Does that sound correct / sensible? Am I missing or mistaking anything?

It seems to me that you could also use the approach of 'zpool replace' for each device in turn until all of the devices are re-written to normal Solaris/zfs defaults. This would also allow you to expand the partition size a bit for a larger pool.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
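PS: the per-device cycle would look roughly like this (a sketch - the second device name is only a placeholder for whatever properly labeled disk or slice the data lands on, and each resilver must finish before you move to the next member):

  # replace one raidz member at a time; c1t0d0 is a placeholder
  zpool replace q /export/home/ethan/qdsk/c9t4d0p0 c1t0d0
  zpool status q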
Re: [zfs-discuss] zpool status output confusing
On 2/17/2010 9:59 PM, Moshe Vainer wrote:

> I have another very weird one, looks like a recurrence of the same issue but with the new firmware.
> We have the following disks:
>
> AVAILABLE DISK SELECTIONS:
>        0. c7t1d0 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,0
>        1. c7t1d1 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,1
>        2. c7t1d2 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,2
>        3. c7t1d3 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,3
>        4. c7t1d4 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,4
>        5. c7t1d5 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,5
>        6. c7t1d6 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,6
>        7. c7t1d7 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,7
>
> rpool uses c7t1d7:
>
> # zpool status
>   pool: rpool
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         rpool       ONLINE       0     0     0
>           c7t1d7s0  ONLINE       0     0     0
>
> errors: No known data errors
>
> I tried to create the following tank:
>
> zpool create -f tank \
>     raidz2 \
>       c7t1d0 \
>       c7t1d1 \
>       c7t1d2 \
>       c7t1d3 \
>       c7t1d4 \
>       c7t1d5 \
>     spare \
>       c7t1d6
>
> # ./mktank.sh
> invalid vdev specification
> the following errors must be manually repaired:
> /dev/dsk/c7t1d7s0 is part of active ZFS pool rpool. Please see zpool(1M).
>
> So clearly, it confuses one of the other drives with c7t1d7. What's even weirder - this is after a
> clean reinstall of Solaris (it's a test box). Any ideas on how to clean the state? James, if you
> read this, is this the same issue?

Well, I'd certainly chase through the symbolic links to find if the device files were pointing to the wrong places in the end, or if the problem is lower in the stack than that. Since it's a clean install, it's a Solaris bug at some level either way, sounds like.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
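PS: another quick, read-only check along the same lines is to ask each device directly which pool, if any, claims it (a sketch):

  # the rpool name/guid should turn up on exactly one of these
  for d in 0 1 2 3 4 5 6 7; do
    echo "== c7t1d${d}s0"
    zdb -l /dev/rdsk/c7t1d${d}s0 | egrep 'name|guid' | head -4
  done

If a label shows up on something other than c7t1d7s0, the links (or the controller's LUN mapping) really are crossed.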
Re: [zfs-discuss] zpool status output confusing
I have another very weird one, looks like a recurrence of the same issue but with the new firmware. We have the following disks:

AVAILABLE DISK SELECTIONS:
       0. c7t1d0 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,0
       1. c7t1d1 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,1
       2. c7t1d2 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,2
       3. c7t1d3 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,3
       4. c7t1d4 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,4
       5. c7t1d5 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,5
       6. c7t1d6 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,6
       7. c7t1d7 /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,7

rpool uses c7t1d7:

# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c7t1d7s0  ONLINE       0     0     0

errors: No known data errors

I tried to create the following tank:

zpool create -f tank \
    raidz2 \
      c7t1d0 \
      c7t1d1 \
      c7t1d2 \
      c7t1d3 \
      c7t1d4 \
      c7t1d5 \
    spare \
      c7t1d6

# ./mktank.sh
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c7t1d7s0 is part of active ZFS pool rpool. Please see zpool(1M).

So clearly, it confuses one of the other drives with c7t1d7. What's even weirder - this is after a clean reinstall of Solaris (it's a test box). Any ideas on how to clean the state? James, if you read this, is this the same issue?

Thanks in advance,
Moshe
Re: [zfs-discuss] Help with corrupted pool
On Wed, Feb 17, 2010 at 18:24, Daniel Carosone wrote:
> On Wed, Feb 17, 2010 at 06:15:25PM -0500, Ethan wrote:
> > Success!
>
> Awesome. Let that scrub finish before celebrating completely, but
> this looks like a good place to stop and consider what you want for an
> end state.
>
> --
> Dan.

True. Thinking about where to end up - I will be staying on opensolaris despite having no truecrypt. My paranoia likes having encryption, but it's not really necessary for me, and it looks like encryption will be coming to zfs itself soon enough. So, no need to consider getting things working on zfs-fuse again.

I should have a partition table, for one thing, I suppose. The partition table is EFI GUID Partition Table, looking at the relevant documentation. So, I'll need to somehow shift my zfs data down by 17408 bytes (34 512-byte LBA's, the size of the GPT's stuff at the beginning of the disk) - perhaps just by copying from the truecrypt volumes as I did before, but with offset of 17408 bytes. Then I should be able to use format to make the correct partition information, and use the s0 partition for each drive as seems to be the standard way of doing things. Or maybe I can format (write the GPT) first, then get linux to recognize the GPT, and copy from truecrypt into the partition.

Does that sound correct / sensible? Am I missing or mistaking anything?

Thanks,
-Ethan

PS: scrub in progress for 4h4m, 65.43% done, 2h9m to go - no errors yet. Looking good.
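PPS: concretely, I'm picturing one of these two on the Linux side (rough sketches with GNU dd, where seek= counts output blocks):

  # (a) raw copy shifted down past the 34-sector GPT header area;
  #     the partition table itself would get written afterwards with format
  dd if=/dev/mapper/truecrypt1 of=/dev/sdb bs=512 seek=34
  # (b) or write the GPT first and copy into the new first partition,
  #     letting the kernel handle the offset
  dd if=/dev/mapper/truecrypt1 of=/dev/sdb1 bs=1M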
Re: [zfs-discuss] Proposed idea for enhancement - damage control
On Wed, Feb 17, 2010 at 02:38:04PM -0500, Miles Nordin wrote:
> copies=2 has proven to be mostly useless in practice.

I disagree. Perhaps my cases fit under the weasel-word "mostly", but single-disk laptops are a pretty common use-case.

> If there were a real-world device that tended to randomly flip bits,
> or randomly replace swaths of LBA's with zeroes

As an aside, there can be non-device causes of this, especially when sharing disks with other operating systems, booting livecd's and etc.

> * drives do not immediately turn red and start brrk-brrking when they
>   go bad. In the real world, they develop latent sector errors,
>   which you will not discover and mark the drive bad until you scrub
>   or coincidentally happen to read the file that accumulated the
>   error.

Yes, exactly - at this point, with copies=1, you get a signal that your drive is about to go bad, and that data has been lost. With copies=2, you get a signal that your drive is about to go bad, but less disruption and data loss to go with it. Note that pool metadata is inherently using ditto blocks for precisely this reason.

I dunno about BER spec, but I have seen sectors go unreadable many times. Sometimes, as you note, in combination with other problems or further deterioration, sometimes not. Regardless of what you do in response, and how soon you replace the drive, copies >1 can cover that interval.

I agree fully, copies=2 is not a substitute for backup replication of whatever flavour you prefer. It is a useful supplement. Misunderstanding this is dangerous.

-- 
Dan.
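PS: for anyone following along, it's a one-liner per dataset, and it only affects blocks written after it is set (the dataset name here is just an example):

  zfs set copies=2 rpool/export/home
  zfs get copies rpool/export/home

Existing files keep their single copy until they are rewritten, so set it early or re-copy the data you care about.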
Re: [zfs-discuss] Proposed idea for enhancement - damage control
>>If there were a real-world device that tended to randomly flip bits,
>>or randomly replace swaths of LBA's with zeroes, but otherwise behave
>>normally (not return any errors, not slow down retrying reads, not
>>fail to attach), then copies=2 would be really valuable, but so far it
>>seems no such device exists. If you actually explore the errors that
>>really happen I venture there are few to no cases copies=2 would save
>>you.
>
>I had a device which had 256 bytes of the 32MB broken (some were "1", some
>were always "0"). But I never put it online because it was so broken.

Of the 32MB cache, sorry.

Casper
Re: [zfs-discuss] Proposed idea for enhancement - damage control
>If there were a real-world device that tended to randomly flip bits,
>or randomly replace swaths of LBA's with zeroes, but otherwise behave
>normally (not return any errors, not slow down retrying reads, not
>fail to attach), then copies=2 would be really valuable, but so far it
>seems no such device exists. If you actually explore the errors that
>really happen I venture there are few to no cases copies=2 would save
>you.

I had a device which had 256 bytes of the 32MB broken (some were "1", some were always "0"). But I never put it online because it was so broken.

Casper
Re: [zfs-discuss] Proposed idea for enhancement - damage control
On Feb 17, 2010, at 12:34, Richard Elling wrote:
> I'm not sure how to connect those into the system (USB 3?), but when you build it, let us know how it works out.

FireWire 3200, preferably. Anyone know if USB 3 sucks as much CPU as previous versions? If I'm burning CPU on I/O, I'd rather have it going to checksums than to polling cheap-ass USB controllers that need to be babysat.
Re: [zfs-discuss] Help with corrupted pool
On Wed, Feb 17, 2010 at 06:15:25PM -0500, Ethan wrote:
> Success!

Awesome. Let that scrub finish before celebrating completely, but this looks like a good place to stop and consider what you want for an end state.

-- 
Dan.
Re: [zfs-discuss] Help with corrupted pool
On Wed, Feb 17, 2010 at 17:44, Daniel Carosone wrote:
> On Wed, Feb 17, 2010 at 04:48:23PM -0500, Ethan wrote:
> > It looks like using p0 is exactly what I want, actually. Are s2 and p0 both
> > the entire disk?
>
> No. s2 depends on there being a solaris partition table (Sun or EFI),
> and if there's also an fdisk partition table (disk shared with other
> OS), s2 will only cover the solaris part of the disk. It also
> typically doesn't cover the last 2 cylinders, which solaris calls
> "reserved" for hysterical raisins.
>
> > The idea of symlinking to the full-disk devices from a directory and using
> > -d had crossed my mind, but I wasn't sure about it. I think that is
> > something worth trying.
>
> Note, I haven't tried it either..
>
> > I'm not too concerned about it not working at boot -
> > I just want to get something working at all, at the moment.
>
> Yup.
>
> --
> Dan.

Success!

I made a directory and symlinked p0's for all the disks:

et...@save:~/qdsk# ls -al
total 13
drwxr-xr-x  2 root  root   7 Feb 17 23:06 .
drwxr-xr-x 21 ethan staff 31 Feb 17 14:16 ..
lrwxrwxrwx  1 root  root  17 Feb 17 23:06 c9t0d0p0 -> /dev/dsk/c9t0d0p0
lrwxrwxrwx  1 root  root  17 Feb 17 23:06 c9t1d0p0 -> /dev/dsk/c9t1d0p0
lrwxrwxrwx  1 root  root  17 Feb 17 23:06 c9t2d0p0 -> /dev/dsk/c9t2d0p0
lrwxrwxrwx  1 root  root  17 Feb 17 23:06 c9t4d0p0 -> /dev/dsk/c9t4d0p0
lrwxrwxrwx  1 root  root  17 Feb 17 23:06 c9t5d0p0 -> /dev/dsk/c9t5d0p0

I attempt to import using -d:

et...@save:~/qdsk# zpool import -d .
  pool: q
    id: 5055543090570728034
 state: ONLINE
status: The pool is formatted using an older on-disk version.
action: The pool can be imported using its name or numeric identifier, though
        some features will not be available without an explicit 'zpool upgrade'.
config:

        q                                     ONLINE
          raidz1                              ONLINE
            /export/home/ethan/qdsk/c9t4d0p0  ONLINE
            /export/home/ethan/qdsk/c9t5d0p0  ONLINE
            /export/home/ethan/qdsk/c9t2d0p0  ONLINE
            /export/home/ethan/qdsk/c9t1d0p0  ONLINE
            /export/home/ethan/qdsk/c9t0d0p0  ONLINE

The pool is not imported. This does look promising though. I attempt to import using the name:

et...@save:~/qdsk# zpool import -d . q

It sits there for a while. I worry that it's going to hang forever like it did on linux. But then it returns!

et...@save:~/qdsk# zpool status
  pool: q
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool
        will no longer be accessible on older software versions.
 scrub: scrub in progress for 0h2m, 0.43% done, 8h57m to go
config:

        NAME                                  STATE     READ WRITE CKSUM
        q                                     ONLINE       0     0     0
          raidz1                              ONLINE       0     0     0
            /export/home/ethan/qdsk/c9t4d0p0  ONLINE       0     0     0
            /export/home/ethan/qdsk/c9t5d0p0  ONLINE       0     0     0
            /export/home/ethan/qdsk/c9t2d0p0  ONLINE       0     0     0
            /export/home/ethan/qdsk/c9t1d0p0  ONLINE       0     0     0
            /export/home/ethan/qdsk/c9t0d0p0  ONLINE       0     0     0

errors: No known data errors

All the filesystems are there, all the files are there. Life is good. Thank you all so much.

-Ethan
Re: [zfs-discuss] Help with corrupted pool
On Wed, Feb 17, 2010 at 04:44:19PM -0500, Ethan wrote:
> There was no partitioning on the truecrypt disks. The truecrypt volumes
> occupied the whole raw disks (1500301910016 bytes each). The devices that I
> gave to the zpool on linux were the whole raw devices that truecrypt exposed
> (1500301647872 bytes each). There were no partition tables on either the raw
> disks or the truecrypt volumes, just truecrypt headers on the raw disk and
> zfs on the truecrypt volumes.
>
> I copied the data simply using
>
> dd if=/dev/mapper/truecrypt1 of=/dev/sdb

Ok, then as you noted, you want to start with the ..p0 device, as the equivalent.

> The labels 2 and 3 should be on the drives, but they are 262144 bytes
> further from the end of slice 2 than zpool must be looking.

I don't think so.. They're found by counting from the start; the end can move out further (LUN expansion), and with autoexpand the vdev can be extended (adding metaslabs) and the labels will be rewritten at the new end after the last metaslab.

I think the issue is that there are no partitions on the devices that allow import to read that far. Fooling it into using p0 would work around this.

> I could create a partition table on each drive, specifying a partition with
> the size of the truecrypt volume, and re-copy the data onto this partition
> (would have to re-copy as creating the partition table would overwrite zfs
> data, as zfs starts at byte 0). Would this be preferable?

Eventually, probably, yes - once you've confirmed all the speculation, gotten past the partitioning issue to whatever other damage is in the pool, resolved that, and have some kind of access to your data. There are other options as well, including using replace one at a time, or send|recv.

> I was under some
> impression that zpool devices were preferred to be raw drives, not
> partitions, but I don't recall where I came to believe that much less
> whether it's at all correct.

Sort of. zfs commands can be handed bare disks; internally they put EFI labels on them automatically (though evidently not the fuse variants). ZFS mostly makes partitioning go away (hooray), but it still becomes important in cases like this - shared disks and migration between operating systems.

> as for using import -F, I am on snv_111b, which I am not sure has -F for
> import.

Nope.

> I tried to update to the latest dev build (using the instructions
> at http://pkg.opensolaris.org/dev/en/index.shtml ) but things are behaving
> very strangely. I get error messages on boot - "gconf-sanity-check-2 exited
> with error status 256", and when I dismiss this and go into gnome, terminal
> is messed up and doesn't echo anything I type, and I can't ssh in (error
> message about not able to allocate a TTY). anyway, zfs mailing list isn't
> really the place to be discussing that I suppose.

Not really, but read the release notes. Alternately, if this is a new machine, you could just reinstall (or boot from) a current livecd/usb, downloaded from genunix.org

-- 
Dan.
Re: [zfs-discuss] false DEGRADED status based on "cannot open" device at boot.
On Wed, Feb 17, 2010 at 05:28:03PM -0500, Dennis Clarke wrote:
> Good theory, however, this disk is fully external with its own power.

It can still be commanded to offline state.

-- 
Dan.
Re: [zfs-discuss] Help with corrupted pool
On Wed, Feb 17, 2010 at 04:48:23PM -0500, Ethan wrote:
> It looks like using p0 is exactly what I want, actually. Are s2 and p0 both
> the entire disk?

No. s2 depends on there being a solaris partition table (Sun or EFI), and if there's also an fdisk partition table (disk shared with other OS), s2 will only cover the solaris part of the disk. It also typically doesn't cover the last 2 cylinders, which solaris calls "reserved" for hysterical raisins.

> The idea of symlinking to the full-disk devices from a directory and using
> -d had crossed my mind, but I wasn't sure about it. I think that is
> something worth trying.

Note, I haven't tried it either..

> I'm not too concerned about it not working at boot -
> I just want to get something working at all, at the moment.

Yup.

-- 
Dan.
Re: [zfs-discuss] false DEGRADED status based on "cannot open" device at boot.
Hi Dennis,

You might be running into this issue:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6856341

The workaround is to force load the drivers.

Thanks,

Cindy

On 02/17/10 14:33, Dennis Clarke wrote:

> I find that some servers display a DEGRADED zpool status at boot. More
> troubling is that this seems to be silent and no notice is given on the
> console or via a snmp message or other notification process.
>
> Let me demonstrate:
>
> {0} ok boot -srv
>
> Sun Blade 2500 (Silver), No Keyboard
> Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
> OpenBoot 4.17.3, 4096 MB memory installed, Serial #64510477.
> Ethernet address 0:3:ba:d8:5a:d, Host ID: 83d85a0d.
>
> Rebooting with command: boot -srv
> Boot device: /p...@1d,70/s...@4,1/d...@0,0:a  File and args: -srv
> module /platform/sun4u/kernel/sparcv9/unix: text at [0x100, 0x10a3695] data at 0x180
> module /platform/sun4u/kernel/sparcv9/genunix: text at [0x10a3698, 0x126bbf7] data at 0x1866840
> module /platform/SUNW,Sun-Blade-2500/kernel/misc/sparcv9/platmod: text at [0x126bbf8, 0x126c1e7] data at 0x18bc0c8
> .
> . . many lines of verbose messages . .
> .
> dump on /dev/zvol/dsk/mercury_rpool/swap size 0 MB
> Loading smf(5) service descriptions: 2/2
> Requesting System Maintenance Mode
> SINGLE USER MODE
>
> Root password for system maintenance (control-d to bypass):
> single-user privilege assigned to /dev/console.
> Entering System Maintenance Mode
>
> # zpool list
> NAME            SIZE   USED  AVAIL    CAP  HEALTH    ALTROOT
> mercury_rpool    68G  27.4G  40.6G    40%  DEGRADED  -
>
> # zpool status mercury_rpool
>   pool: mercury_rpool
>  state: DEGRADED
> status: One or more devices could not be opened. Sufficient replicas exist
>         for the pool to continue functioning in a degraded state.
> action: Attach the missing device and online it using 'zpool online'.
>    see: http://www.sun.com/msg/ZFS-8000-2Q
>  scrub: none requested
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         mercury_rpool  DEGRADED     0     0     0
>           mirror       DEGRADED     0     0     0
>             c3t0d0s0   ONLINE       0     0     0
>             c1t2d0s0   UNAVAIL      0     0     0  cannot open
>
> errors: No known data errors
>
> This is trivial to remedy:
>
> # zpool online mercury_rpool c1t2d0s0
> # zpool list
> NAME            SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
> mercury_rpool    68G  27.4G  40.6G    40%  ONLINE  -
>
> # zpool status mercury_rpool
>   pool: mercury_rpool
>  state: ONLINE
> status: The pool is formatted using an older on-disk format. The pool can
>         still be used, but some features are unavailable.
> action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool
>         will no longer be accessible on older software versions.
>  scrub: resilver completed after 0h0m with 0 errors on Wed Feb 17 21:26:11 2010
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         mercury_rpool  ONLINE       0     0     0
>           mirror       ONLINE       0     0     0
>             c3t0d0s0   ONLINE       0     0     0
>             c1t2d0s0   ONLINE       0     0     0  14.5M resilvered
>
> errors: No known data errors
> #
>
> I have many systems where I keep mirrors on multiple controllers, either
> fibre or SCSI. It seems that the SCSI devices don't get detected at boot
> on the Sparc systems. The x86/AMD64 systems do not seem to have this
> problem but I may be wrong.
>
> Is this a known bug or am I seeing something due to a missing line in
> /etc/system ?
>
> Oh, also, I should point out that it does not matter if I boot with init S
> or 3 or 6.
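PS: to spell the workaround out, add forceload entries for your HBA drivers to /etc/system and reboot, along the lines of the following (the driver names here are only a guess - match them to what prtconf -D reports for your controllers):

  * force the SCSI HBA drivers to load early, before ZFS tries to open the pool
  forceload: drv/mpt
  forceload: drv/glm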
Re: [zfs-discuss] false DEGRADED status based on "cannot open" device at boot.
> On Wed, 17 Feb 2010, Dennis Clarke wrote:
>>
>>         NAME           STATE     READ WRITE CKSUM
>>         mercury_rpool  ONLINE       0     0     0
>>           mirror       ONLINE       0     0     0
>>             c3t0d0s0   ONLINE       0     0     0
>>             c1t2d0s0   ONLINE       0     0     0  14.5M resilvered
>>
>> errors: No known data errors
>> #
>>
>> I have many systems where I keep mirrors on multiple controllers, either
>> fibre or SCSI. It seems that the SCSI devices don't get detected at boot
>> on the Sparc systems. The x86/AMD64 systems do not seem to have this
>> problem but I may be wrong.
>>
>> Is this a known bug or am I seeing something due to a missing line in
>> /etc/system ?
>
> My Sun Blade 2500 (Red) does see both boot disks. However, I do
> recall an issue at one time with the Solaris power management daemon
> in that it shut down the second disk during boot so that it was not
> seen. It was mentioned in the Solaris release notes (maybe U5 or U6?)
> and it happened to me. A fix to /etc/power.conf was required.
> Perhaps that is what is happening to you.

Good theory, however, this disk is fully external with its own power.

Strange. I'll go have a look at a V490 I have here ( snv_130 ) and install a few SCSI cards just to see what happens. Maybe this is specific to the SB2500 workstations.

-- 
Dennis Clarke
dcla...@opensolaris.ca  <- Email related to the open source Solaris
dcla...@blastwave.org   <- Email related to open source for Solaris
Re: [zfs-discuss] false DEGRADED status based on "cannot open" device at boot.
On Wed, 17 Feb 2010, Dennis Clarke wrote:
>
>         NAME           STATE     READ WRITE CKSUM
>         mercury_rpool  ONLINE       0     0     0
>           mirror       ONLINE       0     0     0
>             c3t0d0s0   ONLINE       0     0     0
>             c1t2d0s0   ONLINE       0     0     0  14.5M resilvered
>
> errors: No known data errors
> #
>
> I have many systems where I keep mirrors on multiple controllers, either
> fibre or SCSI. It seems that the SCSI devices don't get detected at boot
> on the Sparc systems. The x86/AMD64 systems do not seem to have this
> problem but I may be wrong.
>
> Is this a known bug or am I seeing something due to a missing line in
> /etc/system ?

My Sun Blade 2500 (Red) does see both boot disks. However, I do recall an issue at one time with the Solaris power management daemon in that it shut down the second disk during boot so that it was not seen. It was mentioned in the Solaris release notes (maybe U5 or U6?) and it happened to me. A fix to /etc/power.conf was required. Perhaps that is what is happening to you.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
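PS: if memory serves, the blunt form of the fix was simply to stop power-managing devices altogether - something like this in /etc/power.conf (a sketch; the finer-grained fix uses device-thresholds entries for just the disks):

  # keep devices from being spun down, so the second disk is there at boot
  autopm          disable

followed by running pmconfig (or a reboot) to make it take effect.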
Re: [zfs-discuss] Help with corrupted pool
On Wed, Feb 17, 2010 at 16:25, Daniel Carosone wrote:
> On Thu, Feb 18, 2010 at 08:14:03AM +1100, Daniel Carosone wrote:
> > I think
> > you probably want to make a slice 0 that spans the right disk sectors.
> [..]
> > you could try zdb -l on /dev/dsk/c...p[01234] as well.
>
> Depending on how and what you copied, you may have zfs data that start
> at sector 0, with no space for any partitioning labels at all. If
> zdb -l /dev/rdsk/c..p0 shows a full set, this is what has happened.
> Trying to write partition tables may overwrite some of the zfs labels.
>
> zfs won't import such a pool by default (it doesn't check those
> devices). You could cheat, by making a directory with symlinks to the
> p0 devices, and using import -d, but this will not work at boot. It
> would be a way to verify current state, so you can then plan next
> steps.
>
> --
> Dan.

It looks like using p0 is exactly what I want, actually. Are s2 and p0 both the entire disk?

The idea of symlinking to the full-disk devices from a directory and using -d had crossed my mind, but I wasn't sure about it. I think that is something worth trying. I'm not too concerned about it not working at boot - I just want to get something working at all, at the moment.

-Ethan
Re: [zfs-discuss] Help with corrupted pool
On Wed, Feb 17, 2010 at 16:14, Daniel Carosone wrote:
> On Wed, Feb 17, 2010 at 03:37:59PM -0500, Ethan wrote:
> > On Wed, Feb 17, 2010 at 15:22, Daniel Carosone wrote:
> > I have not yet successfully imported. I can see two ways of making progress
> > forward. One is forcing zpool to attempt to import using slice 2 for each
> > disk rather than slice 8. If this is how autoexpand works, as you say, it
> > seems like it should work fine for this. But I don't know how, or if it is
> > possible to, make it use slice 2.
>
> Just get rid of 8? :-)

That sounds like an excellent idea, but, being very new to opensolaris, I have no idea how to do this. I'm reading through http://multiboot.solaris-x86.org/iv/3.html at the moment. You mention the 'format' utility below, which I will read more into.

> Normally, when using the whole disk, convention is that slice 0 is
> used, and there's a small initial offset (for the EFI label). I think
> you probably want to make a slice 0 that spans the right disk sectors.
>
> Were you using some other partitioning inside the truecrypt "disks"?
> What devices were given to zfs-fuse, and what was their starting
> offset? You may need to account for that, too. How did you copy the
> data, and to what target device, on what platform? Perhaps the
> truecrypt device's partition table is now at the start of the physical
> disk, but solaris can't read it properly? If that's an MBR partition
> table (which you look at with fdisk), you could try zdb -l on
> /dev/dsk/c...p[01234] as well.

There was no partitioning on the truecrypt disks. The truecrypt volumes occupied the whole raw disks (1500301910016 bytes each). The devices that I gave to the zpool on linux were the whole raw devices that truecrypt exposed (1500301647872 bytes each). There were no partition tables on either the raw disks or the truecrypt volumes, just truecrypt headers on the raw disks and zfs on the truecrypt volumes.

I copied the data simply using

dd if=/dev/mapper/truecrypt1 of=/dev/sdb

on linux, where /dev/mapper/truecrypt1 is the truecrypt volume for one hard disk (which was on /dev/sda) and /dev/sdb is a new blank drive of the same size as the old drive (but slightly larger than the truecrypt volume). And repeat likewise for each of the five drives.

The labels 2 and 3 should be on the drives, but they are 262144 bytes further from the end of slice 2 than zpool must be looking.

I could create a partition table on each drive, specifying a partition with the size of the truecrypt volume, and re-copy the data onto this partition (I would have to re-copy, as creating the partition table would overwrite zfs data, since zfs starts at byte 0). Would this be preferable? I was under some impression that zpool devices were preferred to be raw drives, not partitions, but I don't recall where I came to believe that, much less whether it's at all correct.

> We're just guessing here.. to provide more concrete help, you'll need
> to show us some of the specifics, both of what you did and what you've
> ended up with. fdisk and format partition tables and zdb -l output
> would be a good start.
>
> Figuring out what is different about the disk where s2 was used would
> be handy too. That may be a synthetic label because something is
> missing from that disk that the others have.
>
> > The other way is to make a slice that is the correct size of the volumes as
> > I had them before (262144 bytes less than the size of the disk). It seems
> > like this should cause zpool to prefer to use this slice over slice 8, as it
> > can find all 4 labels, rather than just labels 0 and 1. I don't know how to
> > go about this either, or if it's possible. I have been starting to read
> > documentation on slices in solaris but haven't had time to get far enough to
> > figure out what I need.
>
> format will let you examine and edit these. Start by making sure they
> have all the same partitioning, flags, etc.

I will have a look at format, but if this operates on partition tables, well, my disks have none at the moment, so I'll have to remedy that.

> > I also have my doubts about this solving my actual issues - the ones that
> > caused me to be unable to import in zfs-fuse. But I need to solve this issue
> > before I can move forward to figuring out/solving whatever that issue was.
>
> Yeah - my suspicion is that import -F may help here. That is a pool
> recovery mode, where it rolls back progressive transactions until it
> finds one that validates correctly. It was only added recently and is
> probably missing from the fuse version.
>
> --
> Dan.

As for using import -F, I am on snv_111b, which I am not sure has -F for import. I tried to update to the latest dev build (using the instructions at http://pkg.opensolaris.org/dev/en/index.shtml ) but things are behaving very strangely. I get error messages on boot - "gconf-sanity-check-2 exited with error status 256" - and when I dismiss this and go into gnome, the terminal is messed up and doesn't echo anything I type, and I can't ssh in (error message about not being able to allocate a TTY). Anyway, the zfs mailing list isn't really the place to be discussing that, I suppose.

-Ethan
[zfs-discuss] false DEGRADED status based on "cannot open" device at boot.
I find that some servers display a DEGRADED zpool status at boot. More troubling is that this seems to be silent and no notice is given on the console or via a snmp message or other notification process.

Let me demonstrate:

{0} ok boot -srv

Sun Blade 2500 (Silver), No Keyboard
Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.17.3, 4096 MB memory installed, Serial #64510477.
Ethernet address 0:3:ba:d8:5a:d, Host ID: 83d85a0d.

Rebooting with command: boot -srv
Boot device: /p...@1d,70/s...@4,1/d...@0,0:a  File and args: -srv
module /platform/sun4u/kernel/sparcv9/unix: text at [0x100, 0x10a3695] data at 0x180
module /platform/sun4u/kernel/sparcv9/genunix: text at [0x10a3698, 0x126bbf7] data at 0x1866840
module /platform/SUNW,Sun-Blade-2500/kernel/misc/sparcv9/platmod: text at [0x126bbf8, 0x126c1e7] data at 0x18bc0c8
.
. . many lines of verbose messages . .
.
dump on /dev/zvol/dsk/mercury_rpool/swap size 0 MB
Loading smf(5) service descriptions: 2/2
Requesting System Maintenance Mode
SINGLE USER MODE

Root password for system maintenance (control-d to bypass):
single-user privilege assigned to /dev/console.
Entering System Maintenance Mode

# zpool list
NAME            SIZE   USED  AVAIL    CAP  HEALTH    ALTROOT
mercury_rpool    68G  27.4G  40.6G    40%  DEGRADED  -

# zpool status mercury_rpool
  pool: mercury_rpool
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist
        for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        mercury_rpool  DEGRADED     0     0     0
          mirror       DEGRADED     0     0     0
            c3t0d0s0   ONLINE       0     0     0
            c1t2d0s0   UNAVAIL      0     0     0  cannot open

errors: No known data errors

This is trivial to remedy:

# zpool online mercury_rpool c1t2d0s0
# zpool list
NAME            SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
mercury_rpool    68G  27.4G  40.6G    40%  ONLINE  -

# zpool status mercury_rpool
  pool: mercury_rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool
        will no longer be accessible on older software versions.
 scrub: resilver completed after 0h0m with 0 errors on Wed Feb 17 21:26:11 2010
config:

        NAME           STATE     READ WRITE CKSUM
        mercury_rpool  ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c3t0d0s0   ONLINE       0     0     0
            c1t2d0s0   ONLINE       0     0     0  14.5M resilvered

errors: No known data errors
#

I have many systems where I keep mirrors on multiple controllers, either fibre or SCSI. It seems that the SCSI devices don't get detected at boot on the Sparc systems. The x86/AMD64 systems do not seem to have this problem but I may be wrong.

Is this a known bug or am I seeing something due to a missing line in /etc/system ?

Oh, also, I should point out that it does not matter if I boot with init S or 3 or 6.

-- 
Dennis Clarke
dcla...@opensolaris.ca  <- Email related to the open source Solaris
dcla...@blastwave.org   <- Email related to open source for Solaris
Re: [zfs-discuss] Help with corrupted pool
On Thu, Feb 18, 2010 at 08:14:03AM +1100, Daniel Carosone wrote:
> I think
> you probably want to make a slice 0 that spans the right disk sectors.
[..]
> you could try zdb -l on /dev/dsk/c...p[01234] as well.

Depending on how and what you copied, you may have zfs data that start at sector 0, with no space for any partitioning labels at all. If zdb -l /dev/rdsk/c..p0 shows a full set, this is what has happened. Trying to write partition tables may overwrite some of the zfs labels.

zfs won't import such a pool by default (it doesn't check those devices). You could cheat, by making a directory with symlinks to the p0 devices, and using import -d, but this will not work at boot. It would be a way to verify current state, so you can then plan next steps.

-- 
Dan.
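PS: roughly like this (a sketch - directory name and disk list are yours to adjust):

  mkdir ~/qdsk && cd ~/qdsk
  for d in c9t0d0 c9t1d0 c9t2d0 c9t4d0 c9t5d0; do ln -s /dev/dsk/${d}p0 .; done
  zpool import -d .        # should list the pool if the labels are readable
  zpool import -d . q      # then import it by name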
Re: [zfs-discuss] Help with corrupted pool
On Wed, Feb 17, 2010 at 03:37:59PM -0500, Ethan wrote:
> On Wed, Feb 17, 2010 at 15:22, Daniel Carosone wrote:
> I have not yet successfully imported. I can see two ways of making progress
> forward. One is forcing zpool to attempt to import using slice 2 for each
> disk rather than slice 8. If this is how autoexpand works, as you say, it
> seems like it should work fine for this. But I don't know how, or if it is
> possible to, make it use slice 2.

Just get rid of 8? :-)

Normally, when using the whole disk, convention is that slice 0 is used, and there's a small initial offset (for the EFI label). I think you probably want to make a slice 0 that spans the right disk sectors.

Were you using some other partitioning inside the truecrypt "disks"? What devices were given to zfs-fuse, and what was their starting offset? You may need to account for that, too. How did you copy the data, and to what target device, on what platform? Perhaps the truecrypt device's partition table is now at the start of the physical disk, but solaris can't read it properly? If that's an MBR partition table (which you look at with fdisk), you could try zdb -l on /dev/dsk/c...p[01234] as well.

We're just guessing here.. to provide more concrete help, you'll need to show us some of the specifics, both of what you did and what you've ended up with. fdisk and format partition tables and zdb -l output would be a good start.

Figuring out what is different about the disk where s2 was used would be handy too. That may be a synthetic label because something is missing from that disk that the others have.

> The other way is to make a slice that is the correct size of the volumes as
> I had them before (262144 bytes less than the size of the disk). It seems
> like this should cause zpool to prefer to use this slice over slice 8, as it
> can find all 4 labels, rather than just labels 0 and 1. I don't know how to
> go about this either, or if it's possible. I have been starting to read
> documentation on slices in solaris but haven't had time to get far enough to
> figure out what I need.

format will let you examine and edit these. Start by making sure they have all the same partitioning, flags, etc.

> I also have my doubts about this solving my actual issues - the ones that
> caused me to be unable to import in zfs-fuse. But I need to solve this issue
> before I can move forward to figuring out/solving whatever that issue was.

Yeah - my suspicion is that import -F may help here. That is a pool recovery mode, where it rolls back progressive transactions until it finds one that validates correctly. It was only added recently and is probably missing from the fuse version.

-- 
Dan.
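PS: for the label check, the incantation is just (repeat per disk, and per p-device of interest):

  zdb -l /dev/rdsk/c9t0d0p0 | egrep 'name|guid|txg'

A healthy member shows four labels (0-3) with the same pool guid and txg; if only labels 0 and 1 turn up, the tail labels sit outside whatever device you pointed it at.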
Re: [zfs-discuss] naming zfs disks
hello

look at format - volname

FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        fdisk      - run the fdisk program
        repair     - repair a defective sector
        show       - translate a disk address
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        volname    - set 8-character volume name

gea
www.napp-it.org zfs server
Re: [zfs-discuss] Proposed idea for enhancement - damage control
On 02/17/10 02:38 PM, Miles Nordin wrote:
> copies=2 has proven to be mostly useless in practice.

Not true.

Take an ancient PC with a mirrored root pool, no bus error checking and non-ECC memory, that flawlessly passes every known diagnostic (SMC included). Reboot with copies=1 and the same files in /usr/lib will get trashed every time and you'll have to reboot from some other media to repair it. Set copies=2 (copy all of /usr/lib, of course) and it will reboot every time with no problem, albeit with a varying number of repaired checksum errors, almost always on the same set of files. Without copies=2 this hardware would be useless (well, it ran Linux just fine), but with it, it has a new lease of life. There is an ancient CR about this, but AFAIK no one has any idea what the problem is or how to fix it.

IMO it proves that copies=2 can help avoid data loss in the face of flaky buses and perhaps memory. I don't think you should be able to lose data on mirrored drives unless both drives fail simultaneously, but with ZFS you can.

Certainly, on any machine without ECC memory, or buses without ECC (is parity good enough?), my suggestion would be to set copies=2, and I have it set for critical datasets even on machines with ECC on both. Just waiting for the bus that those SAS controllers are on to burp at the wrong moment...

Is one counter-example enough?

Cheers -- Frank
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
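For anyone wanting to reproduce Frank's setup, the property is per-dataset and only affects blocks written after it is set, which is why he re-copies /usr/lib. A minimal sketch (the dataset name is only illustrative):

  # zfs set copies=2 rpool/ROOT/opensolaris
  # zfs get copies rpool/ROOT/opensolaris

Existing files keep their single copy until they are rewritten, so either re-copy the critical directories in place or set the property before the data is loaded.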
Re: [zfs-discuss] Help with corrupted pool
On Wed, Feb 17, 2010 at 15:22, Daniel Carosone wrote:
> On Wed, Feb 17, 2010 at 12:31:27AM -0500, Ethan wrote:
> > And I just realized - yes, labels 2 and 3 are in the wrong place relative to the end of the drive; I did not take into account the overhead taken up by truecrypt when dd'ing the data. The raw drive is 1500301910016 bytes; the truecrypt volume is 1500301647872 bytes. Off by 262144 bytes - I need a slice that is sized like the truecrypt volume.
>
> It shouldn't matter if the slice is larger than the original; this is how autoexpand works. 2 should be near the start (with 1), 3 should be near the logical end (with 4).
>
> Did this resolve the issue? You didn't say, and I have my doubts. I'm not sure this is your problem, but it seems you're on the track to finding the real problem.
>
> In the labels you can see, are the txg's the same for all pool members? If not, you may still need import -F, once all the partitioning gets sorted out.
>
> Also, re-reading what I wrote above, I realised I was being ambiguous in my use of "label". Sometimes I meant the zfs labels that zdb -l prints, and sometimes I meant the vtoc that format uses for slices. In the BSD world we call those labels too, and I didn't realise I was mixing terms. Sorry for any confusion but it seems you figured out what I meant :)
>
> -- Dan.

I have not yet successfully imported. I can see two ways of making progress forward.

One is forcing zpool to attempt to import using slice 2 for each disk rather than slice 8. If this is how autoexpand works, as you say, it seems like it should work fine for this. But I don't know how, or if it is possible to, make it use slice 2.

The other way is to make a slice that is the correct size of the volumes as I had them before (262144 bytes less than the size of the disk). It seems like this should cause zpool to prefer to use this slice over slice 8, as it can find all 4 labels, rather than just labels 0 and 1. I don't know how to go about this either, or if it's possible. I have been starting to read documentation on slices in solaris but haven't had time to get far enough to figure out what I need.

I also have my doubts about this solving my actual issues - the ones that caused me to be unable to import in zfs-fuse. But I need to solve this issue before I can move forward to figuring out/solving whatever that issue was.

txg is the same for every volume.

-Ethan
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
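For what it's worth, the slice-size arithmetic in Ethan's second option works out to whole 512-byte sectors, which makes the target easy to state even before deciding how to enter it in format's partition editor:

  1500301910016 bytes / 512 = 2930277168 sectors   (raw drive)
  1500301647872 bytes / 512 = 2930276656 sectors   (truecrypt volume)
  difference: 262144 bytes  = 512 sectors

So, assuming the dd copy started at the same sector where the whole-disk slice starts (one of the things Dan asks about above), the slice Ethan wants has that same starting offset and is exactly 512 sectors shorter than the whole-disk slice.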
Re: [zfs-discuss] Help with corrupted pool
On Wed, Feb 17, 2010 at 12:31:27AM -0500, Ethan wrote:
> And I just realized - yes, labels 2 and 3 are in the wrong place relative to the end of the drive; I did not take into account the overhead taken up by truecrypt when dd'ing the data. The raw drive is 1500301910016 bytes; the truecrypt volume is 1500301647872 bytes. Off by 262144 bytes - I need a slice that is sized like the truecrypt volume.

It shouldn't matter if the slice is larger than the original; this is how autoexpand works. 2 should be near the start (with 1), 3 should be near the logical end (with 4).

Did this resolve the issue? You didn't say, and I have my doubts. I'm not sure this is your problem, but it seems you're on the track to finding the real problem.

In the labels you can see, are the txg's the same for all pool members? If not, you may still need import -F, once all the partitioning gets sorted out.

Also, re-reading what I wrote above, I realised I was being ambiguous in my use of "label". Sometimes I meant the zfs labels that zdb -l prints, and sometimes I meant the vtoc that format uses for slices. In the BSD world we call those labels too, and I didn't realise I was mixing terms. Sorry for any confusion but it seems you figured out what I meant :)

-- Dan.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
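A quick way to compare the txg across all the members is to pull it straight out of the labels; a rough sketch, with the device names standing in for whatever the disks are actually called:

  # for d in c7t0d0s0 c7t1d0s0 c7t2d0s0 c7t3d0s0; do echo "== $d =="; zdb -l /dev/dsk/$d | grep -w txg | sort -u; done

If every disk reports the same txg in all the labels it can find, a plain import should be possible once the slices line up; if they differ, that points back at import -F.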
[zfs-discuss] naming zfs disks
Is there any way to assign a unique name or ID to a disk that is part of a zpool? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Reading ZFS config for an extended period
> "k" == Khyron writes: k> The RFE is out there. Just like SLOGs, I happen to think it a k> good idea, personally, but that's my personal opinion. If it k> makes dedup more usable, I don't see the harm. slogs and l2arcs, modulo the current longstanding ``cannot import pool with attached missing slog'' bug, are disposeable: You will lose either a little data or no data if the device goes away (once the bug is finally fixed). This makes them less ponderous because these days we are looking for raidz2 or raidz3 amount of redundancy, so in a seperate device that wasn't disposeable we'd need a 3- or 4-way mirror. It also makes their seperateness more seperable since they can go away at any time, so maybe they do deserve to be seperate. The two together make the complexity more bearable. Would an sddt be disposeable, or would it be a critical top-level vdev needed for import? If it's critical, well, that's kind of annoying, because now we need 3-way mirrors of sddt to match the minimum best-practice redundancy of the rest of the pool's redundancy, and my first reaction would be ``can you spread it throughout the normal raidz{,2,3} vdevs at least in backup form?'' once I say a copy should be kept in the main pool even afer it becomes an sddt, well, what's that imply? * In the read case it means cacheing, so it could go in the l2arc. How's DDT different from anything else in the l2arc? * In the write case it means sometimes commiting it quickly without waiting on the main pool so we can release some lock or answer some RPC and continue. Why not write it to the slog? Then if we lose the slog we can do what we always do without the slog and roll back to the last valid txg, losing whatever writes were associated with that lost ddt update. The two cases fit fine with the types of SSD's we're using for each role and the type of special error recovery we have if we lose the device. Why litter a landscape so full of special cases and tricks (like the ``import pool with missing slog'' bug that is taking so long to resolve) with yet another kind of vdev that will take 1 year to discover special cases and a halfdecade to address them? Maybe there is a reason. Are DDT write patterns different than slog write patterns? Is it possible to make a DDT read cache using less ARC for pointers than the l2arc currently uses? Is the DDT particularly hard on the l2arc by having small block sizes? Will the sddt be delivered with a separate offline ``not an fsck!!!'' tool for slowly regenerating it from pool data if it's lost, or maybe after an sddt goes bad the pool can be mounted space-wastingly as in like no dedup is done and deletes do not free space, with an empty DDT, and the sddt regenerated by a scrub? If the performance or recovery behavior is different than what we're working towards with optional-slog and persistent-l2arc then maybe sddt does deserve to be antoher vdev type. soi dunno. On one hand I'm clearly nowhere near informed enough to weigh in on an architectural decision like this and shouldn't even be discussing it, and the same applies to you Khyron, to my view, since our input seems obvious at best and misinformed at worst. 
On the other hand, another major architectural change (the slog) was delivered incomplete in a cripplingly bad and silly, trivial way for, AFAICT, nothing but lack of sufficient sysadmin bitching and moaning, leaving heaps of multi-terabyte naked pools out there for half a decade with fancy triple redundancy that will be totally lost if a single SSD + zpool.cache goes bad. So apparently thinking things through even at this trivial level might have some value to the ones actually doing the work.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposed idea for enhancement - damage control
> "ck" == Christo Kutrovsky writes: ck> I could always put "copies=2" (or more) to my important ck> datasets and take some risk and tolerate such a failure. copies=2 has proven to be mostly useless in practice. If there were a real-world device that tended to randomly flip bits, or randomly replace swaths of LBA's with zeroes, but otherwise behave normally (not return any errors, not slow down retrying reads, not fail to attach), then copies=2 would be really valuable, but so far it seems no such device exists. If you actually explore the errors that really happen I venture there are few to no cases copies=2 would save you. one case where such a device appears to exist but doesn't really, is what I often end up doing for family/friend laptops and external USB drives: wait for drives to start going bad, then rescue them with 'dd conv=noerror,sync', fsck, and hope for ``most of'' the data. copies=2 would help get more out of the rescued drive for some but not all of the times I've done this, but there is not much point: Time Machine or rsync backups, or NFS/iSCSI-booting, or zfs send|zfs recv replication to a backup pool, are smarter. I've never been recovering a stupidly-vulnerable drive like that in a situation where I had ZFS on it, so I'm not sure copies=2 will get used here much either. One particular case of doom: a lot of people want to make two unredundant vdevs and then set 'copies=2' and rely on ZFS's promise to spread the two copies out as much as possible. Then they expect to import the pool with only one of the two vdev's and read ``some but not all'' of the data---``I understand I won't get all of it but I just want ZFS to try it's best and we'll see.'' Maybe you want to do this instead of a mirror so you can have scratch datasets that consume space at 1/2 the rate they would on a mirror. Nope, nice try but won't happen. ZFS is full of all sorts of webbed assertions to ratchet you safely through sane pool states that are regression-testable and supportable so it will refuse to import a pool that isn't vdev-complete, and no negotiation is possible on this. The dream is a FAQ and the answer is a clear ``No'' followed by ``you'd better test with file vdevs next time you have such a dream.'' ck> What are the chances for a very specific drive to fail in 2 ck> way mirror? This may not be what you mean, but in general single device redundancy isn't ideal for two reasons: * speculation (although maybe not operational experience?) that modern drives are so huge that even a good drive will occasionally develop an unreadable spot and still be within its BER spec. so, without redundancy, you cannot read for sure at all, even if all the drives are ``good''. * drives do not immediately turn red and start brrk-brrking when they go bad. In the real world, they develop latent sector errors, which you will not discover and mark the drive bad until you scrub or coincidentally happen to read the file that accumulated the error. It's possible for a second drive to go bad in the interval you're waiting to discover the first. This usually gets retold as ``a drive went bad while I was resilvering! what bad luck. If only I could've resilvered faster to close this window of vulnerability I'd not be in such a terrible situation'' but the retelling's wrong: what's really happening is that resilver implies a scrub, so it uncovers the second bad drive you didn't yet know was bad at the time you discovered the first. 
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposed idea for enhancement - damage control
On Wed, 17 Feb 2010, Marty Scholes wrote:
> Bob, the vast majority of your post I agree with. At the same time, I might disagree with a couple of things.
>
> I don't really care how long a resilver takes (hours, days, months) given a couple things:
> * Sufficient protection exists on the degraded array during rebuild
> ** Put another way, the array is NEVER in danger
> * Rebuild takes a back seat to production demands

Most data loss is due to human error. To me it seems like disks which take a week to resilver introduce more opportunity for human error. The maintenance operation fades from human memory while it is still underway. If an impeccable log book is not kept and understood, then it is up to (potentially) multiple administrators with varying levels of experience to correctly understand and interpret the output of 'zpool status'.

Bob
--
Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposed idea for enhancement - damage control
I can't stop myself; I have to respond. :-)

Richard wrote:
> The ideal pool has one inexpensive, fast, and reliable device :-)

My ideal pool has become one inexpensive, fast and reliable "device" built on whatever I choose.

> I'm not sure how to connect those into the system (USB 3?)

Me neither, but if I had to start guessing about host connections, I would probably think FC.

> but when you build it, let us know how it works out.

While it would be a fun project, a toy like that would certainly exceed my feeble hardware experience and even more feeble budget. At the same time, I could make a compelling argument that this sort of arrangement, stripes of flash, is the future of tier-one storage. We already are seeing SSD devices which internally are stripes of flash. More and more disk farms are taking on the older role of tape.

-- This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposed idea for enhancement - damage control
On Feb 17, 2010, at 9:09 AM, Marty Scholes wrote:
> Bob Friesenhahn wrote:
>> It is unreasonable to spend more than 24 hours to resilver a single drive. It is unreasonable to spend more than 6 days resilvering all of the devices in a RAID group (the 7th day is reserved for the system administrator). It is unreasonable to spend very much time at all on resilvering (using current rotating media) since the resilvering process kills performance.
>
> Bob, the vast majority of your post I agree with. At the same time, I might disagree with a couple of things.
>
> I don't really care how long a resilver takes (hours, days, months) given a couple things:
> * Sufficient protection exists on the degraded array during rebuild
> ** Put another way, the array is NEVER in danger
> * Rebuild takes a back seat to production demands
>
> Since I am on a rant, I suspect there is also room for improvement in the scrub. Why would I rescrub a stripe that was read (and presumably validated) 30 seconds ago by a production application?

Because scrub reads all copies of the data, not just one.

> Wouldn't it make more sense for scrub to "play nice" with production, moving at a leisurely pace and only scrubbing stripes not read in the past X hours/days/weeks/whatever?

Scrubs are done at the lowest priority and are throttled. I suppose there could be a knob to adjust the throttle, but then you get into the problem we had with SVM -- large devices could take weeks to (resilver) scrub.

From a reliability perspective, there is a weak argument for not scrubbing recent writes. And scrubs are done in txg order (oldest first). So perhaps there is some merit to a scrub limit. However, it is not clear to me that this really buys you anything. Scrubs are rare events, so the impact of shaving a few minutes or hours off the scrub time is low.

> I also agree that an ideal pool would be lowering the device capacity and radically increasing the device count.

The ideal pool has one inexpensive, fast, and reliable device :-)

> In my perfect world, I would have a RAID set of 200+ cheap, low-latency, low-capacity flash drives backed by an additional N% parity, e.g. 40-ish flash drives. A setup like this would give massive throughput: 200x each flash drive, amazing IOPS and incredible resiliency. Rebuild times would be low due to lower capacity. One could probably build such a beast in 1U using MicroSDHC cards or some such thing.

I'm not sure how to connect those into the system (USB 3?), but when you build it, let us know how it works out.

-- richard
ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposed idea for enhancement - damage control
Bob Friesenhahn wrote:
> It is unreasonable to spend more than 24 hours to resilver a single drive. It is unreasonable to spend more than 6 days resilvering all of the devices in a RAID group (the 7th day is reserved for the system administrator). It is unreasonable to spend very much time at all on resilvering (using current rotating media) since the resilvering process kills performance.

Bob, the vast majority of your post I agree with. At the same time, I might disagree with a couple of things.

I don't really care how long a resilver takes (hours, days, months) given a couple things:
* Sufficient protection exists on the degraded array during rebuild
** Put another way, the array is NEVER in danger
* Rebuild takes a back seat to production demands

Since I am on a rant, I suspect there is also room for improvement in the scrub. Why would I rescrub a stripe that was read (and presumably validated) 30 seconds ago by a production application? Wouldn't it make more sense for scrub to "play nice" with production, moving at a leisurely pace and only scrubbing stripes not read in the past X hours/days/weeks/whatever?

I also agree that an ideal pool would be lowering the device capacity and radically increasing the device count. In my perfect world, I would have a RAID set of 200+ cheap, low-latency, low-capacity flash drives backed by an additional N% parity, e.g. 40-ish flash drives. A setup like this would give massive throughput: 200x each flash drive, amazing IOPS and incredible resiliency. Rebuild times would be low due to lower capacity. One could probably build such a beast in 1U using MicroSDHC cards or some such thing.

End rant.

Cheers, Marty
-- This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] getting tangled with received mountpoint properties
On Wed, February 17, 2010 03:35, Daniel Carosone wrote: > I'd be happy enough if none of these was mounted, but annoyingly in > this case, the canmount property is not inheritable, so I can't just > set this somewhere near the top and be done. My workaround so far: > # zfs list -t filesystem -o name | egrep ^dpool | xargs -n 1 zfs set > canmount=noauto > but this is annoying and manual and I sometimes forget when a new > filesystem is added and sent from the source host. That's exactly where I've ended up, except it's not manual, it's built into the send / receive command (mine are local USB disks rather than a remote system, so things are just a bit different). I haven't found a better solution yet, but I'm still back on build 111b, so I don't have the new property replication capabilities (or complexities) to deal with. -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposed idea for enhancement - damage control
On Wed, 17 Feb 2010, Daniel Carosone wrote:
> These small numbers just tell you to be more worried about defending against the other stuff.

Let's not forget that the most common cause of data loss is human error!

Bob
--
Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Plan for upgrading a ZFS based SAN
At the moment it's just one pool with a plan to add the 500GB drives... What would be recommended?

-Original Message-
From: Brandon High
Sent: 17 February 2010 01:00
To: Tiernan OToole
Cc: Robert Milkowski ; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Plan for upgrading a ZFS based SAN

On Tue, Feb 16, 2010 at 3:13 PM, Tiernan OToole wrote:
> Cool... Thanks for the advice! But why would it be a good idea to change layout on bigger disks?

On top of the reasons Bob gave, your current layout will be very unbalanced after adding devices. You can't currently add more devices to a raidz vdev or remove a top-level vdev from a pool, so you'll be stuck with 8 drives in a raidz2, 3 drives in a raidz, and any future additions in additional vdevs.

When you say you have 2 pools, do you mean two vdevs in one pool, or actually two pools?

-B
--
Brandon High : bh...@freaks.com Indecision is the key to flexibility.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
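A concrete illustration of Brandon's point, with pool and device names purely made up: adding six new drives to an existing pool does not widen the old raidz2, it just creates another top-level vdev next to it (and zpool may want -f if the new group's redundancy level differs from what is already in the pool):

  # zpool add tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0
  # zpool status tank

zpool status will then show the new raidz2 group as an additional top-level vdev alongside the existing ones, and new writes get striped across all of them, which is where the imbalance comes from.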
Re: [zfs-discuss] Pool import with failed ZIL device now possible ?
On 16/02/2010 23:59, Christo Kutrovsky wrote:
> Robert, that would be pretty cool, especially if it makes it into the 2010.02 release. I hope there are no weird special cases that pop up from this improvement.

I'm pretty sure it won't make 2010.03.

> Regarding the workaround: that's not my experience, unless it behaves differently on ZVOLs and datasets. On ZVOLs it appears the setting kicks in live. I've tested this by turning it off/on and testing with iometer on an exported iSCSI device (iscsitgtd, not comstar).

I haven't looked at zvol's code handling zil_disable, but with datasets I'm sure I'm right.

--
Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
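For anyone following along, the tunable being discussed is the old global zil_disable switch; the two usual ways it was set at the time were roughly as follows (treat the exact syntax as from memory rather than gospel):

  # echo zil_disable/W0t1 | mdb -kw      (flip it live in the running kernel)
  set zfs:zil_disable = 1                (in /etc/system, takes effect at boot)

Robert's point appears to be that for ordinary filesystems the value is only consulted when the dataset is mounted, so a remount or export/import is needed before it actually takes effect, while zvols seem to pick it up live, which matches what Christo observed over iSCSI.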
Re: [zfs-discuss] Reading ZFS config for an extended period
Ugh! If you received a direct response from me instead of via the list, apologies for that.

Rob: I'm just reporting the news. The RFE is out there. Just like SLOGs, I happen to think it a good idea, personally, but that's my personal opinion. If it makes dedup more usable, I don't see the harm.

Taemun: The issue, as I understand it, is not "use-lots-of-cpu" or "just dies from paging". I believe it is more to do with all of the small, random reads/writes in updating the DDT. Remember, the DDT is stored within the pool, just as the ZIL is if you don't have a SLOG. (The S in SLOG standing for "separate".) So all the DDT updates are in competition for I/O with the actual data deletion. If the DDT could be stored as a separate VDEV already, I'm sure a way would have been hacked together by someone (likely someone on this list). Hence the need for the RFE to create this functionality where it does not currently exist. The DDT is separate from the ARC or L2ARC.

Here's the bug: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913566

If I'm incorrect, someone please let me know.

Markus: Yes, the issue would appear to be dataset size vs. RAM size. Sounds like an area ripe for testing, much like RAID Z3 performance.

Cheers all!

On Tue, Feb 16, 2010 at 00:20, taemun wrote:
> The system in question has 8GB of ram. It never paged during the import (unless I was asleep at that point, but anyway).
>
> It ran for 52 hours, then started doing 47% kernel cpu usage. At this stage, dtrace stopped responding, and so iopattern died, as did iostat. It was also increasing ram usage rapidly (15mb / minute). After an hour of that, the cpu went up to 76%. An hour later, CPU usage stopped. Hard drives were churning throughout all of this (albeit at a rate that looks like each vdev is being controlled by a single-threaded operation).
>
> I'm guessing that if you don't have enough ram, it gets stuck on the use-lots-of-cpu phase, and just dies from too much paging. Of course, I have absolutely nothing to back that up.
>
> Personally, I think that if L2ARC devices were persistent, we already have the mechanism in place for storing the DDT as a "separate vdev". The problem is, there is nothing you can run at boot time to populate the L2ARC, so the dedup writes are ridiculously slow until the cache is warm. If the cache stayed warm, or there was an option to forcibly warm up the cache, this could be somewhat alleviated.
>
> Cheers

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
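For anyone trying to gauge whether the DDT is the bottleneck on their own pool, zdb can report the table's size directly; a rough sketch, with the pool name as an example only:

  # zdb -DD tank

The summary lines report how many entries the table holds and their on-disk and in-core sizes. Multiplying the entry count by a few hundred bytes per entry (a ballpark figure, not a spec) gives a feel for whether the table can stay resident in RAM, which is the dataset-size-vs-RAM-size question Markus raised.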
[zfs-discuss] getting tangled with received mountpoint properties
I have a machine whose purpose is to be a backup server. It has a pool for holding backups from other machines, using zfs send|recv. Call the pool dpool. Inside there are datasets for hostname/poolname, for each of the received pools. All hosts have an rpool, some have other pools as well. So the upper level datasets look similar to:

  dpool/foo/rpool
  dpool/bar/rpool
  dpool/bar/lake
  dpool/bar/tank
  dpool/baz/rpool
  dpool/baz/pond

and so on. Under each of those are the datasets from the respective pools, received using zfs recv -d.

The trouble comes from the received properties, mountpoint in particular. Every one of those rpools has a dataset rpool/export with mountpoint=/export, and home and so forth under there. There was an issue until recently that meant these mountpoints got some strangely-doubled prefix strings added, so they were mounted in odd places. That was fixed, and now with an upgrade to b132, I have many filesystems all with the same mountpoint property. At boot time, one of these wins, the rest fail to mount, the filesystem/local svc fails, and not much else starts from there.

With the new distinction between received and local properties, I was hoping for a solution. The zfs manpage says that I can use inherit to mask a received property with an inherited one. This flat-out seems not to work, at least for mountpoint.

I'd be happy enough if none of these was mounted, but annoyingly in this case, the canmount property is not inheritable, so I can't just set this somewhere near the top and be done. My workaround so far:

  # zfs list -t filesystem -o name | egrep ^dpool | xargs -n 1 zfs set canmount=noauto

but this is annoying and manual and I sometimes forget when a new filesystem is added and sent from the source host.

I can't use the altroot pool property; it's not persistent and anyway would just prefix the same thing to all of the datasets, for them to collide under there. It would at least ensure I get the backup machine's own /export/home/dan and /var and some others, though :)

The same issue applies to other properties as well, though they don't cause boot failures. For example, I might set copies=2 on some dataset on a single-disk laptop, but there's no need for that on the backup server. I still want that to be remembered for the restore case, but not to be applied on the backup server.

How else can I solve this problem? What else have others done on their backup servers?

Separately, if I override recvd properties, in the restore use case when I want to send the dataset back to the original host, is there some way I can tell zfs send to look at the received properties rather than the local override?

I was hoping the new property source stuff would help with all this, but I'm not really clear on the whole process; it doesn't quite seem to be fully fleshed out yet. I can imagine something like a property-sed that you could insert between the send and recv pipeline as being helpful - but a properly thought-out framework would be much better.

-- Dan.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
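Two things that may take some of the manual pain out of this, offered as sketches rather than tested recipes (dataset names follow the dpool layout above, and the options assume a build recent enough to have them):

  # zfs receive -u -d dpool/foo          (with the send stream on stdin, as usual)
  # zfs list -rH -o name dpool | xargs -n 1 zfs set canmount=noauto

receive -u at least stops a newly received filesystem from being mounted immediately, and re-running the canmount pass after every receive can be folded into whatever script drives the replication, so new filesystems aren't forgotten. For the restore case, builds with the received-property support should also let you inspect what would be sent back, independently of any local override, with something like:

  # zfs get -s received -o name,property,value all dpool/foo/rpool/export

since the received value is tracked separately from the local one.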
Re: [zfs-discuss] Proposed idea for enhancement - damage control
Dan, "loose" was a typo. I meant "lose". Interesting how a typo (write error) can cause a lot of confusion on what exactly I mean :) Resulting in corrupted interpretation. Note that my idea/proposal is targeted for a growing number of home users. To those, value for money usually is a much more significant factor than others. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss