Re: [zfs-discuss] Wishlist items
On 26/06/2007, at 12:08 PM, [EMAIL PROTECTED] wrote:

> I've been saving up a few wishlist items for zfs. Time to share.
>
> 1. A verbose (-v) option to the zfs command line. In particular, zfs sometimes takes a while to return from zfs snapshot -r tank/[EMAIL PROTECTED] in the case where there are a great many iSCSI-shared volumes underneath. A little progress feedback would go a long way. In general I feel the zfs tools lack sufficient feedback and/or logging of actions, and this would be a great start.

Since IIRC snapshot -r is supposed to be atomic (one TXG), I'm not sure that progress reports would be meaningful. Have you seen zpool history?

> 2. LUN management and/or general iscsi integration enhancement. Some of these iscsi volumes I'd like to be under the same target but with different LUNs. A means for mapping that would be excellent, as would a means to specify the IQN explicitly, and the set of permitted initiators.
>
> 3. zfs rollback on clones. It should be possible to roll back a clone to the origin snapshot, yes? Right now the tools won't allow it. I know I can hack in a race-sensitive snapshot of the new volume immediately after cloning, but I already have many hundreds of entities and I'm trying not to proliferate them.

Yes, since rollback only takes the snapshot as an argument, there seems to be no way to roll back a clone to the fork snapshot. You could, of course, just blow away the clone and make a new one from the same snapshot (a sketch follows below).

> Similarly, the ability to do zfs send -i [clone origin snapshot1] snapshot2 in order to efficiently transmit/backup clones would be terrific.

It seems that a way to use [EMAIL PROTECTED] as an alias for [EMAIL PROTECTED] would solve both of these problems, at least at the user-interface level.
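A minimal sketch of the blow-away-and-re-clone workaround, with hypothetical dataset names (tank/vol@fork is the origin snapshot, tank/clone is the clone to be "rolled back"):

# zfs destroy tank/clone
# zfs clone tank/vol@fork tank/clone

This discards everything written to the clone since it was created, which is exactly what a rollback to the origin snapshot would do; the catch is that any properties or shares configured on the old clone would need to be recreated by hand.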
[zfs-discuss] Re: Re: ZFS - SAN and Raid
Victor Engle wrote:
> Roshan, as far as I know there is no problem at all with using SAN storage with ZFS, and it does look like you were having an underlying problem with either powerpath or the array.

Correct. A write failed.

> The best practices guide on opensolaris does recommend replicated pools even if your backend storage is redundant. There are at least two good reasons for that: ZFS needs a replica for the self-healing feature to work, and there is no fsck-like tool for ZFS, so it is a good idea to make sure self-healing can work.

Yes, currently ZFS on Solaris will panic if a non-redundant write fails. This is known and being worked on, but there really isn't a good solution if a write fails, unless you have some ZFS-level redundancy.

Why not? If O_DSYNC applies, a write() can still fail with EIO, right? And if O_DSYNC does not apply, an app could not assume that the written data was on stable storage anyway. Or the write() could just block until the problem is corrected (if correctable) or the system is rebooted. In any case, IMO there ought to be some sort of consistent behavior possible short of a panic. I've seen UFS-based systems stay up even with their disks incommunicado for a while, although they were hardly useful in that state except for activity that only read already-cached pages.
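For reference, ZFS-level redundancy on SAN storage just means building the pool from mirrored (or raidz) LUNs rather than a single LUN; a minimal sketch, with hypothetical device names for two LUNs:

# zpool create tank mirror c2t0d0 c3t0d0
# zpool status tank

With two copies of every block, a bad block on one side can be detected by checksum and rewritten from the other side (self-healing), instead of leaving the pool with no good copy when a write to a single device fails.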
[zfs-discuss] Re: Re[2]: Re: Re: Re: Snapshots impact on performance
Same problem here (snv_60). Robert, did you find any solutions?

gino
[zfs-discuss] Re: ZFS usb keys
> Shouldn't S10u3 just see the newer on-disk format and report that fact, rather than complain it is corrupt?

Yep, I just tried it, and it refuses to zpool import the newer pool, telling me about the incompatible version. So I guess the pool format isn't the correct explanation for Dick Davies' (number9) problem.

On a S-x86 box running snv_68, ZFS version 7:

# mkfile 256m /home/leo.nobackup/tmp/zpool_test.vdev
# zpool create test_pool /home/leo.nobackup/tmp/zpool_test.vdev
# zpool export test_pool

On a S-sparc box running snv_61, ZFS version 3 (I get the same error on S-x86 running S10U2, ZFS version 2):

# zpool import -d /home/leo.nobackup/tmp/
  pool: test_pool
    id: 6231880247307261822
 state: FAULTED
status: The pool is formatted using an incompatible version.
action: The pool cannot be imported. Access the pool on a system running newer software, or recreate the pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-A5
config:

        test_pool                                  UNAVAIL  newer version
          /home/leo.nobackup/tmp//zpool_test.vdev  ONLINE
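A way to spot this mismatch before moving a pool between systems is to compare ZFS versions on both boxes first (a sketch; output formats vary by build):

# zpool upgrade -v     # shows the highest pool version this software supports
# zpool upgrade        # also lists any pools formatted with an older version

If the pool was created by software with a higher version than the destination supports, the import will fail exactly as shown above; pools created at or below the destination's version import fine.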
[zfs-discuss] Re: ZFS usb keys
It would be really handy if whoever is responsible for the message at http://www.sun.com/msg/ZFS-8000-A5 could add data about which zpool versions are supported at specific OS/patch releases. The current message doesn't help the user figure out how to accomplish their implied task, which is to import the pool on a different system. Adding the version number of the pool that couldn't be imported to the zpool import error message would be nice too.

> > Shouldn't S10u3 just see the newer on-disk format and report that fact, rather than complain it is corrupt?
>
> Yep, I just tried it, and it refuses to zpool import the newer pool, telling me about the incompatible version. So I guess the pool format isn't the correct explanation for Dick Davies' (number9) problem. [...]
Re: [zfs-discuss] Re: Re[2]: Re: Re: Re: Snapshots impact on performance
Gino wrote:
> Same problem here (snv_60). Robert, did you find any solutions?

A couple of weeks ago I put together an implementation of space maps which completely eliminates loops and recursion from the space map alloc operation, and which makes it easy to implement different allocation strategies (I put together three more). It looks like it works for me on a thumper and on my notebook with ZFS root, though I have almost no time to test it further these days due to year end. I haven't done a SPARC build yet, and I do not have a test case to test against. Also, it comes at a price: I have to spend some more time (logarithmic, though) during all other operations on space maps, and it is not optimized yet.

victor
Re: [zfs-discuss] Re: ZFS usb keys
On Wed, 27 Jun 2007, Jürgen Keil wrote:

> Yep, I just tried it, and it refuses to zpool import the newer pool, telling me about the incompatible version. So I guess the pool format isn't the correct explanation for Dick Davies' (number9) problem.

Have you tried creating the pool on b61 and importing it into b68?

Regards, markm
Re: [zfs-discuss] New german white paper on ZFS
On Tue, Jun 19, 2007 at 05:19:05PM +0200, Constantin Gonzalez wrote:
> Hi,
> http://blogs.sun.com/constantin/entry/new_zfs_white_paper_in

Excellent!!! I think it would be a pretty good idea to put the links for the paper and slides on the ZFS Documentation page, aka http://www.opensolaris.org/os/community/zfs/docs/

Regards, jel.
--
Otto-von-Guericke University    http://www.cs.uni-magdeburg.de/
Department of Computer Science  Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany        Tel: +49 391 67 12768
Re: [zfs-discuss] Suggestions on 30 drive configuration?
Richard Elling wrote:
>> Rob Logan wrote:
>>> an array of 30 drives in a RaidZ2 configuration with two hot spares ... I don't want to mirror 15 drives to 15 drives
>>
>> ok, so space over speed... and willing to toss somewhere between 4 and 15 drives for protection. raidz splits the (up to 128k) write/read recordsize across each element of the raidz set (i.e. all drives must be touched, and all must finish before the block request is complete). So with a 9-disk raidz1 set (8 data + 1 parity), that's 16k per disk for a full 128k write, or, for a smaller 4k block, a single 512b sector per disk. On a 26+2 raidz2 set, that 4k block would still touch only 10 disks (8 data + 2 parity), leaving the other 18 disks unneeded but allocated.
>
> It is not so easy to predict. ZFS will coalesce writes. A single transaction group may have many different writes in it. Also, raidz[12] is dynamic, and will use what it needs, unlike separate volume managers, which do not have any understanding of the context of the data.

There is a good slide which illustrates how stripe width is selected dynamically in RAID-Z. Please see slide 13 in this slide deck: http://www.snia.org/events/past/sdc2006/zfs_File_Systems-bonwick-moore.pdf Yes, some space may be wasted (marked by X on the slide), but there are guidelines for the number of devices in a RAID-Z(2) vdev which allow this waste to be avoided. Btw, I believe there's no link to this presentation on opensolaris.org, unfortunately...

victor

>> so perhaps three sets of 8+2 would let three blocks be read/written at once, with a total of 6 disks for protection. but for twice the speed, six sets of 4+1 would be the same size (same number of disks for protection), but isn't quite as safe for its 2x speed.
>
> Yes, you need to follow your priorities; there are just too many options otherwise.
> -- richard
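For concreteness, here is a sketch of the three-sets-of-8+2 layout suggested above, with 30 hypothetical device names (c0, c1 and c2 are assumed controllers):

# zpool create tank \
    raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c0t8d0 c0t9d0 \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
    raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c2t8d0 c2t9d0

Each 10-disk raidz2 vdev survives the loss of any two of its disks, and ZFS stripes writes dynamically across the three vdevs. Hot spares, if extra drives remain, could be added with zpool add tank spare <disk>.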
Re: [zfs-discuss] Re: ZFS usb keys
Thanks to everyone for the sanity check - I think it's a platform issue, but not an endian one. The stick was originally DOS-formatted, and the zpool was built on the first fdisk partition. So SPARCs aren't seeing it, but the x86/x64 boxes are.

-- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/
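If the stick can be dedicated to ZFS, one way to avoid the fdisk issue is to give the whole device to the pool, so ZFS writes an EFI label that both SPARC and x86 can read. A sketch, with a hypothetical device name:

# zpool create stick c2t0d0
# zpool export stick

The exported stick should then be visible to zpool import on either architecture, subject to the pool-version caveats discussed earlier in this thread.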
Re: [zfs-discuss] Re: ZFS usb keys
I had a similar situation between x86 and SPARC involving the version number. When I created the pool on the LOWER-rev machine, it was seen by the HIGHER-rev machine. This was a USB HDD, not a stick. I can now move the drive between boxes.

HTH, Mike

Dick Davies wrote:
> Thanks to everyone for the sanity check - I think it's a platform issue, but not an endian one. The stick was originally DOS-formatted, and the zpool was built on the first fdisk partition. So SPARCs aren't seeing it, but the x86/x64 boxes are.

--
Michael Lee, Area System Support Engineer, Sun Microsystems, Inc.
Phone x40782 / 866 877 8350, Email [EMAIL PROTECTED]
http://www.sun.com/solaris
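This works because newer ZFS software can always read pools formatted at an older version. A sketch of the round trip, with hypothetical pool and device names:

(on the older, lower-version box)
# zpool create usbpool c4t0d0
# zpool export usbpool

(on the newer box)
# zpool import usbpool
# zpool upgrade usbpool

The final zpool upgrade is optional and one-way: after it, the older box will no longer be able to import the pool.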
Re: [zfs-discuss] New german white paper on ZFS
Jens,

Someone already added it to the ZFS links page, here: http://opensolaris.org/os/community/zfs/links/ I just added a link to the links page from the ZFS docs page, so it is easier to find.

Thanks, Cindy

Jens Elkner wrote:
> On Tue, Jun 19, 2007 at 05:19:05PM +0200, Constantin Gonzalez wrote:
>> Hi,
>> http://blogs.sun.com/constantin/entry/new_zfs_white_paper_in
>
> Excellent!!! I think it would be a pretty good idea to put the links for the paper and slides on the ZFS Documentation page, aka http://www.opensolaris.org/os/community/zfs/docs/
>
> Regards, jel.
Re: [zfs-discuss] Re: Suggestions on 30 drive configuration?
I have 8 SATA ports on the motherboard, 4 PCI cards with 4 SATA each, one PCIe 4x SATA card with two, and one PCIe 1x with two. The operating system itself will be on a hard drive attached to one ATA 100 connector. Kind of like a poor man's data centre, except not that cheap... it is still estimated to come out at around six thousand dollars, which in retrospect, for that amount of storage these days, is actually relatively good.

I've weighed my options and I am thinking that 3 raidz2 sets is the best balance of data safety to free space. Also, in case any of you are wondering why I would need the space, most of it will be HDV footage and render files. Thank you, everyone who contributed here; you have been of great assistance.

On 6/25/07, Bryan Wagoner [EMAIL PROTECTED] wrote:
> What is the controller setup going to look like for the 30 drives? Is it going to be fibre channel, SAS, etc., and what will be the controller-to-disk ratio?
>
> ~Bryan
Re: [zfs-discuss] Re: zfs and 2530 jbod
On June 26, 2007 2:13:54 PM -0700, Joel Miller [EMAIL PROTECTED] wrote:
> The 2500 series engineering team is talking with the ZFS folks to understand the various aspects of delivering a complete solution. (There is a lot more to it than it seems...)

Great news, you made my day! Any ETA?

-frank
Re: [zfs-discuss] Re: ZFS usb keys
William D. Hathaway wrote:
> It would be really handy if whoever is responsible for the message at http://www.sun.com/msg/ZFS-8000-A5 could add data about which zpool versions are supported at specific OS/patch releases.

Did you look at http://www.opensolaris.org/os/community/zfs/version/N, where 'N' is the version number?

--matt

$ zpool upgrade -v
This system is currently running ZFS pool version 8.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   pool properties
 7   Separate intent log devices
 8   Delegated administration

For more information on a particular version, including supported releases, see:

http://www.opensolaris.org/os/community/zfs/version/N

Where 'N' is the version number.
[zfs-discuss] Re: Drive Failure w/o Redundancy
Jef Pearlman wrote:
>> Absent that, I was considering using zfs and just having a single pool. My main question is this: what is the failure mode of zfs if one of those drives either fails completely or has errors? Do I permanently lose access to the entire pool? Can I attempt to read other data? Can I zfs replace the bad drive and get some level of data recovery? Otherwise, by pooling drives am I simply increasing the probability of a catastrophic data loss? I apologize if this is addressed elsewhere -- I've read a bunch about zfs, but not come across this particular answer.
>
> We generally recommend a single pool, as long as the use case permits. But I think you are confused about what a zpool is. I suggest you look at the examples or docs. A good overview is the slide show: http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf

Perhaps I'm not asking my question clearly. I've already experimented a fair amount with zfs, including creating and destroying a number of pools with and without redundancy, replacing vdevs, etc. Maybe asking by example will clarify what I'm looking for or where I've missed the boat. The key is that I want a grow-as-you-go heterogeneous set of disks in my pool.

Let's say I start with a 40g drive and a 60g drive. I create a non-redundant pool (which will be 100g). At some later point, I run across an unused 30g drive, which I add to the pool. Now my pool is 130g. At some point after that, the 40g drive fails, either by producing read errors or by failing to spin up at all. What happens to my pool? Can I mount and access it at all (for the data not on or striped across the 40g drive)? Can I zfs replace the 40g drive with another drive and have it attempt to copy as much data over as it can? Or am I just out of luck? zfs seems like a great way to use old/underutilized drives to expand capacity, but sooner or later one of those drives will fail, and if it takes out the whole pool (which it might reasonably do), then it doesn't work out in the end.

As a side-question, does anyone have a suggestion for an intelligent way to approach this goal? This is not mission-critical data, but I'd prefer not to make data loss _more_ probable. Perhaps some volume manager (like LVM on Linux) has appropriate features?

> ZFS: a mirrored pool will be the most performant and easiest to manage, with better RAS than a raidz pool.

The problem I've come across with using mirror or raidz for this setup is that (as far as I know) you can't add disks to mirror/raidz groups, and if you just add the disk to the pool, you end up in the same situation as above (with more space but no redundancy).

Thanks for your help.

-Jef
Re: [zfs-discuss] Re: Drive Failure w/o Redundancy
> Let's say I start with a 40g drive and a 60g drive. I create a non-redundant pool (which will be 100g). [...] At some point after that, the 40g drive fails, either by producing read errors or by failing to spin up at all. What happens to my pool?

Since you have created a non-redundant pool (or more specifically, a pool with non-redundant members), the pool will fail.

> The problem I've come across with using mirror or raidz for this setup is that (as far as I know) you can't add disks to mirror/raidz groups, and if you just add the disk to the pool, you end up in the same situation as above (with more space but no redundancy).

You can't add to an existing mirror, but you can add new mirror (or raidz) vdevs to the pool (see the sketch below). If so, there's no loss of redundancy.

--
Darren Dunham [EMAIL PROTECTED]
Senior Technical Consultant  TAOS  http://www.taos.com/
Got some Dr Pepper?  San Francisco, CA bay area
This line left intentionally blank to confuse you.
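A minimal sketch of growing a pool one mirrored vdev at a time, with hypothetical device names:

# zpool create tank mirror c1t0d0 c1t1d0
# zpool add tank mirror c2t0d0 c2t1d0
# zpool list tank

Capacity grows with each added mirror, writes stripe across both vdevs, and every block still lives on two disks, so a single drive failure loses nothing.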
[zfs-discuss] Re: [zfs-code] Space allocation failure
Hi,

In brief, what I am trying to do is to use libzpool to access a zpool, like ztest does.

Matthew Ahrens wrote:
> Manoj Joseph wrote:
>> Hi, replying to myself again. :) I see this problem only if I attempt to use a zpool that already exists. If I create one (using files instead of devices, don't know if it matters) like ztest does, it works like a charm.
>
> You should probably be posting on zfs-discuss.

Switching from zfs-code to zfs-discuss.

> The pool you're trying to access is damaged. It would appear that one of the devices can not be written to.

No, AFAIK the pool is not damaged. But yes, it looks like the device can't be written to by the userland zfs.

bash-3.00# zpool import test
bash-3.00# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   85K   1.95G  24.5K  /test
bash-3.00# ./udmu test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        test      ONLINE       0     0     0
          c2t0d0  ONLINE       0     0     0

errors: No known data errors
Export the pool.
cannot open 'test': no such pool
Import the pool.
error: ZFS: I/O failure (write on unknown off 0: zio 8265d80 [L0 unallocated] 4000L/400P DVA[0]=0:1000:400 DVA[1]=0:18001000:400 fletcher4 lzjb LE contiguous birth=245 fill=0 cksum=6bba8d3a44:2cfa96558ac7:c732e55bea858:2b86470f6a83373): error 28
Abort (core dumped)
bash-3.00# zpool import test
bash-3.00# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        test      ONLINE       0     0     0
          c2t0d0  ONLINE       0     0     0

errors: No known data errors
bash-3.00# touch /test/z
bash-3.00# sync
bash-3.00# ls -l /test/z
-rw-r--r--   1 root     root           0 Jun 28 04:18 /test/z
bash-3.00#

The userland zfs's export succeeds. But doing a system("zpool status test") right after the spa_export() succeeds shows that the 'kernel zfs' still thinks the pool is imported. I guess that makes sense; nothing has been told to the 'kernel zfs' about the export. But I still do not understand why the 'userland zfs' can't write to the pool.

Regards, Manoj

PS: The code I have been tinkering with is attached.

> --matt

> Any clue as to why this is so would be appreciated. Cheers, Manoj
>
> Manoj Joseph wrote:
>> Hi, I tried adding an spa_export(); spa_import() to the code snippet. I get a similar crash while importing:
>> I/O failure (write on unknown off 0: zio 822ed40 [L0 unallocated] 4000L/400P DVA[0]=0:1000:400 DVA[1]=0:18001000:400 fletcher4 lzjb LE contiguous birth=4116 fill=0 cksum=69c3a4acfc:2c42fdcaced5:c5231ffcb2285:2b8c1a5f2cb2bfd): error 28
>> Abort (core dumped)
>> I thought ztest could use an existing pool. Is that assumption wrong? These are the stacks of interest.
d11d78b9 __lwp_park (81c3e0c, 81c3d70, 0) + 19
d11d1ad2 cond_wait_queue (81c3e0c, 81c3d70, 0, 0) + 3e
d11d1fbd _cond_wait (81c3e0c, 81c3d70) + 69
d11d1ffb cond_wait (81c3e0c, 81c3d70) + 24
d131e4d2 cv_wait (81c3e0c, 81c3d6c) + 5e
d12fe2dd txg_wait_synced (81c3cc0, 1014, 0) + 179
d12f9080 spa_config_update (819dac0, 0) + c4
d12f467a spa_import (8047657, 8181f88, 0) + 256
080510c6 main (2, 804749c, 80474a8) + b2
08050f22 _start (2, 8047650, 8047657, 0, 804765c, 8047678) + 7a

d131ed79 vpanic (d1341dbc, ca5cd248) + 51
d131ed9f panic (d1341dbc, d135a384, d135a724, d133a630, 0, 0) + 1f
d131921d zio_done (822ed40) + 455
d131c15d zio_next_stage (822ed40) + 161
d1318b92 zio_wait_for_children (822ed40, 11, 822ef30) + 6a
d1318c88 zio_wait_children_done (822ed40) + 18
d131c15d zio_next_stage (822ed40) + 161
d131ba83 zio_vdev_io_assess (822ed40) + 183
d131c15d zio_next_stage (822ed40) + 161
d1307011 vdev_mirror_io_done (822ed40) + 421
d131b8a2 zio_vdev_io_done (822ed40) + 36
d131c15d zio_next_stage (822ed40) + 161
d1318b92 zio_wait_for_children (822ed40, 11, 822ef30) + 6a
d1318c88 zio_wait_children_done (822ed40) + 18
d1306be6 vdev_mirror_io_start (822ed40) + 1d2
d131b862 zio_vdev_io_start (822ed40) + 34e
d131c313 zio_next_stage_async (822ed40) + 1ab
d131bb47 zio_vdev_io_assess (822ed40) + 247
d131c15d zio_next_stage (822ed40) + 161
d1307011 vdev_mirror_io_done (822ed40) + 421
d131b8a2 zio_vdev_io_done (822ed40) + 36
d131c15d zio_next_stage (822ed40) + 161
d1318b92 zio_wait_for_children (822ed40, 11, 822ef30) + 6a
d1318c88 zio_wait_children_done (822ed40) + 18
d1306be6 vdev_mirror_io_start (822ed40) + 1d2
d131b862 zio_vdev_io_start (822ed40) + 34e
d131c15d zio_next_stage (822ed40) + 161
d1318dc1 zio_ready (822ed40) + 131
d131c15d zio_next_stage (822ed40) + 161
d131b41b zio_dva_allocate (822ed40) + 343
d131c15d zio_next_stage (822ed40) + 161
d131bdcb zio_checksum_generate (822ed40) + 123
d131c15d zio_next_stage (822ed40) + 161
d1319873 zio_write_compress (822ed40) + 4af
d131c15d zio_next_stage (822ed40) + 161
d1318b92 zio_wait_for_children (822ed40, 1, 822ef28) + 6a
d1318c68
Re: [zfs-discuss] Re: Drive Failure w/o Redundancy
Darren Dunham wrote:
>> The problem I've come across with using mirror or raidz for this setup is that (as far as I know) you can't add disks to mirror/raidz groups, and if you just add the disk to the pool, you end up in the same situation as above (with more space but no redundancy).
>
> You can't add to an existing mirror, but you can add new mirror (or raidz) vdevs to the pool. If so, there's no loss of redundancy.

Maybe I'm missing some context, but you can add to an existing mirror - see zpool attach.

Neil.
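For example, a sketch of widening an existing two-way mirror with zpool attach (hypothetical device names; c1t0d0 is one of the current sides):

# zpool attach tank c1t0d0 c1t2d0

After the resilver completes, the vdev is a three-way mirror. Note this adds redundancy, not capacity, which is the distinction drawn in the follow-up below.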
[zfs-discuss] ReiserFS4 like metadata/search
The only thing I haven't found in zfs yet is metadata/search info. The previous 'next best thing' in filesystems was of course ReiserFS (4). Reiser3 was quite a nice thing, fast, journaled and all that, but Reiser4 promised to bring all those things that we see emerging now, like cross-FS search, where any document, audio recording, etc. could be instantly searched. True, there is Google desktop search, trackerd and whatnot, but those are 'afterthoughts', not supported by the underlying FS. So does ZFS support features like metadata and such? Or is that for zfs2? :)

oliver
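The closest current analogue may be extended attributes: on Solaris, ZFS files can carry a hidden per-file attribute namespace whose entries are themselves full files. A sketch using the runat(1) command, with hypothetical file and attribute names:

# touch song.flac
# echo "Miles Davis" > /tmp/artist
# runat song.flac cp /tmp/artist artist   # store the attribute alongside the file
# runat song.flac ls -l                   # list the file's attributes

Indexing and querying those attributes, though, is left to userland tools; there is no FS-level search in ZFS today.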
Re: [zfs-discuss] Re: Drive Failure w/o Redundancy
> Darren Dunham wrote:
>>> The problem I've come across with using mirror or raidz for this setup is that (as far as I know) you can't add disks to mirror/raidz groups, and if you just add the disk to the pool, you end up in the same situation as above (with more space but no redundancy).
>>
>> You can't add to an existing mirror, but you can add new mirror (or raidz) vdevs to the pool. If so, there's no loss of redundancy.
>
> Maybe I'm missing some context, but you can add to an existing mirror - see zpool attach.

It depends on what you mean by "add". :-) The original message was about increasing storage allocation. You can add redundancy to an existing mirror with attach, but you cannot increase the allocatable storage.

--
Darren Dunham [EMAIL PROTECTED]
Senior Technical Consultant  TAOS  http://www.taos.com/
Got some Dr Pepper?  San Francisco, CA bay area
This line left intentionally blank to confuse you.
Re: [zfs-discuss] Re: Drive Failure w/o Redundancy
On Wed, 2007-06-27 at 14:50 -0700, Darren Dunham wrote:
> It depends on what you mean by "add". :-) The original message was about increasing storage allocation. You can add redundancy to an existing mirror with attach, but you cannot increase the allocatable storage.

With mirrors there is currently more flexibility than with raidz[2]. You can increase the allocatable storage size by replacing each disk in the mirror with a larger one (assuming you wait for a resync ;-P ). Thus, the _safe_ way to increase a mirrored vdev's size is:

Disk A: 100GB
Disk B: 100GB
Disk C: 250GB
Disk D: 250GB

zpool create tank mirror A B
(yank out A, put in C)
(wait for resync)
(yank out B, put in D)
(wait for resync)

and voila! tank goes from 100GB to 250GB of space.

I believe this should also work if LUNs are used instead of actual disks, but I don't believe that resizing a LUN currently in a mirror will work (please, correct me on this). So, for a SAN-backed ZFS mirror, it would be (assuming A = B < C, and after resizing A, A = C > B):

zpool create tank mirror A B
zpool attach tank A C    (where C is a new LUN of the new size desired)
(wait for sync of C)
zpool detach tank A
(unmap LUN A from the host, resize A to be the same as C, then map it back)
zpool attach tank C A
(wait for sync of A)
zpool detach tank B

I believe that will now result in a mirror of the full size of C, not of B. I'd be interested to know if you could do this:

zpool create tank mirror A B
(resize LUNs A and B to the new size)

without requiring a system reboot after resizing A and B (that is, whether a reboot would be needed to update the new LUN size on the host).

--
Erik Trimble, Java System Support, Mailstop: usca14-102, Phone: x17195
Santa Clara, CA, Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Re: Drive Failure w/o Redundancy
Jef Pearlman wrote:
> Perhaps I'm not asking my question clearly. [...] The key is that I want a grow-as-you-go heterogeneous set of disks in my pool.

The short answer:

zpool add     -- add a top-level vdev as a dynamic stripe column
                 + available space is increased
zpool attach  -- add a mirror to an existing vdev
                 + only works when the new mirror is the same size or larger than the existing vdev
                 + available space is unchanged
                 + redundancy (RAS) is increased
zpool detach  -- remove a mirror from an existing vdev
                 + available space increases if the removed mirror is smaller than the vdev
                 + redundancy (RAS) is decreased
zpool replace -- functionally equivalent to an attach followed by a detach

> Let's say I start with a 40g drive and a 60g drive. I create a non-redundant pool (which will be 100g). [...] Can I zfs replace the 40g drive with another drive and have it attempt to copy as much data over as it can? Or am I just out of luck?

For non-redundant zpools, a device failure *may* cause the zpool to be unavailable. The actual availability depends on the nature of the failure. A more common scenario might be to add a 400 GByte drive, which you can use to replace the older drives, or keep online for redundancy.

The zfs copies feature is a little bit harder to grok. It is difficult to predict how the system will be affected by copies=2 in your scenario above, because it depends on how the space is allocated (a sketch of enabling it follows below). For more info, see my notes at: http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection

-- richard
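The copies feature mentioned above is a per-dataset property; a minimal sketch with a hypothetical dataset name:

# zfs set copies=2 tank/home
# zfs get copies tank/home

ZFS then keeps two copies of each block written to tank/home after the property is set, placing them on different vdevs where possible. It guards against localized media errors, not against losing a whole single-disk pool.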
Re: [zfs-discuss] Re: Drive Failure w/o Redundancy
On Wed, 2007-06-27 at 12:03 -0700, Jef Pearlman wrote:
>> Absent that, I was considering using zfs and just having a single pool. My main question is this: what is the failure mode of zfs if one of those drives either fails completely or has errors? [...]

Pooling devices in a non-redundant mode (i.e. without a raidz or mirror vdev) increases your chance of losing data, just like every other RAID system out there. However, since ZFS doesn't do concatenation (it stripes), losing one drive in a non-redundant stripe effectively corrupts the entire dataset, as virtually all files will have some portion of their data on the dead drive.

> We generally recommend a single pool, as long as the use case permits. [...]
>
> Perhaps I'm not asking my question clearly. [...] Let's say I start with a 40g drive and a 60g drive. I create a non-redundant pool (which will be 100g). At some later point, I run across an unused 30g drive, which I add to the pool. Now my pool is 130g. At some point after that, the 40g drive fails [...] What happens to my pool?

Nope. Your zpool is a stripe. As mentioned above, losing one disk in a stripe effectively destroys all data, just as with any other RAID system.

> As a side-question, does anyone have a suggestion for an intelligent way to approach this goal? [...]

To answer the original question, you _have_ to create mirrors, which, if you have odd-sized disks, will end up with unused space. An example:

Disk A: 20GB
Disk B: 30GB
Disk C: 40GB
Disk D: 60GB

Start with disks A and B:

zpool create tank mirror A B

results in a 20GB pool.
Later, add disks C and D:

zpool add tank mirror C D

This results in a 2-wide stripe of 2 mirrors, for a total pool capacity of 60GB (20GB from A+B, 40GB from C+D). 10GB of the 30GB drive and 20GB of the 60GB drive are currently unused. You can lose one drive from each pair (i.e. A and C, A and D, B and C, or B and D) before any data loss.

If you had known about the drive sizes beforehand, you could have done something like this. Partition the drives as follows:

A: 1 20GB partition
B: 1 20GB + 1 10GB partition
C: 1 40GB partition
D: 1 40GB partition + 2 10GB partitions

then you do:

zpool create tank mirror Ap0 Bp0 mirror Cp0 Dp0 mirror Bp1 Dp1

and you get a total of 70GB of space. However, the performance of this is going to be bad (as you frequently need to write to both partitions on B and D, causing head seeks), though you can still lose up to 2 drives before experiencing data loss.

--
Erik Trimble, Java System Support, Mailstop: usca14-102, Phone: x17195
Santa Clara, CA, Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Suggestions on 30 drive configuration?
On 28/06/2007, at 12:29 AM, Victor Latushkin wrote:
> It is not so easy to predict. ZFS will coalesce writes. A single transaction group may have many different writes in it. Also, raidz[12] is dynamic, and will use what it needs, unlike separate volume managers, which do not have any understanding of the context of the data.
>
> There is a good slide which illustrates how stripe width is selected dynamically in RAID-Z. Please see slide 13 in this slide deck: http://www.snia.org/events/past/sdc2006/zfs_File_Systems-bonwick-moore.pdf
> [...]
> Btw, I believe there's no link to this presentation on opensolaris.org, unfortunately...

Indeed. Is there any reason the presentation at http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf couldn't be updated to the one that Victor mentions?
Re: [zfs-discuss] Re: Drive Failure w/o Redundancy
Erik Trimble wrote:
> If you had known about the drive sizes beforehand, you could have done something like this. [...]
> zpool create tank mirror Ap0 Bp0 mirror Cp0 Dp0 mirror Bp1 Dp1
> and you get a total of 70GB of space. However, the performance of this is going to be bad (as you frequently need to write to both partitions on B and D, causing head seeks), though you can still lose up to 2 drives before experiencing data loss.

It is not clear to me that we can say performance will be bad for stripes on single disks. The reason is that ZFS dynamic striping does not use a fixed interleave. In other words, if I write a block of N bytes to an M-way dynamic stripe, it is not guaranteed that each device will get an I/O of size N/M. I've only done a few measurements of this, and I have not completed my analysis, but my data does not show the sort of thrashing one might expect from a fixed stripe with a small interleave.

-- richard